Canonical reference

Checklists are better than reward models for aligning language models. CoRR, abs/2507.18624, 2025

Canonical reference. 100% of classified citations from Pith papers cite this work as background.

10 Pith papers citing it
Background 100% of classified citations

citation-role summary

background 6

citation-polarity summary

background 6

years

2026: 8 · 2025: 2

verdicts

UNVERDICTED 10

representative citing papers

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

RVPO: Risk-Sensitive Alignment via Variance Regularization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

Reinforcement Learning with Robust Rubric Rewards

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

RLR³ extends RLVR to criterion-level rubric verification via dual execution paths, minimal exposure masking, hierarchical aggregation, and saturation mitigation, delivering 4.7-point gains over base on 15 benchmarks with Qwen3-VL-30B-A3B.
