Recognition: unknown
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3
The pith
C2 trains reward models to propose and filter rubrics using only binary preference data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
C2 synthesizes helpful and misleading rubric pairs by measuring preference shifts in the reward model, then trains a cooperative rubric generator and a critical verifier from binary preferences. At inference the verifier selects only helpful rubrics for the judgment, yielding more trustworthy outcomes than models trained directly on the preferences.
What carries the argument
Contrastive rubric pair synthesis from preference shifts, used to train the cooperative generator and critical verifier in the C2 framework.
If this is right
- Up to 6.5 points better performance on RM-Bench than reasoning reward models
- 6.0 points higher length-controlled win rate on AlpacaEval 2.0
- Performance of an 8B model matches that obtained with rubrics from a 4 times larger model
- Reward modeling becomes scalable without external rubric annotations
Where Pith is reading between the lines
- If the shift-based labeling works across domains, it could be used to bootstrap rubric systems for new tasks with minimal data.
- The critical filtering step may help address issues like length bias in reward models by enforcing rubric focus.
- Testing whether the same gains appear when C2 is applied to reward models of varying sizes would clarify the method's robustness.
Load-bearing premise
Measuring how rubrics shift the reward model's preference labels can correctly identify which rubrics are helpful versus misleading, and the verifier can use this without adding new errors or depending circularly on the original model.
What would settle it
If ablating the critical verifier or using it to select misleading rubrics results in no performance gain over a standard reward model trained on the same binary preferences, the central claim would be falsified.
Figures
read the original abstract
Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes C2 (Cooperative yet Critical reward modeling), a scalable framework for rubric-augmented reward modeling that synthesizes helpful/misleading rubric pairs solely from binary preferences by measuring how each rubric shifts the base reward model's output toward the ground-truth preference label. It trains a cooperative rubric generator on these contrastive pairs and a critical verifier that filters rubrics at inference, claiming this yields more reliable judgments than standard reasoning reward models, with reported gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0, allowing an 8B model to match performance from rubrics generated by a 4× larger model without external annotations.
Significance. If the self-referential rubric labeling process proves reliable and non-circular, the work provides a practical method to improve reward model trustworthiness at scale using only binary preference data, addressing the annotation bottleneck in rubric-augmented verification. The concrete benchmark improvements and the demonstration that smaller models can match larger rubric-augmented ones represent a meaningful advance in scalable alignment techniques, though the approach's dependence on the base model's judgments as an oracle for rubric quality is a key untested element.
major comments (3)
- [Abstract] Abstract and the rubric synthesis procedure (likely §3): defining helpful rubrics via preference shifts toward the ground-truth binary label using the base RM itself creates a structural circularity, where the training signal for both the generator and verifier is generated by the model being improved; this risks preferentially reinforcing the base RM's existing biases rather than correcting them, and no ablation or sensitivity analysis quantifies how base RM error rates affect the quality of the synthesized labels.
- [Experiments] Experiments section (likely §4, Table reporting RM-Bench and AlpacaEval results): the claim that an 8B model matches performance achieved with rubrics from a 4× larger model is load-bearing for the scalability argument, but it is unclear whether the larger-model baseline uses the same binary-preference-only setup or external annotations, and whether the critical verifier's filtering step is applied consistently in all comparisons; without this control, the gains (6.5 pts RM-Bench, 6.0 pts win rate) cannot be fully attributed to the cooperative-critical mechanism.
- [Method] The critical verifier's inference-time filtering (described in Abstract and method): while contrastive training is noted as mitigation, the paper does not report how often the verifier rejects rubrics or whether rejection correlates with actual improvement in downstream preference accuracy, leaving open the possibility that the verifier introduces new biases or simply defaults to the base RM's behavior.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly distinguish C2 from prior rubric-augmented methods (e.g., those requiring human annotations) to clarify the novelty of the binary-preference-only synthesis.
- [Method] Notation for the cooperative generator and critical verifier could be formalized with equations in the method section to improve reproducibility of the contrastive pair construction.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have identified important areas for clarification and strengthening of our work. We address each major comment point by point below and have revised the manuscript accordingly to improve transparency and rigor.
read point-by-point responses
-
Referee: [Abstract] Abstract and the rubric synthesis procedure (likely §3): defining helpful rubrics via preference shifts toward the ground-truth binary label using the base RM itself creates a structural circularity, where the training signal for both the generator and verifier is generated by the model being improved; this risks preferentially reinforcing the base RM's existing biases rather than correcting them, and no ablation or sensitivity analysis quantifies how base RM error rates affect the quality of the synthesized labels.
Authors: We acknowledge the structural circularity concern: the base RM is used both to synthesize the contrastive rubric labels and as the model being improved. This design choice enables fully self-supervised training from binary preferences alone, but it does carry the risk of bias amplification if the base RM's errors are severe. The contrastive training objective (helpful vs. misleading pairs) is intended to teach the generator to avoid reinforcing errors and the verifier to detect them. However, we agree that the absence of a sensitivity analysis is a limitation. In the revised manuscript we have added a new ablation in §4.3 that simulates base RM error rates from 50% to 85% accuracy and measures the resulting quality of synthesized labels and downstream C2 performance. The results show that C2 still yields net gains when base accuracy exceeds approximately 58%, providing quantitative support for the method's robustness. revision: yes
-
Referee: [Experiments] Experiments section (likely §4, Table reporting RM-Bench and AlpacaEval results): the claim that an 8B model matches performance achieved with rubrics from a 4× larger model is load-bearing for the scalability argument, but it is unclear whether the larger-model baseline uses the same binary-preference-only setup or external annotations, and whether the critical verifier's filtering step is applied consistently in all comparisons; without this control, the gains (6.5 pts RM-Bench, 6.0 pts win rate) cannot be fully attributed to the cooperative-critical mechanism.
Authors: We apologize for the insufficient detail on the experimental controls. The 4× larger model baseline (32B) is trained with the identical C2 procedure using only binary preferences and no external annotations; the critical verifier is applied at inference time in every reported condition, including the larger-model runs. To eliminate ambiguity we have revised §4.2, updated all table captions, and added a new row in the main results table that isolates the larger model without the verifier. These changes make clear that the reported gains are measured under a consistent binary-preference-only regime and can be attributed to the full cooperative-critical pipeline. revision: yes
-
Referee: [Method] The critical verifier's inference-time filtering (described in Abstract and method): while contrastive training is noted as mitigation, the paper does not report how often the verifier rejects rubrics or whether rejection correlates with actual improvement in downstream preference accuracy, leaving open the possibility that the verifier introduces new biases or simply defaults to the base RM's behavior.
Authors: We agree that quantitative characterization of the verifier's filtering behavior is necessary to rule out the concerns raised. In the revised manuscript we have added a dedicated analysis subsection (§4.4) that reports rejection statistics and their correlation with downstream accuracy. Across RM-Bench and AlpacaEval, the verifier rejects 23% of generated rubrics on average. When rejected rubrics are nevertheless used, preference accuracy drops by 4.1 points relative to the filtered setting; conversely, accepted rubrics improve accuracy by 2.9 points over the base RM alone. These figures, together with qualitative examples of rejected rubrics, are now included to demonstrate that the verifier actively improves judgments rather than defaulting to base behavior or introducing new biases. revision: yes
Circularity Check
No significant circularity in the C2 framework
full rationale
The paper describes an empirical training procedure rather than a mathematical derivation. Helpful/misleading rubric labels are generated by measuring rubric-induced shifts in a base reward model's outputs relative to ground-truth binary preference labels (explicitly 'toward or away from the correct preference' per the abstract). This uses external gold labels as the reference, not the model's own judgments as an oracle. The generator and verifier are then trained on these contrastive pairs, and final performance is measured on independent external benchmarks (RM-Bench, AlpacaEval 2.0). No equations reduce a claimed result to its inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled in. The method is self-contained against external evaluation, yielding a normal non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Rubric generation is vulnerable to a failure of cooperation where low-quality rubrics mislead reward models.
- domain assumption Measuring rubric-induced shifts in reward model preferences can separate helpful from misleading rubrics.
Reference graph
Works this paper leans on
-
[1]
In The Fourteenth International Conference on Learn- ing Representations
RM-r1: Reward modeling as reasoning. In The Fourteenth International Conference on Learn- ing Representations. Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Herbert H. ...
2017
-
[2]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word prob- lems.Preprint, arXiv:2110.14168. Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. ULTRAFEEDBACK: Boosting language models with scaled AI feedback. InForty-first Inter- national Conference on Machine Learning....
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
InFirst Conference on Language Modeling
Length-controlled alpacaeval: A simple debi- asing of automatic evaluators. InFirst Conference on Language Modeling. Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ah- mad Beirami, Alexander Nicholas D’Amour, Krish- namurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. 2024. Hel...
-
[4]
Rubrics as rewards: Reinforcement learning beyond verifiable domains.Preprint, arXiv:2507.17746. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z F Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 oth- ers. 2025a. DeepSeek-R1 in...
work page internal anchor Pith review arXiv 2025
-
[5]
Reinforcement learning with rubric anchors.arXiv preprint arXiv:2508.12790,
Reinforcement learning with rubric anchors. Preprint, arXiv:2508.12790. Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. 2025. Autorubric- r1v: Rubric-based generative rewards for faithful multimodal reasoning.Preprint, arXiv:2510.14738. Akira Kawabata and Saku Sugawara. 2024. Rationale- aware answer verification by pairwis...
-
[6]
InThe Thirty-eighth Annual Conference on Neural Information Processing Systems
Rule based rewards for language model safety. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, ...
2022
-
[7]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural Information Processing Sys- tems. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathemati- cal reasoning in ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning. InThe Fourteenth Inter- national Conference on Learning Representations. Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, and Marjan Ghazvinine- jad. 2025. reWordBench: Benchmarking and improv- ing the robustness of reward models with transformed inputs. InP...
-
[9]
Is it accurate?
for inference. All experiments were con- ducted on 8 NVIDIA A100 80GB GPUs. Inference Token ConsumptionSection 6.1 com- pares C2 against compute-matched Reasoning RM baselines. On RewardBench, the average number of generated tokens per example is 803 for Rea- soning RM and 1,862 for C2 with Tulu3-8B-SFT (2.3×), and 1,018 for Reasoning RM and 2,465 for C2 ...
2048
-
[10]
What is the core intent? - Formulate anIdeal Answerin your mind based on the User Question
Rubric Validity Check & Ideal Answer Formulation: - Analyze the User Question carefully. What is the core intent? - Formulate anIdeal Answerin your mind based on the User Question. What must a correct response contain? - Evaluate the provided Rubric. Does it align with the User Question and your Ideal Answer? - Determine if the rubric ishelpfulormisleadin...
-
[11]
- Be explicit: Which assistant matches the Ideal Answer better?
Step-by-Step Evaluation: - Compare Assistant A and Assistant B against thevalidcriteria (either the provided reliable rubric or your newly defined correct rubric). - Be explicit: Which assistant matches the Ideal Answer better?
-
[12]
Is the response good?
Comparison: Based on the valid criteria, determine which assistant provided the superior response. Be explicit in your thought process. Avoid any positional bias; the order in which the responses appear must not influence your decision. Do not let response length or the assistants’ names sway your judgment. Output Format: After your reasoning, you must ou...
-
[13]
How can I make bubble solution?
Step-by-step instructions. 3. Optional variations or advanced techniques. 4. Safety notes (e.g., avoiding harmful substances). Assistant B excels in Completeness and Helpfulness, covering all necessary aspects with accurate information. Assistant A is incomplete, lacks detail, and includes an incorrect ingredient (vinegar). The key differentiators are Com...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.