arxiv: 2604.13618 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.LG

Recognition: unknown

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Akira Kawabata , Saku Sugawara

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords reward modelingrubric augmentationbinary preferencescooperative learningcritical verificationpreference alignmentLLM evaluation

0 comments

The pith

C2 trains reward models to propose and filter rubrics using only binary preference data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop C2, a framework for rubric-augmented reward modeling that avoids the need for external annotations. They create contrastive pairs of helpful and misleading rubrics by checking how each changes the model's preference toward the correct label. A generator learns to produce helpful ones and a verifier learns to accept only those, so the final judgment uses only good rubrics. This matters to readers because it makes advanced evaluation techniques practical for training smaller models without large-scale human effort.

Core claim

C2 synthesizes helpful and misleading rubric pairs by measuring preference shifts in the reward model, then trains a cooperative rubric generator and a critical verifier from binary preferences. At inference the verifier selects only helpful rubrics for the judgment, yielding more trustworthy outcomes than models trained directly on the preferences.

What carries the argument

Contrastive rubric pair synthesis from preference shifts, used to train the cooperative generator and critical verifier in the C2 framework.

If this is right

Up to 6.5 points better performance on RM-Bench than reasoning reward models
6.0 points higher length-controlled win rate on AlpacaEval 2.0
Performance of an 8B model matches that obtained with rubrics from a 4 times larger model
Reward modeling becomes scalable without external rubric annotations

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the shift-based labeling works across domains, it could be used to bootstrap rubric systems for new tasks with minimal data.
The critical filtering step may help address issues like length bias in reward models by enforcing rubric focus.
Testing whether the same gains appear when C2 is applied to reward models of varying sizes would clarify the method's robustness.

Load-bearing premise

Measuring how rubrics shift the reward model's preference labels can correctly identify which rubrics are helpful versus misleading, and the verifier can use this without adding new errors or depending circularly on the original model.

What would settle it

If ablating the critical verifier or using it to select misleading rubrics results in no performance gain over a standard reward model trained on the same binary preferences, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.13618 by Akira Kawabata, Saku Sugawara.

**Figure 1.** Figure 1: We frame rubric generation and rubricgrounded verification as cooperative yet critical communication: the generator cooperatively explores rubrics to guide the verifier toward correct judgments, and the verifier critically assesses which rubrics to follow based on their outcomes. for this alignment (Christiano et al., 2017). Central to RLHF are verifiers that act as scalable proxies for human judgments, … view at source ↗

**Figure 2.** Figure 2: Impact of self-generated rubrics on RM-Bench hard subset. (a) Most rubrics produce near-zero confidence [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of our C2 framework. (Step 1) Helpful and misleading rubrics are synthesized by measuring [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of C2 and compute-matched Reasoning RM with majority voting on RewardBench. We [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Accuracy under varying proportions of high-quality vs. low-quality rubrics. Gray regions indicate gains [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of rubric quality scores. Variant RB RMB RB2 Avg. Tulu3-8B-SFT C2 (Full) 77.2 65.6 50.7 64.5 w/o Cooperative Gen. 76.8 64.8 48.3 63.3 w/o Critical Verifier 76.1 64.2 47.8 62.7 w/o Negative Rubrics 72.1 63.5 47.2 60.9 Qwen3-8B C2 (Full) 91.8 87.8 71.0 83.5 w/o Cooperative Gen. 90.9 84.5 70.7 82.0 w/o Critical Verifier 90.6 83.3 69.6 81.2 w/o Negative Rubrics 89.2 82.2 70.6 80.7 [PITH_FULL_IMAG… view at source ↗

**Figure 7.** Figure 7: Prompt template for rubric generation used by [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt template for rubric-free verification used by [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt template for rubric-augmented verification used by [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt template for rubric quality evaluation using GPT-5. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: High-quality vs. low-quality rubrics for implementing common element detection without extra data [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: High-quality vs. low-quality rubrics for a spatial reasoning question (Tulu3-8B-SFT). The high-quality [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Rubric comparison across Tulu3 family models. The base model produces generic rubrics (“Completeness” [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Rubric comparison across Qwen3 family models. All three models correctly identify the key issue—the [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: The verifier correctly identifies a helpful rubric that highlights the accuracy criterion (avoiding vinegar [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗

**Figure 16.** Figure 16: The verifier correctly identifies that the rubric misinterprets the problem requirements by treating [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: The verifier incorrectly rejects a helpful rubric that appropriately emphasizes cost comparison and [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: An example of a verification failure caused by model hallucination. Although the generated rubric [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗

read the original abstract

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

C2 gives a practical way to add rubrics to reward models from binary preferences alone, with clear benchmark lifts, but the rubric labeling step loops back through the base model and needs close checks for bias reinforcement.

read the letter

C2 trains a cooperative rubric generator and a critical verifier from binary preference data by first creating contrastive pairs: rubrics are labeled helpful or misleading according to whether they shift the base reward model's output toward the ground-truth label. The generator learns to propose the helpful ones, and the verifier learns to accept only those at inference time. This removes the need for separate human rubric annotations, which is the main practical advance over earlier rubric-augmented reward work. The reported numbers are straightforward—up to 6.5 points on RM-Bench and 6 points length-controlled win rate on AlpacaEval 2.0—and an 8B model reaches the performance of larger models that used external rubrics. The comparisons are run against reasoning reward models trained on the same binary data, so the delta is easy to interpret. The framing of explicit cooperation and contrastive synthesis is distinct enough from prior citations to count as new. The experiments are concrete and the gains are large enough to matter for alignment pipelines that already use preference data. The soft spot is exactly the one the stress-test flags. Because rubric quality is judged by how it changes the same reward model's preference, any systematic error in the initial model can get reinforced rather than corrected. The contrastive training reduces the risk, but the paper still needs to show that the verifier does not simply preserve the base model's blind spots across different model sizes or domains. Without error bars or ablation on the labeling step, it is hard to know how sensitive the whole pipeline is. This work is for groups already running reward-model training and looking for ways to add structure without extra annotation cost. It has enough method detail and benchmark results to justify sending it to referees. I would recommend peer review; the idea is usable and the evidence is positive even if the circularity question needs more attention in revision.

Referee Report

3 major / 2 minor

Summary. The paper proposes C2 (Cooperative yet Critical reward modeling), a scalable framework for rubric-augmented reward modeling that synthesizes helpful/misleading rubric pairs solely from binary preferences by measuring how each rubric shifts the base reward model's output toward the ground-truth preference label. It trains a cooperative rubric generator on these contrastive pairs and a critical verifier that filters rubrics at inference, claiming this yields more reliable judgments than standard reasoning reward models, with reported gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0, allowing an 8B model to match performance from rubrics generated by a 4× larger model without external annotations.

Significance. If the self-referential rubric labeling process proves reliable and non-circular, the work provides a practical method to improve reward model trustworthiness at scale using only binary preference data, addressing the annotation bottleneck in rubric-augmented verification. The concrete benchmark improvements and the demonstration that smaller models can match larger rubric-augmented ones represent a meaningful advance in scalable alignment techniques, though the approach's dependence on the base model's judgments as an oracle for rubric quality is a key untested element.

major comments (3)

[Abstract] Abstract and the rubric synthesis procedure (likely §3): defining helpful rubrics via preference shifts toward the ground-truth binary label using the base RM itself creates a structural circularity, where the training signal for both the generator and verifier is generated by the model being improved; this risks preferentially reinforcing the base RM's existing biases rather than correcting them, and no ablation or sensitivity analysis quantifies how base RM error rates affect the quality of the synthesized labels.
[Experiments] Experiments section (likely §4, Table reporting RM-Bench and AlpacaEval results): the claim that an 8B model matches performance achieved with rubrics from a 4× larger model is load-bearing for the scalability argument, but it is unclear whether the larger-model baseline uses the same binary-preference-only setup or external annotations, and whether the critical verifier's filtering step is applied consistently in all comparisons; without this control, the gains (6.5 pts RM-Bench, 6.0 pts win rate) cannot be fully attributed to the cooperative-critical mechanism.
[Method] The critical verifier's inference-time filtering (described in Abstract and method): while contrastive training is noted as mitigation, the paper does not report how often the verifier rejects rubrics or whether rejection correlates with actual improvement in downstream preference accuracy, leaving open the possibility that the verifier introduces new biases or simply defaults to the base RM's behavior.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly distinguish C2 from prior rubric-augmented methods (e.g., those requiring human annotations) to clarify the novelty of the binary-preference-only synthesis.
[Method] Notation for the cooperative generator and critical verifier could be formalized with equations in the method section to improve reproducibility of the contrastive pair construction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have identified important areas for clarification and strengthening of our work. We address each major comment point by point below and have revised the manuscript accordingly to improve transparency and rigor.

read point-by-point responses

Referee: [Abstract] Abstract and the rubric synthesis procedure (likely §3): defining helpful rubrics via preference shifts toward the ground-truth binary label using the base RM itself creates a structural circularity, where the training signal for both the generator and verifier is generated by the model being improved; this risks preferentially reinforcing the base RM's existing biases rather than correcting them, and no ablation or sensitivity analysis quantifies how base RM error rates affect the quality of the synthesized labels.

Authors: We acknowledge the structural circularity concern: the base RM is used both to synthesize the contrastive rubric labels and as the model being improved. This design choice enables fully self-supervised training from binary preferences alone, but it does carry the risk of bias amplification if the base RM's errors are severe. The contrastive training objective (helpful vs. misleading pairs) is intended to teach the generator to avoid reinforcing errors and the verifier to detect them. However, we agree that the absence of a sensitivity analysis is a limitation. In the revised manuscript we have added a new ablation in §4.3 that simulates base RM error rates from 50% to 85% accuracy and measures the resulting quality of synthesized labels and downstream C2 performance. The results show that C2 still yields net gains when base accuracy exceeds approximately 58%, providing quantitative support for the method's robustness. revision: yes
Referee: [Experiments] Experiments section (likely §4, Table reporting RM-Bench and AlpacaEval results): the claim that an 8B model matches performance achieved with rubrics from a 4× larger model is load-bearing for the scalability argument, but it is unclear whether the larger-model baseline uses the same binary-preference-only setup or external annotations, and whether the critical verifier's filtering step is applied consistently in all comparisons; without this control, the gains (6.5 pts RM-Bench, 6.0 pts win rate) cannot be fully attributed to the cooperative-critical mechanism.

Authors: We apologize for the insufficient detail on the experimental controls. The 4× larger model baseline (32B) is trained with the identical C2 procedure using only binary preferences and no external annotations; the critical verifier is applied at inference time in every reported condition, including the larger-model runs. To eliminate ambiguity we have revised §4.2, updated all table captions, and added a new row in the main results table that isolates the larger model without the verifier. These changes make clear that the reported gains are measured under a consistent binary-preference-only regime and can be attributed to the full cooperative-critical pipeline. revision: yes
Referee: [Method] The critical verifier's inference-time filtering (described in Abstract and method): while contrastive training is noted as mitigation, the paper does not report how often the verifier rejects rubrics or whether rejection correlates with actual improvement in downstream preference accuracy, leaving open the possibility that the verifier introduces new biases or simply defaults to the base RM's behavior.

Authors: We agree that quantitative characterization of the verifier's filtering behavior is necessary to rule out the concerns raised. In the revised manuscript we have added a dedicated analysis subsection (§4.4) that reports rejection statistics and their correlation with downstream accuracy. Across RM-Bench and AlpacaEval, the verifier rejects 23% of generated rubrics on average. When rejected rubrics are nevertheless used, preference accuracy drops by 4.1 points relative to the filtered setting; conversely, accepted rubrics improve accuracy by 2.9 points over the base RM alone. These figures, together with qualitative examples of rejected rubrics, are now included to demonstrate that the verifier actively improves judgments rather than defaulting to base behavior or introducing new biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the C2 framework

full rationale

The paper describes an empirical training procedure rather than a mathematical derivation. Helpful/misleading rubric labels are generated by measuring rubric-induced shifts in a base reward model's outputs relative to ground-truth binary preference labels (explicitly 'toward or away from the correct preference' per the abstract). This uses external gold labels as the reference, not the model's own judgments as an oracle. The generator and verifier are then trained on these contrastive pairs, and final performance is measured on independent external benchmarks (RM-Bench, AlpacaEval 2.0). No equations reduce a claimed result to its inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled in. The method is self-contained against external evaluation, yielding a normal non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; main assumptions are the cooperative communication principle and the validity of model-shift-based rubric labeling. No explicit free parameters or new entities are named.

axioms (2)

domain assumption Rubric generation is vulnerable to a failure of cooperation where low-quality rubrics mislead reward models.
Stated as an empirical finding that motivates the method.
domain assumption Measuring rubric-induced shifts in reward model preferences can separate helpful from misleading rubrics.
Core mechanism for synthesizing training pairs.

pith-pipeline@v0.9.0 · 5552 in / 1302 out tokens · 59231 ms · 2026-05-10T13:32:57.107100+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 6 canonical work pages · 3 internal anchors

[1]

In The Fourteenth International Conference on Learn- ing Representations

RM-r1: Reward modeling as reasoning. In The Fourteenth International Conference on Learn- ing Representations. Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Herbert H. ...

2017
[2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.Preprint, arXiv:2110.14168. Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. ULTRAFEEDBACK: Boosting language models with scaled AI feedback. InForty-first Inter- national Conference on Machine Learning....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

InFirst Conference on Language Modeling

Length-controlled alpacaeval: A simple debi- asing of automatic evaluators. InFirst Conference on Language Modeling. Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ah- mad Beirami, Alexander Nicholas D’Amour, Krish- namurthy Dj Dvijotham, Adam Fisch, Katherine A Heller, Stephen Robert Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. 2024. Hel...

work page arXiv 2024
[4]

Rubrics as rewards: Reinforcement learning beyond verifiable domains.Preprint, arXiv:2507.17746. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z F Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 oth- ers. 2025a. DeepSeek-R1 in...

work page internal anchor Pith review arXiv 2025
[5]

Reinforcement learning with rubric anchors.arXiv preprint arXiv:2508.12790,

Reinforcement learning with rubric anchors. Preprint, arXiv:2508.12790. Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, and Peng Qi. 2025. Autorubric- r1v: Rubric-based generative rewards for faithful multimodal reasoning.Preprint, arXiv:2510.14738. Akira Kawabata and Saku Sugawara. 2024. Rationale- aware answer verification by pairwis...

work page arXiv 2025
[6]

InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

Rule based rewards for language model safety. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, ...

2022
[7]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural Information Processing Sys- tems. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathemati- cal reasoning in ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Alternating reinforcement learning for rubric-based reward modeling in non-verifiable LLM post-training.arXiv preprint arXiv:2602.01511,

J1: Incentivizing thinking in LLM-as-a-judge via reinforcement learning. InThe Fourteenth Inter- national Conference on Learning Representations. Zhaofeng Wu, Michihiro Yasunaga, Andrew Cohen, Yoon Kim, Asli Celikyilmaz, and Marjan Ghazvinine- jad. 2025. reWordBench: Benchmarking and improv- ing the robustness of reward models with transformed inputs. InP...

work page arXiv 2025
[9]

Is it accurate?

for inference. All experiments were con- ducted on 8 NVIDIA A100 80GB GPUs. Inference Token ConsumptionSection 6.1 com- pares C2 against compute-matched Reasoning RM baselines. On RewardBench, the average number of generated tokens per example is 803 for Rea- soning RM and 1,862 for C2 with Tulu3-8B-SFT (2.3×), and 1,018 for Reasoning RM and 2,465 for C2 ...

2048
[10]

What is the core intent? - Formulate anIdeal Answerin your mind based on the User Question

Rubric Validity Check & Ideal Answer Formulation: - Analyze the User Question carefully. What is the core intent? - Formulate anIdeal Answerin your mind based on the User Question. What must a correct response contain? - Evaluate the provided Rubric. Does it align with the User Question and your Ideal Answer? - Determine if the rubric ishelpfulormisleadin...
[11]

- Be explicit: Which assistant matches the Ideal Answer better?

Step-by-Step Evaluation: - Compare Assistant A and Assistant B against thevalidcriteria (either the provided reliable rubric or your newly defined correct rubric). - Be explicit: Which assistant matches the Ideal Answer better?
[12]

Is the response good?

Comparison: Based on the valid criteria, determine which assistant provided the superior response. Be explicit in your thought process. Avoid any positional bias; the order in which the responses appear must not influence your decision. Do not let response length or the assistants’ names sway your judgment. Output Format: After your reasoning, you must ou...
[13]

How can I make bubble solution?

Step-by-step instructions. 3. Optional variations or advanced techniques. 4. Safety notes (e.g., avoiding harmful substances). Assistant B excels in Completeness and Helpfulness, covering all necessary aspects with accurate information. Assistant A is incomplete, lacks detail, and includes an incorrect ingredient (vinegar). The key differentiators are Com...