In The Thirty-eighth Annual Conference on Neural Information Processing Systems
1 Pith paper cites this work.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 synthesizes contrastive helpful/misleading rubric pairs from binary preferences to train cooperative generators and critical verifiers, yielding up to 6.5-point gains on RM-Bench and enabling smaller models to match larger rubric-augmented ones.