hub

RewardBench 2: Advancing Reward Model Evaluation

· 2025 · cs.CL · arXiv 2506.01937

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

open full Pith review browse 16 citing papers arXiv PDF

abstract

Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 2

citation-polarity summary

background 3 use dataset 1

representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

stat.ME · 2026-05-10 · unverdicted · novelty 6.0

Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimization stage.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning

cs.LG · 2026-04-17 · unverdicted · novelty 5.0

PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unverdicted · novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

On Cost-Effective LLM-as-a-Judge Improvement Techniques

cs.CL · 2026-04-15

citing papers explorer

Showing 16 of 16 citing papers.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 53 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance cs.CL · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass cs.CV · 2026-04-13 · unverdicted · none · ref 5 · internal anchor
A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment cs.CL · 2026-05-17 · unverdicted · none · ref 86 · internal anchor
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge cs.AI · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges stat.ME · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV · 2026-05-08 · unverdicted · none · ref 21 · internal anchor
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 47 · internal anchor
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization cs.CL · 2026-04-08 · unverdicted · none · ref 10 · internal anchor
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
Reflective Context Learning: Studying the Optimization Primitives of Context Space cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness cs.LG · 2026-04-03 · unverdicted · none · ref 2 · internal anchor
SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimization stage.
Robust Policy Optimization to Prevent Catastrophic Forgetting cs.LG · 2026-02-09 · unverdicted · none · ref 33 · internal anchor
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning cs.LG · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 89 · internal anchor
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
On Cost-Effective LLM-as-a-Judge Improvement Techniques cs.CL · 2026-04-15 · unreviewed · ref 10 · internal anchor

RewardBench 2: Advancing Reward Model Evaluation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer