hub

Generative reward models

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fr¨anken, Chelsea Finn, Alon Albalak · 2024 · arXiv 2410.12832

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

cs.LG · 2026-05-20 · conditional · novelty 7.0

VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.

RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

RMGAP benchmark shows state-of-the-art reward models reach at most 49.27% Best-of-N accuracy when forced to select responses matching diverse preferences.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

cs.LG · 2026-04-12 · unverdicted · novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

cs.LG · 2026-03-13 · unverdicted · novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

RationalRewards recovers rationales from preference data via PARROT to create a critique-first reward model that improves visual generators at both training time through RL and test time through prompt refinement, matching RL fine-tuning performance while using far less data.

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.

ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework

cs.AI · 2026-04-08 · unverdicted · novelty 6.0

ReflectRM improves generative reward models by adding self-reflection on analysis quality within a unified training setup for response and analysis preferences, yielding accuracy gains and reduced positional bias on benchmarks.

UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

UniCreative uses reference-free RL with an adaptive constraint-aware reward model to unify long-form coherence and short-form creativity in AI writing, producing an emergent ability to switch between planning and direct generation.

PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

cs.LG · 2025-10-28 · unverdicted · novelty 6.0

PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

cs.LG · 2025-08-27 · conditional · novelty 6.0

GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

cs.AI · 2026-04-08 · unverdicted · novelty 5.0

ConsistRM improves generative reward models via consistency-aware self-training, outperforming vanilla RFT by 1.5% on average across five benchmarks and four base models.

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

cs.LG · 2025-10-16 · unverdicted · novelty 5.0

GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 15 of 15 citing papers.

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering cs.LG · 2026-05-20 · conditional · none · ref 12
VerifySteer selectively steers hidden states at paragraph boundaries using latent correctness signals to control verifier strictness and outperform baselines on ProcessBench and Hard2Verify with lower compute.
RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences cs.CL · 2026-05-03 · unverdicted · none · ref 2
RMGAP benchmark shows state-of-the-art reward models reach at most 49.27% Best-of-N accuracy when forced to select responses matching diverse preferences.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning cs.LG · 2026-04-12 · unverdicted · none · ref 32
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents cs.LG · 2026-03-13 · unverdicted · none · ref 18
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time cs.AI · 2026-04-13 · unverdicted · none · ref 1
RationalRewards recovers rationales from preference data via PARROT to create a critique-first reward model that improves visual generators at both training time through RL and test time through prompt refinement, matching RL fine-tuning performance while using far less data.
Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation cs.CL · 2026-04-13 · unverdicted · none · ref 3
MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.
ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework cs.AI · 2026-04-08 · unverdicted · none · ref 2
ReflectRM improves generative reward models by adding self-reflection on analysis quality within a unified training setup for response and analysis preferences, yielding accuracy gains and reduced positional bias on benchmarks.
UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning cs.AI · 2026-04-07 · unverdicted · none · ref 2
UniCreative uses reference-free RL with an adaptive constraint-aware reward model to unify long-form coherence and short-form creativity in AI writing, producing an emergent ability to switch between planning and direct generation.
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling cs.LG · 2025-10-28 · unverdicted · none · ref 12
PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs cs.LG · 2025-08-27 · conditional · none · ref 23
GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
RewardBench 2: Advancing Reward Model Evaluation cs.CL · 2025-06-02 · unverdicted · none · ref 60
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 133
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training cs.AI · 2026-04-08 · unverdicted · none · ref 3
ConsistRM improves generative reward models via consistency-aware self-training, outperforming vanilla RFT by 1.5% on average across five benchmarks and four base models.
Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models cs.LG · 2025-10-16 · unverdicted · none · ref 14
GenCluster scales test-time compute via large-scale generation, behavioral clustering, ranking, and round-robin submission to achieve IOI gold medal performance with the open-weight gpt-oss-120b model.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 87
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Generative reward models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer