hub Canonical reference

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh · 2025 · cs.AI · arXiv 2506.10947

Canonical reference. 80% of citing Pith papers cite this work as background.

32 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 32 citing papers arXiv PDF

abstract

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 4 unclear 1

representative citing papers

Reasoning with Sampling: Cutting at Decision Points

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Entropy-Cut Metropolis-Hastings targets high-entropy decision points for resampling, yielding mixing time that scales with the number of decisions and consistent gains over baselines on MATH500, HumanEval, GPQA Diamond, and AIME26.

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

cs.LG · 2026-05-14 · conditional · novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

cs.LG · 2026-04-19 · accept · novelty 7.0

The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

cs.CL · 2025-09-22 · conditional · novelty 7.0

A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.

Consolidating Rewarded Perturbations for LLM Post-Training

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.

Label-Free Reinforcement Learning via Cross-Model Entropy

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

cs.LG · 2026-04-26 · conditional · novelty 6.0

Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

ThetaEvolve: Test-time Learning on Open Problems

cs.LG · 2025-11-28 · conditional · novelty 6.0

ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards

cs.CR · 2025-11-18 · unverdicted · novelty 6.0

DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

cs.CV · 2025-07-01 · unverdicted · novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

One-shot GRPO on a single biased example induces generalizing stereotype bias in post-trained LLMs, with susceptibility varying by initial bias likelihood.

citing papers explorer

Showing 8 of 8 citing papers after filters.

The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies cs.CL · 2025-09-22 · conditional · none · ref 33 · internal anchor
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
ThetaEvolve: Test-time Learning on Open Problems cs.LG · 2025-11-28 · conditional · none · ref 1 · internal anchor
ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards cs.CR · 2025-11-18 · unverdicted · none · ref 37 · internal anchor
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 41 · internal anchor
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
What Is Preference Optimization Doing, and Why? cs.LG · 2025-11-30 · unverdicted · none · ref 8 · internal anchor
Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.
A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning cs.LG · 2025-10-21 · unverdicted · none · ref 16 · 2 links · internal anchor
SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.
Self-Rewarding Vision-Language Model via Reasoning Decomposition cs.CV · 2025-08-27 · unverdicted · none · ref 16 · internal anchor
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
PRL: Prompts from Reinforcement Learning cs.AI · 2025-05-20 · unverdicted · none · ref 5 · internal anchor
PRL is a reinforcement learning method that generates novel prompts and achieves state-of-the-art results on text classification, simplification, and summarization benchmarks, outperforming APE and EvoPrompt.

Spurious Rewards: Rethinking Training Signals in RLVR

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer