Spurious Rewards: Rethinking Training Signals in RLVR
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-16 13:33 UTC · model grok-4.3
The pith
Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLVR with GRPO elicits strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or negative correlation with correctness. On Qwen2.5-Math-7B, random rewards deliver a 21.4 percentage point gain on MATH-500, nearly matching the 29.1-point gain from ground-truth rewards. GRPO's clipping bias amplifies high-prior pretraining behaviors, one example being code reasoning whose frequency rises from 65 percent to over 90 percent. The presence of such amplifiable behaviors is model-dependent, and spurious rewards effective for Qwen models typically fail on Llama3 and OLMo2.
What carries the argument
The clipping bias from the clip term in GRPO, which amplifies high-prior pretraining behaviors without requiring informative rewards.
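As a sketch of the mechanism being named here, in standard GRPO notation that is assumed rather than quoted from the paper, the clipped surrogate for a group of G rollouts o_1..o_G on prompt q with rewards R_1..R_G is:

```latex
% Standard GRPO-style clipped surrogate (notation assumed, not quoted from the paper).
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
      \min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;
                 \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad
\hat{A}_i = \frac{R_i - \mathrm{mean}(R_{1..G})}{\mathrm{std}(R_{1..G})},
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}.
```

Under random rewards the group-normalized advantages are noise, but the clip truncates updates asymmetrically with respect to how far the new policy has drifted from the rollout policy, which the paper argues biases the update toward tokens the pretrained model already assigns high probability, i.e., toward high-prior behaviors such as code reasoning.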
If this is right
- Large apparent gains in RLVR can occur without any reward signal that tracks correctness when the model possesses amplifiable pretraining behaviors.
- RLVR methods must be tested across multiple model families rather than on a single de facto choice such as Qwen.
- Code-reasoning frequency in Qwen models rises sharply under spurious rewards even though no code execution occurs.
- Spurious rewards that succeed on Qwen models produce no gains on Llama3 or OLMo2, showing the effect does not generalize.
Where Pith is reading between the lines
- Algorithms sharing a similar clipping mechanism may produce misleading capability estimates whenever pretraining has already installed useful priors.
- Pretraining data composition could determine which models are susceptible to spurious-reward training, suggesting a need to audit training corpora for such priors.
- Developers might design reward or loss terms that actively suppress amplification of pretraining behaviors to isolate genuine post-training improvements.
Load-bearing premise
The performance gains with spurious rewards are driven primarily by the clipping bias in GRPO rather than other unaccounted factors in training or model-specific quirks.
What would settle it
Training the same Qwen2.5-Math-7B model with a modified GRPO that removes or neutralizes the clip term and checking whether the 21-point gain from random rewards disappears.
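A minimal sketch of that ablation's loss, assuming a standard PyTorch setup; function and variable names here are hypothetical and not taken from the paper's code:

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, eps=0.2, use_clip=True):
    """Per-token GRPO-style surrogate loss (negated objective, to be minimized).

    logp_new, logp_old: log-probabilities of the sampled tokens under the current
        and rollout policies, shape (num_tokens,).
    advantages: the group-normalized advantage of each token's rollout.
    use_clip=False is the hypothetical ablation: a plain importance-weighted
        policy gradient with no clipping, everything else left unchanged.
    """
    ratio = torch.exp(logp_new - logp_old)              # r_t(theta)
    unclipped = ratio * advantages
    if not use_clip:
        return -unclipped.mean()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # clipped surrogate
```

If the load-bearing premise holds, rerunning the random-reward experiment with use_clip=False should make most of the 21-point gain disappear, and the code-reasoning frequency should stay near its pretraining baseline.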
Original abstract
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper shows that RLVR with GRPO elicits large MATH-500 gains (21.4 pp on Qwen2.5-Math-7B) even under randomly assigned or negatively correlated rewards, nearly matching ground-truth reward gains (29.1 pp). It attributes the effect to a clipping bias in GRPO that amplifies pretraining priors such as code reasoning (frequency rising from 65% to >90%), demonstrates that the phenomenon is model-dependent (absent in Llama3 and OLMo2), and concludes that RLVR results must be validated across model families rather than on a single de facto choice.
Significance. If the empirical observation holds, the work identifies a previously under-appreciated optimization artifact in GRPO that can produce large apparent capability gains without informative rewards. The model-dependence result and the explicit call for cross-family validation are valuable contributions to the RLVR literature, especially given the growing reliance on GRPO-style methods.
major comments (1)
- [explanation of clipping bias and GRPO update] The causal claim that clipping bias is the dominant driver of the 21.4 pp gain under random rewards (abstract and explanation section) rests on an untested assumption. No ablation is reported that removes or relaxes only the clip term while keeping advantage normalization, group-relative baseline, and sampling dynamics fixed; without this isolation, other GRPO components cannot be ruled out as the source of the observed behavioral shift toward code-reasoning.
minor comments (2)
- [case study on code reasoning] The exact numerical value and measurement protocol for the >90% code-reasoning frequency should be stated in the main text rather than left as an inequality; likewise, the precise definition of 'randomly assigned rewards' (e.g., uniform sampling over answer tokens or fixed random labels) needs an explicit equation or pseudocode; one plausible reading is sketched after this list.
- [main results] Table or figure reporting the MATH-500 scores should include standard deviations across seeds or runs to allow assessment of the stability of the 21.4 pp and 29.1 pp deltas.
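The abstract says only that rewards were 'randomly assigned'; the exact protocol is not restated on this page. As a hedged illustration of the ambiguity noted above, one common reading is an i.i.d. coin-flip reward that ignores the rollout entirely:

```python
import random

def random_reward(rollout: str, p: float = 0.5) -> float:
    """One plausible reading of 'randomly assigned rewards': an i.i.d. coin flip
    per rollout, independent of the model's output. The paper may instead fix a
    random answer label per prompt or sample uniformly over answer tokens; this
    sketch is illustrative only, not the paper's protocol."""
    del rollout  # the reward is uninformative by construction
    return 1.0 if random.random() < p else 0.0
```

The distinction matters mechanically: a per-rollout coin flip keeps nonzero reward variance inside each GRPO group, whereas a single reward value fixed per prompt and reused for every rollout in its group would (up to zero-variance handling) zero out the group-relative advantage, so the two readings train very differently.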
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for the constructive feedback on strengthening the causal analysis. We address the major comment below and commit to revisions that will further isolate the clipping mechanism.
Point-by-point responses
Referee: The causal claim that clipping bias is the dominant driver of the 21.4 pp gain under random rewards (abstract and explanation section) rests on an untested assumption. No ablation is reported that removes or relaxes only the clip term while keeping advantage normalization, group-relative baseline, and sampling dynamics fixed; without this isolation, other GRPO components cannot be ruled out as the source of the observed behavioral shift toward code-reasoning.
Authors: We appreciate the referee's emphasis on isolating the clip term. Our current attribution relies on the explicit form of the GRPO surrogate objective (where the clip term creates an asymmetric update favoring high-prior tokens) together with the observed code-reasoning frequency shift and the absence of similar gains under PPO-style clipping in related experiments. Nevertheless, we agree that a direct ablation (removing or relaxing only the clip term while freezing advantage normalization, group-relative baselines, and sampling) would provide stronger causal evidence. In the revised manuscript we will add this ablation, reporting MATH-500 performance and code-reasoning rates under random rewards for both the clipped and unclipped GRPO variants on Qwen2.5-Math-7B.
Revision: yes
Circularity Check
No circularity: empirical results rest on direct measurements
full rationale
The paper reports concrete experimental outcomes—21.4 pp MATH-500 gain under random rewards versus 29.1 pp under ground-truth rewards for Qwen2.5-Math-7B, plus the measured rise in code-reasoning frequency from 65% to >90%—obtained by running GRPO training and counting observable behaviors. These quantities are computed from held-out test sets and token-level traces, not derived from fitted parameters or self-referential definitions. The clipping-bias explanation is offered as a post-hoc interpretation of the same runs rather than a load-bearing premise that reduces to prior self-citations or ansatzes. No equations or uniqueness theorems are invoked that collapse back to the input data by construction.
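The 65% to over 90% code-reasoning figures discussed above are counts of an observable behavior in generation traces. The detection protocol is not restated on this page; a hypothetical heuristic for such a count, assuming traces are plain strings and Python-looking structure is the marker of interest, might look like:

```python
import re

# Hypothetical heuristic, not the paper's protocol: flag a trace as
# "code reasoning" if it contains Python-looking structure.
_CODE_PATTERN = re.compile(
    r"^\s*(def |import |while |for .+ in .+:|print\()", re.MULTILINE
)

def code_reasoning_rate(traces):
    """Fraction of generation traces that contain code-like reasoning."""
    flagged = sum(1 for t in traces if _CODE_PATTERN.search(t))
    return flagged / max(len(traces), 1)
```

Any real measurement would need to match the paper's definition of code reasoning (reasoning in code without actually executing it); the sketch is only meant to make the 'direct measurement' claim concrete.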
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) The GRPO algorithm and its clipping mechanism, as previously defined in prior work.
Forward citations
Cited by 20 Pith papers
- FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale. FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning. ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
- A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions. The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning. This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Reward Hacking in Rubric-Based Reinforcement Learning. Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...
- Hölder Policy Optimisation. HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
- Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models. AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models. Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation. The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
- The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling. APPS approximates power sampling for LLM reasoning via parallel particle propagation with future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs on benchmarks.
- The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling. APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training...
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning. ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning. Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment. HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Characterizing Model-Native Skills. Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution. Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
- ThetaEvolve: Test-time Learning on Open Problems. ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- Beyond Distribution Sharpening: The Importance of Task Rewards. Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning. PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.