Spurious Rewards: Rethinking Training Signals in RLVR
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-16 13:33 UTC · model grok-4.3
The pith
Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLVR with GRPO elicits strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or negative correlation with correctness. On Qwen2.5-Math-7B, random rewards deliver a 21.4 percentage point gain on MATH-500, nearly matching the 29.1-point gain from ground-truth rewards. GRPO's clipping bias amplifies high-prior pretraining behaviors, one example being code reasoning whose frequency rises from 65 percent to over 90 percent. The presence of such amplifiable behaviors is model-dependent, and spurious rewards effective for Qwen models typically fail on Llama3 and OLMo2.
What carries the argument
The clipping bias from the clip term in GRPO, which amplifies high-prior pretraining behaviors without requiring informative rewards.
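As a sketch of the mechanism being named here, in standard GRPO notation that is assumed rather than quoted from the paper, the clipped surrogate for a group of G rollouts o_1..o_G on prompt q with rewards R_1..R_G is:

```latex
% Standard GRPO-style clipped surrogate (notation assumed, not quoted from the paper).
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
      \min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;
                 \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad
\hat{A}_i = \frac{R_i - \mathrm{mean}(R_{1..G})}{\mathrm{std}(R_{1..G})},
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}.
```

Under random rewards the group-normalized advantages are noise, but the clip truncates updates asymmetrically with respect to how far the new policy has drifted from the rollout policy, which the paper argues biases the update toward tokens the pretrained model already assigns high probability, i.e., toward high-prior behaviors such as code reasoning.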
If this is right
- Large apparent gains in RLVR can occur without any reward signal that tracks correctness when the model possesses amplifiable pretraining behaviors.
- RLVR methods must be tested across multiple model families rather than on a single de facto choice such as Qwen.
- Code-reasoning frequency in Qwen models rises sharply under spurious rewards even though no code execution occurs.
- Spurious rewards that succeed on Qwen models produce no gains on Llama3 or OLMo2, showing the effect does not generalize.
Where Pith is reading between the lines
- Algorithms sharing a similar clipping mechanism may produce misleading capability estimates whenever pretraining has already installed useful priors.
- Pretraining data composition could determine which models are susceptible to spurious-reward training, suggesting a need to audit training corpora for such priors.
- Developers might design reward or loss terms that actively suppress amplification of pretraining behaviors to isolate genuine post-training improvements.
Load-bearing premise
The performance gains with spurious rewards are driven primarily by the clipping bias in GRPO rather than other unaccounted factors in training or model-specific quirks.
What would settle it
Training the same Qwen2.5-Math-7B model with a modified GRPO that removes or neutralizes the clip term and checking whether the 21-point gain from random rewards disappears.
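A minimal sketch of that ablation's loss, assuming a standard PyTorch setup; function and variable names here are hypothetical and not taken from the paper's code:

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, eps=0.2, use_clip=True):
    """Per-token GRPO-style surrogate loss (negated objective, to be minimized).

    logp_new, logp_old: log-probabilities of the sampled tokens under the current
        and rollout policies, shape (num_tokens,).
    advantages: the group-normalized advantage of each token's rollout.
    use_clip=False is the hypothetical ablation: a plain importance-weighted
        policy gradient with no clipping, everything else left unchanged.
    """
    ratio = torch.exp(logp_new - logp_old)              # r_t(theta)
    unclipped = ratio * advantages
    if not use_clip:
        return -unclipped.mean()
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # clipped surrogate
```

If the load-bearing premise holds, rerunning the random-reward experiment with use_clip=False should make most of the 21-point gain disappear, and the code-reasoning frequency should stay near its pretraining baseline.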
Original abstract
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper shows that RLVR with GRPO elicits large MATH-500 gains (21.4 pp on Qwen2.5-Math-7B) even under randomly assigned or negatively correlated rewards, nearly matching ground-truth reward gains (29.1 pp). It attributes the effect to a clipping bias in GRPO that amplifies pretraining priors such as code reasoning (frequency rising from 65% to >90%), demonstrates that the phenomenon is model-dependent (absent in Llama3 and OLMo2), and concludes that RLVR results must be validated across model families rather than on a single de facto choice.
Significance. If the empirical observation holds, the work identifies a previously under-appreciated optimization artifact in GRPO that can produce large apparent capability gains without informative rewards. The model-dependence result and the explicit call for cross-family validation are valuable contributions to the RLVR literature, especially given the growing reliance on GRPO-style methods.
major comments (1)
- [explanation of clipping bias and GRPO update] The causal claim that clipping bias is the dominant driver of the 21.4 pp gain under random rewards (abstract and explanation section) rests on an untested assumption. No ablation is reported that removes or relaxes only the clip term while keeping advantage normalization, group-relative baseline, and sampling dynamics fixed; without this isolation, other GRPO components cannot be ruled out as the source of the observed behavioral shift toward code-reasoning.
minor comments (2)
- [case study on code reasoning] The exact numerical value and measurement protocol for the >90% code-reasoning frequency should be stated in the main text rather than left as an inequality; likewise, the precise definition of 'randomly assigned rewards' (e.g., uniform sampling over answer tokens or fixed random labels) needs an explicit equation or pseudocode; one plausible reading is sketched after this list.
- [main results] Table or figure reporting the MATH-500 scores should include standard deviations across seeds or runs to allow assessment of the stability of the 21.4 pp and 29.1 pp deltas.
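The abstract says only that rewards were 'randomly assigned'; the exact protocol is not restated on this page. As a hedged illustration of the ambiguity noted above, one common reading is an i.i.d. coin-flip reward that ignores the rollout entirely:

```python
import random

def random_reward(rollout: str, p: float = 0.5) -> float:
    """One plausible reading of 'randomly assigned rewards': an i.i.d. coin flip
    per rollout, independent of the model's output. The paper may instead fix a
    random answer label per prompt or sample uniformly over answer tokens; this
    sketch is illustrative only, not the paper's protocol."""
    del rollout  # the reward is uninformative by construction
    return 1.0 if random.random() < p else 0.0
```

The distinction matters mechanically: a per-rollout coin flip keeps nonzero reward variance inside each GRPO group, whereas a single reward value fixed per prompt and reused for every rollout in its group would (up to zero-variance handling) zero out the group-relative advantage, so the two readings train very differently.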
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for the constructive feedback on strengthening the causal analysis. We address the major comment below and commit to revisions that will further isolate the clipping mechanism.
Point-by-point responses
Referee: The causal claim that clipping bias is the dominant driver of the 21.4 pp gain under random rewards (abstract and explanation section) rests on an untested assumption. No ablation is reported that removes or relaxes only the clip term while keeping advantage normalization, group-relative baseline, and sampling dynamics fixed; without this isolation, other GRPO components cannot be ruled out as the source of the observed behavioral shift toward code-reasoning.
Authors: We appreciate the referee's emphasis on isolating the clip term. Our current attribution relies on the explicit form of the GRPO surrogate objective (where the clip term creates an asymmetric update favoring high-prior tokens) together with the observed code-reasoning frequency shift and the absence of similar gains under PPO-style clipping in related experiments. Nevertheless, we agree that a direct ablation (removing or relaxing only the clip term while freezing advantage normalization, group-relative baselines, and sampling) would provide stronger causal evidence. In the revised manuscript we will add this ablation, reporting MATH-500 performance and code-reasoning rates under random rewards for both the clipped and unclipped GRPO variants on Qwen2.5-Math-7B.
Revision: yes
Circularity Check
No circularity: empirical results rest on direct measurements
full rationale
The paper reports concrete experimental outcomes—21.4 pp MATH-500 gain under random rewards versus 29.1 pp under ground-truth rewards for Qwen2.5-Math-7B, plus the measured rise in code-reasoning frequency from 65% to >90%—obtained by running GRPO training and counting observable behaviors. These quantities are computed from held-out test sets and token-level traces, not derived from fitted parameters or self-referential definitions. The clipping-bias explanation is offered as a post-hoc interpretation of the same runs rather than a load-bearing premise that reduces to prior self-citations or ansatzes. No equations or uniqueness theorems are invoked that collapse back to the input data by construction.
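The 65% to over 90% code-reasoning figures discussed above are counts of an observable behavior in generation traces. The detection protocol is not restated on this page; a hypothetical heuristic for such a count, assuming traces are plain strings and Python-looking structure is the marker of interest, might look like:

```python
import re

# Hypothetical heuristic, not the paper's protocol: flag a trace as
# "code reasoning" if it contains Python-looking structure.
_CODE_PATTERN = re.compile(
    r"^\s*(def |import |while |for .+ in .+:|print\()", re.MULTILINE
)

def code_reasoning_rate(traces):
    """Fraction of generation traces that contain code-like reasoning."""
    flagged = sum(1 for t in traces if _CODE_PATTERN.search(t))
    return flagged / max(len(traces), 1)
```

Any real measurement would need to match the paper's definition of code reasoning (reasoning in code without actually executing it); the sketch is only meant to make the 'direct measurement' claim concrete.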
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) The GRPO algorithm and its clipping mechanism, as previously defined in prior work.
Forward citations
Cited by 20 Pith papers
- FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale. FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning. ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
- A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions. The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning. This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Reward Hacking in Rubric-Based Reinforcement Learning. Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...
- Hölder Policy Optimisation. HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
- Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models. AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models. Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation. The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
- The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling. APPS approximates power sampling for LLM reasoning via parallel particle propagation with future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs on benchmarks.
- The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling. APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training...
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning. ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
- SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning. Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment. HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Characterizing Model-Native Skills. Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution. Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
- ThetaEvolve: Test-time Learning on Open Problems. ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- Beyond Distribution Sharpening: The Importance of Task Rewards. Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning. PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.