Spurious Rewards: Rethinking Training Signals in RLVR

Canonical reference. 80% of citing Pith papers cite this work as background.

32 Pith papers cite this work.
abstract

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
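The abstract attributes the effect to the clip term in GRPO. As a rough illustration only, the sketch below shows the two standard ingredients involved: group-normalized advantages and the PPO-style clipped surrogate. It is not the authors' code. The function names, the group size of 8, the Bernoulli(0.5) random reward, and the clip range `eps=0.2` are all assumptions chosen for the example.

```python
import random
import statistics


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples got the same reward, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def clip_is_active(ratio, advantage, eps=0.2):
    """True when the clipped branch is selected, so the term no longer depends on the ratio."""
    return (advantage > 0 and ratio > 1.0 + eps) or (advantage < 0 and ratio < 1.0 - eps)


# Hypothetical setup: 8 sampled responses to one prompt, each given a random 0/1 reward
# that ignores correctness entirely.
rng = random.Random(0)
rewards = [float(rng.random() < 0.5) for _ in range(8)]
advantages = group_advantages(rewards)

# The advantages sum to zero within the group, so the reward itself carries no
# information about which responses are correct.
print(rewards)
print([round(a, 3) for a in advantages])

# The clip is one-sided with respect to the sign of the advantage.
for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, adv, clipped_surrogate(ratio, adv), clip_is_active(ratio, adv))
```

The clip is active only when the probability ratio has already moved in the direction the advantage favors, which is the asymmetry the abstract's "clipping bias" refers to. How that asymmetry interacts with a model's prior is the paper's argument and is not reproduced here.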

citation summary

years: 2026 (24), 2025 (8)
roles: background (5)
polarities: background (4), unclear (1)

representative citing papers

Reasoning with Sampling: Cutting at Decision Points

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Entropy-Cut Metropolis-Hastings targets high-entropy decision points for resampling, yielding mixing time that scales with the number of decisions and consistent gains over baselines on MATH500, HumanEval, GPQA Diamond, and AIME26.

Consolidating Rewarded Perturbations for LLM Post-Training

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.

Label-Free Reinforcement Learning via Cross-Model Entropy

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Cross-Model Entropy supplies a continuous label-free reward for RL post-training by averaging a generator's response log-likelihood under an independent verifier model, yielding win-rate gains on instruction following.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Hölder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

ThetaEvolve: Test-time Learning on Open Problems

cs.LG · 2025-11-28 · conditional · novelty 6.0

ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies cs.CL · 2025-09-22 · conditional · none · ref 33 · internal anchor

    A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.

  • ThetaEvolve: Test-time Learning on Open Problems cs.LG · 2025-11-28 · conditional · none · ref 1 · internal anchor

    ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.

  • Auditing Data Membership in Reinforcement Learning With Verifiable Rewards cs.CR · 2025-11-18 · unverdicted · none · ref 37 · internal anchor

    DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.

  • GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 41 · internal anchor

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  • What Is Preference Optimization Doing, and Why? cs.LG · 2025-11-30 · unverdicted · none · ref 8 · internal anchor

    Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.

  • A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning cs.LG · 2025-10-21 · unverdicted · none · ref 16 · 2 links · internal anchor

    SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.

  • Self-Rewarding Vision-Language Model via Reasoning Decomposition cs.CV · 2025-08-27 · unverdicted · none · ref 16 · internal anchor

    Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.

  • PRL: Prompts from Reinforcement Learning cs.AI · 2025-05-20 · unverdicted · none · ref 5 · internal anchor

    PRL is a reinforcement learning method that generates novel prompts and achieves state-of-the-art results on text classification, simplification, and summarization benchmarks, outperforming APE and EvoPrompt.