FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
hub Canonical reference
Spurious Rewards: Rethinking Training Signals in RLVR
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.
SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.
citing papers explorer
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
-
A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
The paper delivers the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models across data-centric, training-centric, and framework-centric views.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
-
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
-
Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP
GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
-
Holder Policy Optimisation
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
-
Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models
AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
-
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.
-
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
-
ThetaEvolve: Test-time Learning on Open Problems
ThetaEvolve enables small open-source LLMs to achieve new best-known bounds on open problems such as circle packing by combining test-time RL with a large program database and lazy penalties.
-
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
-
What Is Preference Optimization Doing, and Why?
Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.
-
A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning
SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.
-
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
-
PRL: Prompts from Reinforcement Learning
PRL is a reinforcement learning method that generates novel prompts and achieves state-of-the-art results on text classification, simplification, and summarization benchmarks, outperforming APE and EvoPrompt.