Understanding R1-Zero-Like Training: A Critical Perspective
Pith reviewed 2026-05-11 04:20 UTC · model grok-4.3
The pith
Certain base models already contain reasoning ability, and removing a length bias from GRPO allows a minimalist RL method to achieve state-of-the-art math performance with 7B models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeek-V3-Base already exhibits an 'Aha moment', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates. GRPO has an optimization bias that artificially increases response length especially for incorrect outputs. Dr. GRPO is introduced as an unbiased optimization method that improves token efficiency while maintaining reasoning performance. A minimalist R1-Zero recipe achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art.
What carries the argument
Dr. GRPO, an unbiased optimization method derived from GRPO that eliminates the artificial increase in length of incorrect responses during training.
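The length bias can be illustrated with a toy calculation. The sketch below assumes the commonly described GRPO form, in which each response's token-level loss is averaged over that response's own length (a 1/|o_i| factor). This page does not spell out the objective, so that form, the function name `per_token_weight`, and all numbers are illustrative assumptions rather than the authors' code. With a fixed negative advantage, a longer incorrect response receives a smaller per-token penalty under length normalization; dropping the factor keeps the per-token penalty constant.

```python
def per_token_weight(advantage: float, length: int, length_normalized: bool) -> float:
    """Weight multiplying each token's gradient term for one response.

    Assumption: with length normalization the response-level advantage is
    divided by the response length; without it the advantage is used as-is.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return advantage / length if length_normalized else advantage


# Hypothetical incorrect responses that share the same negative advantage.
advantage = -1.0
short_len, long_len = 100, 1000

# Length-normalized (GRPO-style, as assumed above).
norm_short = per_token_weight(advantage, short_len, length_normalized=True)
norm_long = per_token_weight(advantage, long_len, length_normalized=True)

# No length normalization (the kind of correction the summary attributes to Dr. GRPO).
raw_short = per_token_weight(advantage, short_len, length_normalized=False)
raw_long = per_token_weight(advantage, long_len, length_normalized=False)

print("normalized:", norm_short, norm_long)
print("unnormalized:", raw_short, raw_long)
```

Under the assumed normalized form, the long incorrect response is penalized ten times less per token than the short one, which is one way a preference for longer wrong answers could arise.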
If this is right
- Base models with built-in reasoning traits can reduce the need for elaborate prompting or additional supervised data.
- Removing the length bias in GRPO leads to more efficient use of tokens during training.
- A minimalist recipe can set new performance records on difficult math tests like AIME for models as small as 7B parameters.
- Insights into pretraining characteristics help in selecting better starting points for RL-based reasoning enhancement.
Where Pith is reading between the lines
- This indicates that much of the reasoning capability may already be present in well-pretrained base models, reducing the role of complex post-training.
- Similar optimization biases might be present in other RL algorithms used for LLMs, warranting checks in future work.
- The minimalist recipe could potentially be adapted to other reasoning domains such as coding or scientific problem solving.
- Reproducing the results on additional benchmarks would help confirm how broadly the base model advantages apply.
Load-bearing premise
The accuracy improvements result mainly from the base model properties and the Dr. GRPO correction instead of differences in training hyperparameters, data filtering, or evaluation setup.
What would settle it
An independent reproduction of the minimalist recipe with the 7B base model. Reaching 43.3% accuracy on AIME 2024 would support the claim; falling short would suggest that other factors drive the gains.
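A reproduction should also account for how coarse the benchmark is. AIME 2024 comprises 30 problems (15 each in AIME I and AIME II), so 43.3% is consistent with 13 of 30 solved if each problem is scored once; whether the paper scores single attempts or averages over samples is not stated here, so that reading is an assumption. The sketch below computes a 95% Wilson score interval for 13/30 to show how wide the uncertainty is at this sample size. The helper name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Assumption: 43.3% corresponds to 13 of 30 problems, each scored once.
low, high = wilson_interval(13, 30)
print(f"13/30 = {13 / 30:.3f}, 95% Wilson interval = ({low:.3f}, {high:.3f})")
```

Because the interval spans several tens of percentage points, a reproduction that differs from the reported score by one or two problems would neither confirm nor refute the claim on its own.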
Original abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper critically examines R1-Zero-like RL training for LLMs by studying the influence of base-model pretraining characteristics (e.g., 'Aha moments' in DeepSeek-V3-Base and prompt-independent reasoning in Qwen2.5) and an optimization bias in GRPO that inflates response lengths, particularly for incorrect outputs. It introduces Dr. GRPO to remove this bias, then presents a minimalist training recipe achieving 43.3% accuracy on AIME 2024 with a 7B model, claimed as new SOTA. Public code is released.
Significance. If the accuracy gains are shown to stem specifically from the diagnosed base-model properties and Dr. GRPO rather than unstated implementation differences, the work supplies useful mechanistic insights into efficient RL for reasoning and a practical recipe that could guide training of small-scale reasoning models. The public code release strengthens reproducibility and allows direct verification of the claims.
major comments (2)
- [Experiments and Results (around the minimalist recipe and Table reporting AIME scores)] The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.
- [Analysis of GRPO optimization bias] The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.
minor comments (2)
- [Abstract] The abstract states experiments across 'a wide range of base models' but only names DeepSeek-V3-Base and Qwen2.5; an explicit list or table of all models and their sizes would improve clarity.
- [Method section introducing Dr. GRPO] The acronym 'Dr. GRPO' is introduced without spelling out the full name or explaining the 'Dr.' prefix on first use in the main text.
Simulated Author's Rebuttal
Thank you for your detailed review and constructive feedback. We appreciate the emphasis on rigorous experimental controls and statistical analysis to support our claims. We address each major comment below and outline the revisions to be incorporated in the updated version of the manuscript.
Point-by-point responses
-
Referee: The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.
Authors: We agree that matched ablations are essential for robust attribution. Our public code release already implements the full minimalist recipe with explicit details on data filtering, prompts, schedules, and evaluation, enabling direct verification against baselines. In the revision, we will add a new ablation table and section that explicitly matches these elements to the reported setups in the baseline works (e.g., DeepSeek-R1 and related papers). This will include side-by-side results isolating the contributions of base-model pretraining properties and Dr. GRPO, confirming the 43.3% AIME score stems from the diagnosed factors rather than unstated differences. revision: yes
-
Referee: The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.
Authors: We acknowledge the value of greater statistical rigor. We have rerun the relevant experiments with 5 random seeds and will report means, standard deviations, and paired t-test p-values for length differences between correct and incorrect outputs. The revision will include quantitative before/after length distribution comparisons (including histograms and summary statistics) across all base models tested. These additions provide stronger quantitative support for the bias diagnosis and Dr. GRPO's corrective effect while preserving the original mechanistic insights. revision: yes
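The paired comparison described in this response can be sketched with the standard library alone. The numbers below are placeholders invented purely to make the example run; they are not results from the paper or the rebuttal, and the helper name `paired_t_statistic` is hypothetical. Each list holds one mean response length per seed, paired by seed, and the statistic is compared against 2.776, the two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for a paired t-test on a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# PLACEHOLDER data: mean token length per seed (5 seeds), not real measurements.
incorrect_len = [1210.0, 1185.0, 1302.0, 1250.0, 1275.0]
correct_len = [980.0, 1010.0, 1050.0, 995.0, 1040.0]

t_stat, dof = paired_t_statistic(incorrect_len, correct_len)
T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, 4 degrees of freedom
print(f"t = {t_stat:.2f}, df = {dof}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

With only five pairs the test has little power for small effects, so reporting the per-seed differences alongside the p-value would make the comparison easier to judge.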
Circularity Check
No circularity: empirical analysis and new method are self-contained
Full rationale
The paper conducts empirical investigations across base models, diagnoses a length bias in GRPO, introduces Dr. GRPO as an unbiased alternative, and validates a minimalist training recipe via direct experiments on AIME 2024. No equations, predictions, or central claims reduce to fitted inputs or self-citations by construction. The work is benchmarked against external results and releases public code, satisfying the criteria for a non-circular empirical contribution.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
-
Tracing Uncertainty in Language Model "Reasoning"
Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
-
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
-
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
-
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
A multi-agent binary reward system with unbiased GRPO post-training on ICLR-320 data outperforms baselines on expert-rated novelty, feasibility, and effectiveness for scientific idea generation.
-
Foresight Optimization for Strategic Reasoning in Large Language Models
FoPO trains LLMs for strategic reasoning by combining self-interest with opponent modeling in policy optimization, yielding gains on two new datasets and better out-of-domain generalization than standard baselines.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
AsymGRPO refines policy entropy in RLVR by preserving informative entropy on positive rollouts and suppressing spurious entropy on negative ones, outperforming baselines.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
-
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
Hölder Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
-
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
G-Zero: Self-Play for Open-Ended Generation from Zero Data
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
-
Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Binary Rewards and Reinforcement Learning: Fundamental Challenges
Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
-
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space
GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baseli...
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
TRN-R1-Zero is an RL-only post-training method that lets LLMs perform zero-shot node, edge, and graph reasoning on text-rich networks without supervised data or larger-model distillation.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Calibration-Aware Policy Optimization for Reasoning LLMs
CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
-
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
-
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
-
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
SPPO enables stable, sample-efficient alignment of LLMs on long-horizon reasoning tasks by using a decoupled scalar value function for low-variance advantages without multi-sampling.
-
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.