Understanding R1-Zero-Like Training: A Critical Perspective

Changyu Chen; Chao Du; Min Lin; Penghui Qi; Tianyu Pang; Wee Sun Lee; Wenjun Li; Zichen Liu

arXiv: 2503.20783 · v2 · submitted 2025-03-26 · cs.LG · cs.AI · cs.CL

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Pith reviewed 2026-05-11 04:20 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords reinforcement learning · LLM reasoning · GRPO · base models · AIME · optimization bias · Dr. GRPO · R1-Zero

The pith

Certain base models already contain reasoning ability, and removing a length bias from GRPO allows a minimalist RL method to achieve state-of-the-art math performance with 7B models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates R1-Zero-like reinforcement learning for improving LLM reasoning without supervised fine-tuning. It tests various base models and discovers that some naturally exhibit 'Aha moments' or strong reasoning even without prompt templates, pointing to pretraining effects. The work also uncovers that GRPO tends to favor longer incorrect responses, which it corrects with a new unbiased method called Dr. GRPO. Using these observations, the authors build a simple training recipe that reaches 43.3% accuracy on the AIME 2024 benchmark using a 7B model. This approach shows how understanding base models and fixing optimization biases can lead to more effective and efficient reasoning training.

Core claim

DeepSeek-V3-Base already exhibits an 'Aha moment', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates. GRPO has an optimization bias that artificially increases response length especially for incorrect outputs. Dr. GRPO is introduced as an unbiased optimization method that improves token efficiency while maintaining reasoning performance. A minimalist R1-Zero recipe achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art.

What carries the argument

Dr. GRPO, an unbiased optimization method derived from GRPO that eliminates the artificial increase in length of incorrect responses during training.
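
A minimal sketch of where such a length bias can come from. This page does not spell out the formulas, so the sketch assumes that GRPO standardizes rewards within a sampled group and divides each response's objective by its own token count, and that Dr. GRPO drops both normalizers. The function names and the toy rewards and lengths are hypothetical.

```python
from statistics import mean, pstdev


def grpo_token_weights(rewards, lengths, eps=1e-6):
    """Per-token weight under the assumed GRPO-style normalization.

    Assumption: advantage = (r - group mean) / group std, and the
    per-response objective is averaged over that response's tokens,
    so every token carries advantage / length.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    weights = []
    for r, n in zip(rewards, lengths):
        advantage = (r - mu) / (sigma + eps)
        weights.append(advantage / n)
    return weights


def dr_grpo_token_weights(rewards, lengths):
    """Per-token weight under the assumed Dr. GRPO-style objective.

    Assumption: only the group mean is subtracted; there is no division
    by the response length or by the reward std. `lengths` is accepted
    for a matching signature and deliberately ignored.
    """
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Toy group: one correct response and two incorrect ones of different length.
toy_rewards = [1.0, 0.0, 0.0]
toy_lengths = [200, 100, 1000]

grpo = grpo_token_weights(toy_rewards, toy_lengths)
dr = dr_grpo_token_weights(toy_rewards, toy_lengths)
```

Under these assumptions, the 1000-token incorrect response receives a per-token penalty ten times smaller than the 100-token incorrect response in the GRPO-style variant, which is one way longer wrong answers can be favored. In the Dr. GRPO-style variant both incorrect responses receive the same per-token weight.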

If this is right

Base models with built-in reasoning traits can reduce the need for elaborate prompting or additional supervised data.
Removing the length bias in GRPO leads to more efficient use of tokens during training.
A minimalist recipe can set new performance records on difficult math tests like AIME for models as small as 7B parameters.
Insights into pretraining characteristics help in selecting better starting points for RL-based reasoning enhancement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings suggest that much of the reasoning capability may already be present in well-pretrained base models, which would reduce the role of complex post-training.
Similar optimization biases might be present in other RL algorithms used for LLMs, warranting checks in future work.
The minimalist recipe could potentially be adapted to other reasoning domains such as coding or scientific problem solving.
Reproducing the results on additional benchmarks would help confirm how broadly the base model advantages apply.

Load-bearing premise

The accuracy improvements result mainly from the base model properties and the Dr. GRPO correction instead of differences in training hyperparameters, data filtering, or evaluation setup.

What would settle it

Reproduce the 43.3% accuracy on AIME 2024 by following the minimalist recipe with the 7B base model. Matching the score supports the claim, while falling short suggests that other factors drive the gains.
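
A small arithmetic sketch of what "falling short" means at this benchmark's granularity. It assumes AIME 2024 is scored over 30 problems (AIME I and II, 15 each) with one graded answer per problem; the page does not state the evaluation protocol, and the helper names are hypothetical.

```python
AIME_2024_PROBLEMS = 30  # assumption: AIME I + AIME II, 15 problems each


def solved_count(accuracy_pct, n_problems=AIME_2024_PROBLEMS):
    """Nearest whole number of solved problems for a reported accuracy."""
    return round(accuracy_pct / 100 * n_problems)


def accuracy_pct(solved, n_problems=AIME_2024_PROBLEMS):
    """Accuracy in percent for a given number of solved problems."""
    return 100 * solved / n_problems


reported = solved_count(43.3)            # problems implied by the reported score
one_fewer = accuracy_pct(reported - 1)   # score if a single problem flips
```

Under this assumption the reported score corresponds to 13 of 30 problems, and a single flipped problem moves the score by about 3.3 points, so a reproduction that lands one problem lower is weak evidence either way.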

Original abstract

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper identifies a length bias in GRPO that favors longer wrong answers and offers Dr. GRPO as a fix. It also shows that certain base models already carry reasoning sparks from pretraining. The 43.3% AIME claim, however, needs tighter controls to pin the gains on those factors.

The main points are straightforward. The authors run RL on several base models and notice that DeepSeek-V3-Base already produces the self-reflective behavior people call Aha moments, while Qwen2.5 models do decent reasoning even without prompt templates. They also show that standard GRPO increases response length more for incorrect outputs than correct ones, which looks like an optimization artifact. Their Dr. GRPO variant removes that bias, keeps accuracy, and improves token efficiency. They then combine the observations into a simple training recipe that reaches 43.3% on AIME 2024 with a 7B model and call it a new state of the art. Code is released, which helps verification.

Referee Report

2 major / 2 minor

Summary. The paper critically examines R1-Zero-like RL training for LLMs by studying the influence of base-model pretraining characteristics (e.g., 'Aha moments' in DeepSeek-V3-Base and prompt-independent reasoning in Qwen2.5) and an optimization bias in GRPO that inflates response lengths, particularly for incorrect outputs. It introduces Dr. GRPO to remove this bias, then presents a minimalist training recipe achieving 43.3% accuracy on AIME 2024 with a 7B model, claimed as new SOTA. Public code is released.

Significance. If the accuracy gains are shown to stem specifically from the diagnosed base-model properties and Dr. GRPO rather than unstated implementation differences, the work supplies useful mechanistic insights into efficient RL for reasoning and a practical recipe that could guide training of small-scale reasoning models. The public code release strengthens reproducibility and allows direct verification of the claims.

major comments (2)

[Experiments and Results (around the minimalist recipe and Table reporting AIME scores)] The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.
[Analysis of GRPO optimization bias] The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.

minor comments (2)

[Abstract] The abstract states experiments across 'a wide range of base models' but only names DeepSeek-V3-Base and Qwen2.5; an explicit list or table of all models and their sizes would improve clarity.
[Method section introducing Dr. GRPO] The acronym 'Dr. GRPO' is introduced without spelling out the full name or explaining the 'Dr.' prefix on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed review and constructive feedback. We appreciate the emphasis on rigorous experimental controls and statistical analysis to support our claims. We address each major comment below and outline the revisions to be incorporated in the updated version of the manuscript.

Referee: The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.

Authors: We agree that matched ablations are essential for robust attribution. Our public code release already implements the full minimalist recipe with explicit details on data filtering, prompts, schedules, and evaluation, enabling direct verification against baselines. In the revision, we will add a new ablation table and section that explicitly matches these elements to the reported setups in the baseline works (e.g., DeepSeek-R1 and related papers). This will include side-by-side results isolating the contributions of base-model pretraining properties and Dr. GRPO, confirming the 43.3% AIME score stems from the diagnosed factors rather than unstated differences.
Revision: yes

Referee: The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.

Authors: We acknowledge the value of greater statistical rigor. We have rerun the relevant experiments with 5 random seeds and will report means, standard deviations, and paired t-test p-values for length differences between correct and incorrect outputs. The revision will include quantitative before/after length distribution comparisons (including histograms and summary statistics) across all base models tested. These additions provide stronger quantitative support for the bias diagnosis and Dr. GRPO's corrective effect while preserving the original mechanistic insights.
Revision: yes
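
The paired comparison described in the second response can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: each seed is assumed to contribute one mean length for incorrect outputs and one for correct outputs, the function name is hypothetical, and the toy numbers are placeholders rather than results from the paper. Only the t statistic and degrees of freedom are computed; turning them into a p-value requires the Student t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)) with d_i = a_i - b_i and n pairs.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs))), len(diffs) - 1


# Placeholder values only (one entry per hypothetical seed), not real data.
toy_incorrect_lengths = [3.0, 5.0, 7.0]
toy_correct_lengths = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(toy_incorrect_lengths, toy_correct_lengths)
```

Pairing by seed removes between-seed variation from the comparison, which matters when only a handful of seeds (five in the response above) are available.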

Circularity Check

0 steps flagged

No circularity: empirical analysis and new method are self-contained

The paper conducts empirical investigations across base models, diagnoses a length bias in GRPO, introduces Dr. GRPO as an unbiased alternative, and validates a minimalist training recipe via direct experiments on AIME 2024. No equations, predictions, or central claims reduce to fitted inputs or self-citations by construction. The work is benchmarked against external results and releases public code, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study; no explicit free parameters, axioms, or invented entities are introduced beyond standard RL training components and the Dr. GRPO modification of existing GRPO.

pith-pipeline@v0.9.0 · 5532 in / 1114 out tokens · 54434 ms · 2026-05-11T04:20:27.690359+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
cs.CV 2026-05 unverdicted novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
cs.CV 2026-05 unverdicted novelty 7.0

CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33....
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
cs.CV 2026-05 unverdicted novelty 7.0

CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
Tracing Uncertainty in Language Model "Reasoning"
cs.LG 2026-05 unverdicted novelty 7.0

Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
cs.LG 2026-05 unverdicted novelty 7.0

HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
cs.HC 2026-05 unverdicted novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
cs.LG 2026-04 unverdicted novelty 7.0

EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
cs.AI 2026-04 unverdicted novelty 7.0

A multi-agent binary reward system with unbiased GRPO post-training on ICLR-320 data outperforms baselines on expert-rated novelty, feasibility, and effectiveness for scientific idea generation.
Foresight Optimization for Strategic Reasoning in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

FoPO trains LLMs for strategic reasoning by combining self-interest with opponent modeling in policy optimization, yielding gains on two new datasets and better out-of-domain generalization than standard baselines.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI 2026-04 unverdicted novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
cs.CL 2026-04 unverdicted novelty 7.0

AsymGRPO refines policy entropy in RLVR by preserving informative entropy on positive rollouts and suppressing spurious entropy on negative ones, outperforming baselines.
DeonticBench: A Benchmark for Reasoning over Rules
cs.CL 2026-04 unverdicted novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
cs.LG 2026-02 unverdicted novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models
cs.AI 2026-01 conditional novelty 7.0

Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding...
SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models
cs.AI 2026-01 unverdicted novelty 7.0

SCRIBE introduces skill-conditioned rewards with intermediate behavioral evaluation to reduce noise in training tool-augmented agents, raising AIME25 accuracy from 43.3% to 63.3% on a Qwen3-4B model.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
cs.LG 2025-07 unverdicted novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
cs.AI 2025-05 unverdicted novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
cs.LG 2025-04 accept novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
SeedER: Seed-and-Expand Retrieval from Knowledge Graphs
cs.LG 2026-05 unverdicted novelty 6.0

SeedER uses initial dense seeding followed by RL-driven selective expansion to improve recall on compositional KG queries while limiting candidate set size.
F-TIS: Harnessing Diverse Models in Collaborative GRPO
cs.LG 2026-05 unverdicted novelty 6.0

F-TIS enables heterogeneous model collaboration in GRPO by filtering off-policy samples, matching on-policy convergence while improving out-of-distribution performance by up to 12% in some setups.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO computes token-level advantages via Bayesian recursion on oracle signals, recovering distillation methods as a special case and improving over GRPO on math and code benchmarks.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

IH-GRPO introduces implicit hierarchical control via a surrogate loss to decouple tool invocation from execution in LLMs, reporting 1.87-2.53% gains on math reasoning benchmarks.
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains
cs.CL 2026-05 unverdicted novelty 6.0

K2V extends RLVR to knowledge-intensive domains by synthesizing verifiable data and verifying reasoning processes, yielding improved domain reasoning with preserved general capabilities.
Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

CodeThinker improves LLM code reasoning via consistency-based RL with stepwise training data, dynamic beam sampling, and consistency rewards, reaching SOTA on benchmarks with 4.3% gains on Qwen2.5-Coder-7B.
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
cs.AI 2026-05 unverdicted novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
cs.AI 2026-05 unverdicted novelty 6.0

SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
Self-Supervised On-Policy Distillation for Reasoning Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIM...
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on...
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
Teacher-Guided Policy Optimization for LLM Distillation
cs.LG 2026-05 unverdicted novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
cs.HC 2026-05 unverdicted novelty 6.0

UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
cs.LG 2026-05 unverdicted novelty 6.0

FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
G-Zero: Self-Play for Open-Ended Generation from Zero Data
cs.LG 2026-05 unverdicted novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.