arXiv: 2503.20783 · v2 · submitted 2025-03-26 · cs.LG · cs.AI · cs.CL

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Pith reviewed 2026-05-11 04:20 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords reinforcement learning · LLM reasoning · GRPO · base models · AIME · optimization bias · Dr. GRPO · R1-Zero

The pith

Certain base models already contain reasoning ability, and removing a length bias from GRPO allows a minimalist RL method to achieve state-of-the-art math performance with 7B models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates R1-Zero-like reinforcement learning for improving LLM reasoning without supervised fine-tuning. It tests various base models and discovers that some naturally exhibit 'Aha moments' or strong reasoning even without prompt templates, pointing to pretraining effects. The work also uncovers that GRPO tends to favor longer incorrect responses, which it corrects with a new unbiased method called Dr. GRPO. Using these observations, the authors build a simple training recipe that reaches 43.3% accuracy on the AIME 2024 benchmark using a 7B model. This approach shows how understanding base models and fixing optimization biases can lead to more effective and efficient reasoning training.

Core claim

DeepSeek-V3-Base already exhibits an 'Aha moment', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates. GRPO has an optimization bias that artificially increases response length especially for incorrect outputs. Dr. GRPO is introduced as an unbiased optimization method that improves token efficiency while maintaining reasoning performance. A minimalist R1-Zero recipe achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art.

What carries the argument

Dr. GRPO, an unbiased optimization method derived from GRPO that eliminates the artificial increase in length of incorrect responses during training.
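This page does not reproduce the paper's equations, so the following is only a minimal sketch of the mechanism under stated assumptions: the GRPO-style loss is assumed to standardize each reward within its sampling group and then average the token losses over each response's own length, while the Dr. GRPO-style variant is assumed to drop both the standard-deviation scaling and the per-response length division. The function names, the toy rewards, and the toy lengths are hypothetical.

```python
from statistics import mean, pstdev


def grpo_style_token_weights(rewards, lengths, eps=1e-8):
    """Per-token weight per response: standardized advantage / own length.

    Assumption: population standard deviation; a real implementation
    may use the sample standard deviation instead.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [((r - mu) / (sigma + eps)) / n for r, n in zip(rewards, lengths)]


def dr_grpo_style_token_weights(rewards, lengths):
    """Per-token weight per response: mean-centered advantage only."""
    mu = mean(rewards)
    # `lengths` is accepted for a matching signature but deliberately unused.
    return [r - mu for r in rewards]


# Hypothetical group: one correct response and three incorrect ones.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [200, 100, 1000, 400]

grpo_w = grpo_style_token_weights(rewards, lengths)
dr_w = dr_grpo_style_token_weights(rewards, lengths)

# Ratio of per-token penalties: 100-token vs 1000-token incorrect response.
short_vs_long = grpo_w[1] / grpo_w[2]

print("GRPO-style per-token weights:   ", grpo_w)
print("Dr. GRPO-style per-token weights:", dr_w)
print("short/long penalty ratio (GRPO-style):", short_vs_long)
```

Under these assumptions, each token of the 1000-token incorrect response is penalized ten times less than each token of the 100-token incorrect response in the GRPO-style weighting, which is one way a preference for longer wrong answers could arise. In the Dr. GRPO-style weighting, every incorrect response receives the same per-token penalty regardless of its length.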

If this is right

Base models with built-in reasoning traits can reduce the need for elaborate prompting or additional supervised data.
Removing the length bias in GRPO leads to more efficient use of tokens during training.
A minimalist recipe can set new performance records on difficult math tests like AIME for models as small as 7B parameters.
Insights into pretraining characteristics help in selecting better starting points for RL-based reasoning enhancement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings suggest that much of the reasoning capability may already be present in well-pretrained base models, reducing the role of complex post-training.
Similar optimization biases might be present in other RL algorithms used for LLMs, warranting checks in future work.
The minimalist recipe could potentially be adapted to other reasoning domains such as coding or scientific problem solving.
Reproducing the results on additional benchmarks would help confirm how broadly the base model advantages apply.

Load-bearing premise

The accuracy improvements result mainly from the base model properties and the Dr. GRPO correction instead of differences in training hyperparameters, data filtering, or evaluation setup.

What would settle it

An independent reproduction of the minimalist recipe with the 7B base model on AIME 2024: reaching 43.3% supports the claim, while falling short suggests that other factors drive the gains.
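To gauge how tight such a reproduction has to be, it helps to translate the headline number into problem counts. The sketch below assumes the benchmark is the 30 problems of AIME 2024 (I and II, 15 each) and that the score is a plain fraction of problems solved; the binomial standard error additionally treats problems as independent and equally hard, which is only a rough approximation. None of these assumptions is stated on this page.

```python
from math import sqrt

N_PROBLEMS = 30  # assumption: AIME 2024 I and II, 15 problems each


def implied_correct(accuracy_pct, n=N_PROBLEMS):
    """Number of solved problems implied by a percentage accuracy."""
    return round(accuracy_pct / 100 * n)


def binomial_std_error(k, n=N_PROBLEMS):
    """Standard error of the accuracy k/n under a simple binomial model."""
    p = k / n
    return sqrt(p * (1 - p) / n)


k = implied_correct(43.3)              # problems solved at 43.3%
per_problem_pct = 100 / N_PROBLEMS     # accuracy change from one problem
se_pct = 100 * binomial_std_error(k)   # rough uncertainty, in points

print(f"{k} of {N_PROBLEMS} problems solved")
print(f"one problem is worth {per_problem_pct:.1f} points")
print(f"binomial standard error is about {se_pct:.1f} points")
```

Under these assumptions a single problem moves the score by several points, so a reproduction that lands one or two problems away from the reported figure is hard to distinguish from noise on this benchmark alone.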

Original abstract

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an 'Aha moment', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper identifies a length bias in GRPO that favors longer wrong answers and offers Dr. GRPO as a fix. It also shows that certain base models already carry reasoning ability from pretraining. The 43.3% AIME claim, however, needs tighter controls to pin the gains on those factors.

The main points are straightforward. The authors run RL on several base models and notice that DeepSeek-V3-Base already produces the self-reflective turns people call Aha moments, while Qwen2.5 models reason well even without prompt templates. They also show that standard GRPO increases response length more for incorrect outputs than for correct ones, which looks like an optimization artifact. Their Dr. GRPO variant removes that bias, keeps accuracy, and improves token efficiency. They then combine the observations into a simple training recipe that reaches 43.3% on AIME 2024 with a 7B model and call it a new state of the art. Code is released, which helps verification.

Referee Report

2 major / 2 minor

Summary. The paper critically examines R1-Zero-like RL training for LLMs by studying the influence of base-model pretraining characteristics (e.g., 'Aha moments' in DeepSeek-V3-Base and prompt-independent reasoning in Qwen2.5) and an optimization bias in GRPO that inflates response lengths, particularly for incorrect outputs. It introduces Dr. GRPO to remove this bias, then presents a minimalist training recipe achieving 43.3% accuracy on AIME 2024 with a 7B model, claimed as new SOTA. Public code is released.

Significance. If the accuracy gains are shown to stem specifically from the diagnosed base-model properties and Dr. GRPO rather than unstated implementation differences, the work supplies useful mechanistic insights into efficient RL for reasoning and a practical recipe that could guide training of small-scale reasoning models. The public code release strengthens reproducibility and allows direct verification of the claims.

major comments (2)

[Experiments and Results (around the minimalist recipe and Table reporting AIME scores)] The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.
[Analysis of GRPO optimization bias] The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.

minor comments (2)

[Abstract] The abstract states experiments across 'a wide range of base models' but only names DeepSeek-V3-Base and Qwen2.5; an explicit list or table of all models and their sizes would improve clarity.
[Method section introducing Dr. GRPO] The acronym 'Dr. GRPO' is introduced without spelling out the full name or explaining the 'Dr.' prefix on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed review and constructive feedback. We appreciate the emphasis on rigorous experimental controls and statistical analysis to support our claims. We address each major comment below and outline the revisions to be incorporated in the updated version of the manuscript.

Referee: The central SOTA claim (43.3% AIME 2024 with 7B model) is load-bearing for the paper's contribution, yet the manuscript provides no explicit controls or matched ablations confirming that data filtering, prompt templates, learning-rate schedules, and evaluation protocols are identical to those used in the baselines being surpassed. Without this isolation, attribution of gains to the identified 'Aha moments,' pretraining biases, or Dr. GRPO remains unverified.

Authors: We agree that matched ablations are essential for robust attribution. Our public code release already implements the full minimalist recipe with explicit details on data filtering, prompts, schedules, and evaluation, enabling direct verification against baselines. In the revision, we will add a new ablation table and section that explicitly matches these elements to the reported setups in the baseline works (e.g., DeepSeek-R1 and related papers). This will include side-by-side results isolating the contributions of base-model pretraining properties and Dr. GRPO, confirming the 43.3% AIME score stems from the diagnosed factors rather than unstated differences.

Revision: yes

Referee: The diagnosis of GRPO's length bias (especially for incorrect outputs) is presented as a key insight motivating Dr. GRPO, but the supporting analysis lacks reported statistical tests, multiple random seeds, or quantitative comparison of length distributions before/after the correction across the full set of base models.

Authors: We acknowledge the value of greater statistical rigor. We have rerun the relevant experiments with 5 random seeds and will report means, standard deviations, and paired t-test p-values for length differences between correct and incorrect outputs. The revision will include quantitative before/after length distribution comparisons (including histograms and summary statistics) across all base models tested. These additions provide stronger quantitative support for the bias diagnosis and Dr. GRPO's corrective effect while preserving the original mechanistic insights.

Revision: yes
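
The paired comparison described in this response could look roughly like the sketch below. It assumes one mean response length per seed for incorrect outputs and one for correct outputs, paired by seed. The function name is hypothetical, and the five values per list are placeholders chosen for easy arithmetic, not measurements from the paper or its code.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample std (n - 1).
    std_err = stdev(diffs) / sqrt(n)
    return mean(diffs) / std_err, n - 1


# Placeholder per-seed mean lengths (arbitrary units), NOT real results.
incorrect_len = [10.0, 12.0, 11.0, 13.0, 14.0]
correct_len = [8.0, 9.0, 10.0, 10.0, 11.0]

t_stat, dof = paired_t_statistic(incorrect_len, correct_len)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value requires the Student t distribution with `dof` degrees of freedom, which the Python standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. With only five seeds the test has little power, so reporting the per-seed differences alongside the p-value would be more informative.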

Circularity Check

0 steps flagged

No circularity: empirical analysis and new method are self-contained

The paper conducts empirical investigations across base models, diagnoses a length bias in GRPO, introduces Dr. GRPO as an unbiased alternative, and validates a minimalist training recipe via direct experiments on AIME 2024. No equations, predictions, or central claims reduce to fitted inputs or self-citations by construction. The work is benchmarked against external results and releases public code, satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study; no explicit free parameters, axioms, or invented entities are introduced beyond standard RL training components and the Dr. GRPO modification of existing GRPO.

pith-pipeline@v0.9.0 · 5532 in / 1114 out tokens · 54434 ms · 2026-05-11T04:20:27.690359+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
cs.CV 2026-05 unverdicted novelty 7.0

CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
Tracing Uncertainty in Language Model "Reasoning"
cs.LG 2026-05 unverdicted novelty 7.0

Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
cs.LG 2026-05 unverdicted novelty 7.0

HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
cs.HC 2026-05 unverdicted novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
cs.LG 2026-04 unverdicted novelty 7.0

EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
cs.AI 2026-04 unverdicted novelty 7.0

A multi-agent binary reward system with unbiased GRPO post-training on ICLR-320 data outperforms baselines on expert-rated novelty, feasibility, and effectiveness for scientific idea generation.
Foresight Optimization for Strategic Reasoning in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

FoPO trains LLMs for strategic reasoning by combining self-interest with opponent modeling in policy optimization, yielding gains on two new datasets and better out-of-domain generalization than standard baselines.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI 2026-04 unverdicted novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
cs.CL 2026-04 unverdicted novelty 7.0

AsymGRPO refines policy entropy in RLVR by preserving informative entropy on positive rollouts and suppressing spurious entropy on negative ones, outperforming baselines.
DeonticBench: A Benchmark for Reasoning over Rules
cs.CL 2026-04 unverdicted novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
cs.LG 2026-01 unverdicted novelty 7.0

A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
Teacher-Guided Policy Optimization for LLM Distillation
cs.LG 2026-05 unverdicted novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Hölder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
cs.HC 2026-05 unverdicted novelty 6.0

UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
cs.LG 2026-05 unverdicted novelty 6.0

FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
G-Zero: Self-Play for Open-Ended Generation from Zero Data
cs.LG 2026-05 unverdicted novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
cs.CL 2026-05 unverdicted novelty 6.0

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
cs.LG 2026-05 conditional novelty 6.0

Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed sign advantage A=2r-1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4.
Gradient Extrapolation-Based Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
cs.LG 2026-05 unverdicted novelty 6.0

LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Binary Rewards and Reinforcement Learning: Fundamental Challenges
cs.LG 2026-05 unverdicted novelty 6.0

Binary rewards make the set of reward-maximizing policies infinite in policy gradients; KL control selects the filtered base model but misspecification drives collapse to concentrated valid outputs instead.
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
cs.CL 2026-05 unverdicted novelty 6.0

A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
cs.CL 2026-05 unverdicted novelty 6.0

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space
cs.IR 2026-04 unverdicted novelty 6.0

GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baseli...
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
cs.CL 2026-04 unverdicted novelty 6.0

TRN-R1-Zero is an RL-only post-training method that lets LLMs perform zero-shot node, edge, and graph reasoning on text-rich networks without supervised data or larger-model distillation.
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG 2026-04 unverdicted novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
cs.LG 2026-04 unverdicted novelty 6.0

AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Calibration-Aware Policy Optimization for Reasoning LLMs
cs.LG 2026-04 unverdicted novelty 6.0

CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.
Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
cs.LG 2026-04 unverdicted novelty 6.0

Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
cs.LG 2026-04 unverdicted novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

SPPO enables stable, sample-efficient alignment of LLMs on long-horizon reasoning tasks by using a decoupled scalar value function for low-variance advantages without multi-sampling.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.