Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Aditya Grover; Feiyu Chen; Guan Pang; Jing Huang; Mengchen Liu; Siyan Zhao; Zhihui Xie

arXiv: 2601.18734 · v3 · submitted 2026-01-26 · cs.LG · cs.CL

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

Pith reviewed 2026-05-12 03:47 UTC · model grok-4.3

keywords: on-policy self-distillation, large language models, mathematical reasoning, knowledge distillation, LLM reasoning, self-distillation, on-policy learning

The pith

A single LLM can improve its reasoning by distilling from a privileged-context version of itself on its own rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents On-Policy Self-Distillation as a way for one large language model to serve as both teacher and student. The teacher version receives verified reasoning traces as extra context while the student version sees only the question. Training then aligns the two by minimizing per-token divergence on trajectories the student itself generates. This setup uses ground-truth solutions from reasoning datasets without requiring a separate larger teacher model. Experiments on mathematical reasoning benchmarks show gains in performance and token efficiency over both reinforcement learning and standard off-policy distillation.

Core claim

On-Policy Self-Distillation (OPSD) lets a single LLM act simultaneously as teacher and student with different contexts: the teacher policy conditions on privileged information such as verified reasoning traces, the student policy receives only the question, and training minimizes the per-token divergence between the two distributions evaluated on the student's own rollouts, yielding better results on mathematical reasoning benchmarks than reinforcement learning or off-policy distillation.
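
One way to write the objective described above, as a sketch rather than the paper's own notation: here x is the question, s the privileged information (for example a verified reasoning trace), y a rollout sampled from the student, and D a per-token divergence. The symbols, the 1/|y| normalization, and the argument order inside D are assumptions made for illustration. The text above does not say which divergence is used or whether gradients pass through the teacher branch.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,\,s)\sim\mathcal{D}}\;
    \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}
    \left[
      \frac{1}{|y|}\sum_{t=1}^{|y|}
      D\!\left(
        \pi_\theta(\cdot\mid x, s, y_{<t})
        \;\middle\|\;
        \pi_\theta(\cdot\mid x, y_{<t})
      \right)
    \right]
```

Both distributions share the parameters θ and differ only in their conditioning context, and the expectation over y is taken under the student, which is what makes the objective on-policy.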

What carries the argument

On-Policy Self-Distillation (OPSD), the single-model teacher-student setup that supplies privileged reasoning traces to one context and none to the other, then aligns their output distributions via divergence minimization on the student's self-generated trajectories.
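
The setup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `build_contexts` and `opsd_loss`, the prompt templates, and the choice of KL(teacher ‖ student) as the divergence are all assumptions. The logits are supplied by the caller, standing in for two forward passes of the same model over the same student-sampled tokens.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability lists over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def build_contexts(question, verified_trace):
    """Hypothetical prompt templates: only the teacher sees the verified trace."""
    student_ctx = f"Question: {question}\nAnswer:"
    teacher_ctx = f"Question: {question}\nReference solution: {verified_trace}\nAnswer:"
    return student_ctx, teacher_ctx


def opsd_loss(teacher_logits, student_logits):
    """Mean per-token divergence over one student rollout.

    teacher_logits[t] and student_logits[t] are next-token logits at position t
    of the SAME student-generated rollout, computed under the teacher context
    and the student context respectively (assumed to come from one model).
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("both branches must score the same rollout positions")
    per_token = [kl(softmax(t), softmax(s)) for t, s in zip(teacher_logits, student_logits)]
    return sum(per_token) / len(per_token), per_token
```

In a real training loop the student branch would carry gradients and the rollout would be resampled from the current policy at every step; whether the teacher branch is detached is a design choice this sketch does not settle.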

If this is right

The method achieves superior token efficiency compared to reinforcement learning approaches on math reasoning tasks.
It delivers better performance than off-policy distillation methods that rely on separate teacher models.
Ground-truth solutions available in reasoning datasets can be leveraged directly without external models.
The single-model setup reduces the need for maintaining a larger teacher LLM during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Iterating the self-distillation process could enable repeated self-improvement cycles if each round produces stronger verified traces.
The approach might transfer to non-math domains whenever verified step-by-step solutions can be supplied as privileged context.
Because training stays on-policy, the learned policy may generalize more reliably at inference time than off-policy alternatives.

Load-bearing premise

A sufficiently capable LLM can rationalize external privileged reasoning traces and use them to teach its weaker self.

What would settle it

If applying OPSD training produces no accuracy gain or lower accuracy than standard supervised fine-tuning or RL baselines on held-out mathematical reasoning problems, the central claim would be falsified.

Original abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance over off-policy distillation methods. Code repo: https://github.com/siyan-zhao/OPSD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A single-model self-distillation method for LLM reasoning that uses privileged context, but the experimental backing is still light.

The paper introduces On-Policy Self-Distillation, or OPSD, where a single LLM serves as both the teacher and the student. The teacher version gets extra context like verified reasoning traces or ground-truth solutions, while the student only gets the question. They then have the student sample its own trajectories and train by reducing the per-token divergence between the two policies' outputs. This is meant to fix the distribution shift in off-policy distillation without needing a bigger separate teacher model.

What the work does well is to take the on-policy idea and apply it in this self-supervised way, using the privileged information that is often available in reasoning datasets. The results on math benchmarks suggest it can match or beat off-policy methods and use fewer tokens than RL approaches. The public code repository is a good move for anyone wanting to try it out or build on it.

Where it could be stronger is in the supporting evidence. The abstract gives high-level claims but skips over things like statistical significance, run-to-run variation, or detailed ablations that would show how much the context difference is really contributing versus just the on-policy sampling. The assumption that the model can meaningfully 'rationalize' the privileged traces into better token predictions for its weaker self is central, and if that doesn't hold strongly on the sampled trajectories, the whole thing might not add much beyond existing regularization techniques.

This kind of paper is useful for groups working on scaling reasoning in LLMs under compute constraints, where running multiple models or heavy RL is not practical. A reader looking for practical distillation tricks would find the setup worth examining.

Overall, I would send this to peer review. The method is clearly described and the motivation is sound, so getting expert feedback on the experiments would help clarify if the gains are reliable.

Referee Report

3 major / 2 minor

Summary. The paper introduces On-Policy Self-Distillation (OPSD), an algorithm in which a single LLM serves as both teacher and student. The teacher policy conditions on privileged information such as verified reasoning traces while the student policy receives only the question; training minimizes per-token divergence between the two distributions evaluated on the student's own rollouts. The authors report that OPSD yields better performance on mathematical reasoning benchmarks than off-policy distillation baselines and superior token efficiency relative to reinforcement-learning methods.

Significance. If the empirical claims hold under rigorous controls, the work offers a practical route to self-improvement of LLM reasoning that avoids the need for a separate larger teacher model and exploits ground-truth solutions already present in reasoning datasets. The public code repository is a positive factor for reproducibility.

major comments (3)

[§3] §3 (Method): The central modeling assumption—that conditioning the shared-parameter model on privileged reasoning traces produces a teacher distribution that systematically assigns higher probability to correct reasoning steps than the student distribution on the student's own rollouts—is stated but not accompanied by any diagnostic analysis (e.g., per-step probability differences or KL divergence conditioned on correctness). Without such evidence the objective risks reducing to on-policy regularization, undermining the claimed advantage over off-policy distillation.
[§4] §4 (Experiments): Performance gains on GSM8K, MATH, and other benchmarks are presented without error bars, multiple random seeds, or statistical significance tests. In addition, the section does not report ablations that isolate the contribution of the privileged-context teacher (e.g., teacher with ground-truth vs. model-generated traces, or teacher context ablated entirely). These omissions are load-bearing for the superiority claims.
[§4.3] §4.3 (Baselines): The comparison with RL methods reports better token efficiency, yet the paper does not specify the exact metric (training tokens, inference tokens, or total environment interactions) or control for equivalent compute budgets. This detail is required to substantiate the efficiency claim.

minor comments (2)

[Abstract] The abstract states that the method achieves “superior token efficiency compared to reinforcement learning methods,” but the main text should explicitly define the token-efficiency metric and report the corresponding numbers in a table.
[§3.1] Notation for the teacher and student policies (π_teacher and π_student) is introduced without a clear statement of whether parameters are tied or how gradients flow through the shared model during the divergence minimization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions planned for the manuscript to strengthen the presentation and empirical support.

Referee: [§3] §3 (Method): The central modeling assumption—that conditioning the shared-parameter model on privileged reasoning traces produces a teacher distribution that systematically assigns higher probability to correct reasoning steps than the student distribution on the student's own rollouts—is stated but not accompanied by any diagnostic analysis (e.g., per-step probability differences or KL divergence conditioned on correctness). Without such evidence the objective risks reducing to on-policy regularization, undermining the claimed advantage over off-policy distillation.

Authors: We agree that explicit diagnostic evidence would better substantiate the modeling assumption and distinguish OPSD from generic on-policy regularization. In the revised manuscript we will add a diagnostic analysis (new figure or subsection in §3 or the appendix) that reports per-step log-probability differences between the teacher and student distributions, conditioned on whether each reasoning step is correct according to the ground-truth trace. We will also report the average KL divergence separately for correct and incorrect steps on held-out rollouts. These diagnostics will be computed on the same student-generated trajectories used for training.
revision: yes

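
The diagnostic promised in this response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' analysis code: the function name `step_diagnostics` and the record schema are hypothetical, and the sketch assumes that each step has already been labelled correct or incorrect against the ground-truth trace, which is itself a nontrivial judgment.

```python
import math
from collections import defaultdict


def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability lists over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def step_diagnostics(steps):
    """Average teacher-minus-student log-prob gap and KL, split by correctness.

    Each step is a dict (hypothetical schema) with keys:
      'teacher', 'student': probability lists over the same vocabulary
      'token': index of the token the student actually sampled
      'correct': True if the step agrees with the ground-truth trace
    """
    buckets = defaultdict(lambda: {"gap": [], "kl": []})
    for s in steps:
        tok = s["token"]
        # Positive gap: the teacher likes the sampled token more than the student does.
        gap = math.log(s["teacher"][tok]) - math.log(s["student"][tok])
        buckets[s["correct"]]["gap"].append(gap)
        buckets[s["correct"]]["kl"].append(kl(s["teacher"], s["student"]))
    return {
        label: {
            "n": len(b["gap"]),
            "mean_logprob_gap": sum(b["gap"]) / len(b["gap"]),
            "mean_kl": sum(b["kl"]) / len(b["kl"]),
        }
        for label, b in buckets.items()
    }
```

If the modeling assumption holds, one would expect a positive mean gap on correct steps and a negative one on incorrect steps; similar gaps in both buckets would support the referee's concern that the objective acts mainly as on-policy regularization.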
Referee: [§4] §4 (Experiments): Performance gains on GSM8K, MATH, and other benchmarks are presented without error bars, multiple random seeds, or statistical significance tests. In addition, the section does not report ablations that isolate the contribution of the privileged-context teacher (e.g., teacher with ground-truth vs. model-generated traces, or teacher context ablated entirely). These omissions are load-bearing for the superiority claims.

Authors: We acknowledge that reporting variability and isolating the privileged-context component are necessary for rigorous claims. In the revision we will rerun the main experiments with at least three independent random seeds, report means and standard deviations, and include statistical significance tests (paired t-tests) against the strongest baselines. We will also add the requested ablations: (i) teacher conditioned on ground-truth traces versus model-generated traces, and (ii) an ablation in which the teacher receives no privileged context at all. These results will be presented in §4 and the appendix.
revision: yes

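
The seed-level reporting described in this response amounts to a mean, a sample standard deviation, and a paired t statistic over per-seed scores. A minimal standard-library sketch follows; the function names `summarize` and `paired_t` are hypothetical, and the pairing assumes that OPSD and the baseline were run with the same seeds in the same order.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need at least two seed-matched pairs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; the t statistic is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`. With only three seeds there are two degrees of freedom, so the two-sided 5% critical value is about 4.30 and the test has little power; more seeds would make the planned comparison more informative.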
Referee: [§4.3] §4.3 (Baselines): The comparison with RL methods reports better token efficiency, yet the paper does not specify the exact metric (training tokens, inference tokens, or total environment interactions) or control for equivalent compute budgets. This detail is required to substantiate the efficiency claim.

Authors: We will clarify the definition in the revised §4.3: token efficiency is measured as the total number of tokens processed during training (student rollout tokens plus teacher supervision tokens). We will also add a compute-matched comparison by reporting approximate total FLOPs for OPSD and the RL baselines and, where feasible, re-running the RL methods under a matched token or FLOP budget. A new table will summarize these controlled comparisons.
revision: yes
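
The token-efficiency definition given in this response can be made concrete with a small accounting sketch. The names `TokenBudget` and `tokens_to_reach` are hypothetical, and comparing methods by the tokens needed to reach a target accuracy is one possible reading of "efficiency", not something the response specifies.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Tokens processed during training, following the definition above."""
    student_rollout_tokens: int
    teacher_supervision_tokens: int = 0  # assumed zero for a method with no teacher pass

    @property
    def total(self) -> int:
        return self.student_rollout_tokens + self.teacher_supervision_tokens


def tokens_to_reach(curve, target_accuracy):
    """First cumulative token count at which accuracy meets the target.

    curve: iterable of (cumulative_training_tokens, accuracy) checkpoints.
    Returns None if the target is never reached.
    """
    for tokens, accuracy in sorted(curve):
        if accuracy >= target_accuracy:
            return tokens
    return None
```

Calling `tokens_to_reach` on each method's checkpoint curve with the same target gives a directly comparable number, provided both curves count tokens the same way. A FLOP-matched comparison would additionally need to account for forward-only versus forward-and-backward passes.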

Circularity Check

0 steps flagged

No significant circularity; empirical algorithm validated externally

The paper proposes On-Policy Self-Distillation (OPSD) as an empirical training algorithm for LLMs, with the teacher policy conditioning on privileged reasoning traces and the student on the question alone, minimizing per-token KL divergence over student-sampled trajectories. No derivation chain exists that reduces a claimed result to its inputs by construction: performance claims are measured on external mathematical reasoning benchmarks (e.g., GSM8K, MATH) against RL and off-policy baselines, with code released for reproduction. The motivating intuition about rationalizing privileged traces is stated explicitly as inspiration rather than a self-referential axiom, and no self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked to force the central result. The method is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an LLM can effectively use privileged reasoning traces to provide useful token-level supervision to its own weaker policy; no explicit free parameters or invented entities are named in the abstract, and axioms are standard LLM capabilities.

axioms (1)

domain assumption: A sufficiently capable LLM can rationalize external privileged reasoning traces and provide useful supervision to its weaker self.
Explicitly stated as the core intuition enabling the single-model setup.

pith-pipeline@v0.9.0 · 5521 in / 1325 out tokens · 40912 ms · 2026-05-12T03:47:35.543653+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear

We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance over off-policy distillation methods.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
cs.AI 2026-05 unverdicted novelty 7.0

EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
cs.CV 2026-05 unverdicted novelty 7.0

GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
cs.LG 2026-05 conditional novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
eess.IV 2026-05 unverdicted novelty 7.0

Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Learning from Language Feedback via Variational Policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 7.0

GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
cs.LG 2026-05 unverdicted novelty 7.0

RLRT augments GRPO by reinforcing tokens on correct student rollouts that the teacher would not have predicted, outperforming standard self-distillation and exploration baselines on Qwen3 models.
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL 2026-05 unverdicted novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 7.0

Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
cs.CL 2026-05 unverdicted novelty 7.0

LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
cs.LG 2026-05 unverdicted novelty 7.0

PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
cs.LG 2026-04 unverdicted novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Self-Distilled RLVR
cs.LG 2026-04 unverdicted novelty 7.0

RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
cs.AI 2026-03 conditional novelty 7.0

PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Life-Harness evolves reusable runtime interventions from training failures to improve frozen LLM agents by 88.5% on average across 126 settings in seven deterministic environments while transferring across 18 model backbones.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO computes token-level advantages via Bayesian recursion on oracle signals, recovering distillation methods as a special case and improving over GRPO on math and code benchmarks.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
cs.LG 2026-05 conditional novelty 6.0

On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME...
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
cs.CV 2026-05 unverdicted novelty 6.0

Vision-OPD uses on-policy self-distillation from crop-conditioned to full-image policies within the same MLLM to close the regional-to-global perception gap.
Post-Trained MoE Can Skip Half Experts via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
cs.CL 2026-05 unverdicted novelty 6.0

MixSD achieves superior memorization-retention trade-off in knowledge injection by using mixed self-generated supervision from the base model's conditionals, retaining up to 100% held-out capability versus 1% for stan...
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
cs.CL 2026-05 unverdicted novelty 6.0

MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and b...
VSPO: Vector-Steered Policy Optimization for Behavioral Control
cs.LG 2026-05 unverdicted novelty 6.0

VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
cs.LG 2026-05 conditional novelty 6.0

On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
Revisiting DAgger in the Era of LLM-Agents
cs.LG 2026-05 conditional novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
cs.CL 2026-05 conditional novelty 6.0

OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
cs.LG 2026-05 unverdicted novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
GRAFT: Graph-Tokenized LLMs for Tool Planning
cs.LG 2026-05 unverdicted novelty 6.0

GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
cs.CV 2026-05 unverdicted novelty 6.0

A new distillation method uses token-wise salient reasoning-prefix masking and self-paced scheduling to anchor student VLM thinking on visual inputs, outperforming prior distillation approaches on multimodal reasoning...
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
cs.LG 2026-05 unverdicted novelty 6.0

ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing d...
ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage
cs.LG 2026-05 unverdicted novelty 6.0

ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 79 Pith papers · 21 internal anchors

[1]

Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q

URL https://hkunlp.github.io/blog/2025/Polaris. Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. In International Conference on Machine Learning, pp. 6621–6642. PMLR,

work page 2025
[2]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., and Ma, Y. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161,

work page internal anchor Pith review arXiv
[3]

OpenThoughts: Data Recipes for Reasoning Models

URL https://arxiv.org/abs/2506.04178. Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998,

work page internal anchor Pith review arXiv
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

URL https://arxiv.org/abs/1503.02531. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

URL https://openreview.net/forum?id=nZeVKeeFYf9. Huan, M., Li, Y., Zheng, T., Xu, X., Kim, S., Du, M., Poovendran, R., Neubig, G., and Yue, X. Does math reasoning improve general LLM capabilities? Understanding transferability of LLM reasoning. arXiv preprint arXiv:2507.00432,

work page internal anchor Pith review arXiv
[7]

Reinforcement Learning via Self-Distillation

Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C., and Krause, A. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,

work page internal anchor Pith review arXiv
[8]

and Rush, A

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327,

work page 2016
[9]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

20251026

doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Mitra, P. and Ulukus, S. Semantic soft bootstrapping: Long context reasoning in LLMs without reinforcement learning. arXiv preprint arXiv:2512.05105,

work page doi:10.64434/tml
[12]

s1: Simple test-time scaling

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Magistral. arXiv preprint arXiv:2506.10910, 2025

Rastogi, A., Jiang, A. Q., Lo, A., Berrada, G., Lample, G., Rute, J., Barmentlo, J., Yadav, K., Khandelwal, K., Chandu, K. R., et al. Magistral. arXiv preprint arXiv:2506.10910,

work page arXiv
[14]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[15]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Self-Distillation Enables Continual Learning

URL https://arxiv.org/abs/2601.19897. Snell, C., Klein, D., and Zhong, R. Learning by distilling context. arXiv preprint arXiv:2209.15189,

work page internal anchor Pith review arXiv
[18]

Kimi K2: Open Agentic Intelligence

Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Qwen3 Technical Report

Team, O. Open Thoughts. https://open-thoughts.ai, January 2025a. Team, Q. Qwen3 technical report, 2025b. URL https://arxiv.org/abs/2505.09388. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

MiMo-V2-Flash Technical Report

URL https://arxiv.org/abs/2601.02780. Xu, W., Han, R., Wang, Z., Le, L., Madeka, D., Li, L., Wang, W. Y., Agarwal, R., Lee, C.-Y., and Pfister, T. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. In The Thirteenth International Conference on Learning Representations, 2024a. Xu, X., Li, M., Tao, C., Shen...

work page internal anchor Pith review arXiv
[21]

LIMO: Less is More for Reasoning

URL https://arxiv.org/abs/2502.03387. Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A., and Liu, W. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284,

work page internal anchor Pith review arXiv
[22]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yue, Y., Yuan, Y., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118,

work page internal anchor Pith review arXiv
[24]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhang, Z., Zheng, C., Wu, Y., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., and Lin, J. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301,

work page internal anchor Pith review arXiv
[25]

Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Zhao, S., Liu, M., Huang, J., Liu, M., Wang, C., Liu, B., Tian, Y., Pang, G., Bell, S., Grover, A., et al. Inpainting-guided policy optimization for diffusion large language models. arXiv preprint arXiv:2509.10396,

work page arXiv
[26]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
