pith · machine review for the scientific record

arxiv: 2605.07865 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

KL for a KL: On-Policy Distillation with Control Variate Baseline

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords on-policy distillation · control variate baseline · variance reduction · policy gradient · large language models · knowledge distillation · reinforcement learning

The pith

vOPD stabilizes on-policy distillation by subtracting a closed-form per-token negative reverse KL value function as a detached baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

On-policy distillation for large language models is unstable because its single-sample Monte Carlo gradient estimator has high variance. The paper frames OPD as a policy-gradient reinforcement learning problem and derives an exact value function for it. This value function equals the per-token negative reverse KL divergence between the student and teacher, which is already computed in the standard forward pass. Subtracting this quantity as a detached baseline reduces variance while preserving an unbiased estimator and the original lightweight sampling procedure. A further top-k truncation of the baseline cuts cost with no loss in performance, and the resulting method beats vanilla OPD while matching the accuracy of far more expensive full-vocabulary baselines on reasoning benchmarks.
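To pin down what the summary above is describing, here is one self-consistent reading of the estimator in editorial notation (π_θ for the student, π_T for the teacher, s_t for the sampled prefix, sg[·] for stop-gradient). The symbols and the sign convention are editorial assumptions, not the paper's own, and the referee report below presses on exactly that choice.

```latex
% Editorial notation, not the paper's: \pi_\theta = student, \pi_T = teacher, s_t = sampled prefix.
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)\Big]
  && \text{(OPD objective, minimized)} \\
r(a_t, s_t) &= -\big[\log \pi_\theta(a_t \mid s_t) - \log \pi_T(a_t \mid s_t)\big]
  && \text{(single-sample per-token signal)} \\
b(s_t) &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\big[r(a, s_t)\big]
        = -\,\mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)
  && \text{(closed-form baseline from the same forward pass)} \\
\hat{A}(a_t, s_t) &= r(a_t, s_t) - \mathrm{sg}\big[\,b(s_t)\,\big]
  && \text{(detached baseline subtracted, as in vOPD)}
\end{align*}
```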

Core claim

By recasting on-policy distillation as policy-gradient RL, the paper shows that the associated value function is exactly the per-token negative reverse KL divergence between student and teacher policies. Because this quantity is produced by the ordinary forward pass, it can be used directly as a control-variate baseline that is subtracted from the single-sample reward signal, lowering gradient variance without bias, without extra critics, and without computing the full-vocabulary reverse KL.

What carries the argument

The control-variate baseline formed by the per-token negative reverse KL divergence between student and teacher, which is detached from the gradient and subtracted from the Monte Carlo estimator.
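A minimal PyTorch-style sketch of how such a detached baseline could be wired into the single-sample estimator. The tensor names, the sign convention, and the plain score-function surrogate are editorial assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vopd_token_loss(student_logits, teacher_logits, sampled_ids):
    """Per-token OPD loss with a detached reverse-KL baseline (editorial sketch).

    student_logits, teacher_logits: [batch, seq, vocab] logits from the forward passes
    that on-policy distillation already runs; sampled_ids: [batch, seq] tokens sampled
    on-policy from the student. Names and sign conventions are assumptions, not the paper's code.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Single-sample signal at the sampled token: negative student-minus-teacher log-ratio,
    # so a larger value means the student already agrees with the teacher there.
    idx = sampled_ids.unsqueeze(-1)
    log_ratio = torch.gather(student_logp - teacher_logp, -1, idx).squeeze(-1)
    reward = -log_ratio

    # Closed-form baseline: per-token negative reverse KL over the full vocabulary,
    # read off the same logits and detached so it acts purely as a control variate.
    reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    baseline = -reverse_kl

    # REINFORCE-style surrogate: detach the centred signal and weight the sampled log-prob.
    # (How the reward's own dependence on theta is handled is a separate design choice in
    # KL gradient estimators; this sketch takes the plain score-function route.)
    advantage = (reward - baseline).detach()
    sampled_logp = torch.gather(student_logp, -1, idx).squeeze(-1)
    return -(advantage * sampled_logp).mean()
```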

If this is right

  • The single-sample estimator stays unbiased and computationally cheap because the baseline is detached and requires no extra forward passes.
  • A top-k truncation of the baseline further reduces compute while preserving downstream task performance (a sketch of one such truncation follows this list).
  • Training stability improves enough for vOPD to match the accuracy of full-vocabulary reverse-KL methods at lower cost.
  • The same baseline construction applies directly to any OPD run that already computes per-token log-probabilities.
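A sketch of one plausible reading of that top-k truncation, restricting the baseline sum to the student's k most probable tokens. The paper only states that a top-k approximation of the baseline lowers cost without hurting performance, so the exact truncation rule and the names here are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_reverse_kl_baseline(student_logits, teacher_logits, k=64):
    """Approximate per-token negative reverse KL using only the student's top-k tokens.

    Editorial sketch: the probability mass outside the top-k support is simply dropped,
    which is one plausible reading of a "top-k truncation of the baseline".
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Indices of the student's k most probable tokens at each position.
    topk_logp, topk_idx = student_logp.topk(k, dim=-1)
    topk_teacher_logp = torch.gather(teacher_logp, -1, topk_idx)

    # Truncated reverse KL: sum over the top-k support only, then negate and detach.
    truncated_kl = (topk_logp.exp() * (topk_logp - topk_teacher_logp)).sum(dim=-1)
    return (-truncated_kl).detach()
```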

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same closed-form baseline trick could be tested on other single-sample policy-gradient objectives in language-model alignment where the value function also has an analytic expression.
  • If top-k truncation works reliably, it suggests that full-vocabulary KL computations are often unnecessary for variance reduction even outside distillation.
  • The approach leaves open whether learned or adaptive baselines could be added while still avoiding any extra inference cost.

Load-bearing premise

The value function of the on-policy distillation objective admits an exact closed-form expression as the per-token negative reverse KL divergence that can be read off the existing forward pass with no additional computation or model.

What would settle it

An empirical measurement showing that subtracting the proposed baseline either introduces measurable bias into the gradient estimates or leaves their variance unchanged, or that vOPD-trained models fail to outperform vanilla OPD on the mathematical and scientific reasoning benchmarks.
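At toy scale, that kind of bias and variance check can be run in a few lines. The following single-state sketch compares single-sample gradient estimates with and without the closed-form baseline; it is an editorial illustration of the test, not the paper's experiment, and all quantities are invented.

```python
import torch

torch.manual_seed(0)
V = 5                                                        # toy vocabulary size
theta = torch.randn(V, requires_grad=True)                   # toy "student" logits
teacher_logp = torch.log_softmax(torch.randn(V), dim=-1)     # fixed toy teacher

def grad_samples(use_baseline, n=5000):
    """Single-sample score-function gradient estimates, with or without the KL baseline."""
    student_logp = torch.log_softmax(theta, dim=-1)
    reward_all = -(student_logp - teacher_logp).detach()     # negative log-ratio per action
    baseline = -(student_logp.exp() * (student_logp - teacher_logp)).sum().detach()
    grads = []
    for _ in range(n):
        a = torch.multinomial(student_logp.exp(), 1).item()  # on-policy sample from the student
        adv = reward_all[a] - (baseline if use_baseline else 0.0)
        (g,) = torch.autograd.grad(student_logp[a], theta, retain_graph=True)
        grads.append(adv * g)
    g = torch.stack(grads)
    return g.mean(0), g.var(0).sum()

mean_plain, var_plain = grad_samples(use_baseline=False)
mean_base, var_base = grad_samples(use_baseline=True)
print("max |mean difference| (bias check):", (mean_plain - mean_base).abs().max().item())
print("total variance, plain vs. baselined:", var_plain.item(), var_base.item())
```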

Figures

Figures reproduced from arXiv: 2605.07865 by Gyubin Choi, Minjae Oh, Sangjun Song, Yohan Jo, Yunho Choi.

Figure 1. Token-level reward and advantage distributions.
Figure 2. Left: average accuracy for various k on vOPD top-k. Right: mean squared error of the top-k KL baseline relative to the full-vocabulary KL.
Figure 3. Per-step wall-clock time for OPD variants at 1.7B and 4B scale.
Figure 4. Gradient norm throughout training for OPD and vOPD full-V in the Qwen3-1.7B setting.
Figure 5. Scatter plot.
read the original abstract

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces vOPD, a variance-reduced variant of On-Policy Distillation (OPD) for LLMs. It frames the OPD objective J(θ) = E[sum_t KL(π_θ(·|s_t) || π_teacher(·|s_t))] as a policy-gradient RL problem and proposes a control-variate baseline given by the per-token negative reverse KL, which is asserted to be available directly from the forward pass with no extra computation or critic. The method subtracts this detached baseline from the single-sample Monte Carlo estimator to reduce variance while preserving unbiasedness, and further applies a top-k approximation to the baseline for lower cost. Experiments on mathematical and scientific reasoning benchmarks show vOPD outperforming vanilla OPD and matching the performance of the more expensive full-vocabulary KL baseline.

Significance. If the closed-form baseline is correctly derived and the empirical gains are robust, vOPD would supply a lightweight, theoretically grounded stabilization technique for OPD that avoids both the overhead of full-vocabulary KL computation and the bias of restricted-support approximations. The explicit use of RL control-variate machinery inside distillation is a clean idea that could generalize to other on-policy objectives in LLM post-training.

major comments (2)
  1. [Abstract] The claim that 'the OPD value function admits a closed form as the per-token negative reverse KL divergence' is load-bearing for the unbiasedness argument. Under the stated objective, the true value function satisfies the Bellman relation V(s_t) = -KL(s_t) + E[V(s_{t+1}) | s_t] (assuming maximization of -J). Using only the instantaneous term as the baseline is not the value function unless future expected KL terms are identically zero or the derivation explicitly treats the return as myopic; either case requires a detailed derivation showing why the gradient estimator remains unbiased and why variance is reduced.
  2. [Abstract, method description] The paper states that subtracting the (detached) baseline 'keeps the gradient unbiased while reducing variance.' Because the per-token KL is a function of the state s_t only (not the sampled action a_t), any state-dependent baseline preserves unbiasedness in principle. However, if the baseline is set exactly to the instantaneous KL term that appears in the estimator, the resulting multiplier becomes zero for that component, which would nullify the contribution of the current token; the manuscript must clarify the precise form of the return estimator and the subtraction to avoid this degeneracy.
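For readers tracking the first objection, here is the Bellman relation the referee invokes, written out under the per-step reward of -KL used in the comment (an editorial transcription, not a derivation from the paper):

```latex
% Value function of the undiscounted return with per-step reward -KL(s_t):
V(s_t) \;=\; -\,\mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_T(\cdot \mid s_t)\big)
        \;+\; \mathbb{E}_{a_t \sim \pi_\theta,\; s_{t+1}}\big[\,V(s_{t+1}) \mid s_t\,\big]
% The instantaneous term alone equals V(s_t) only when the expected future KL terms vanish
% or the return is treated as myopic, which is the gap the referee asks the derivation to close.
```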
minor comments (3)
  1. The experimental protocol (number of seeds, statistical tests, exact hyper-parameters for the top-k approximation) is not described in sufficient detail to allow reproduction or to assess whether the reported gains are statistically significant.
  2. Ablation results isolating the contribution of the baseline versus the top-k approximation would strengthen the empirical claims; currently only aggregate benchmark numbers are referenced.
  3. Notation for the single-sample estimator and the precise control-variate formula should be given explicitly (e.g., as an equation) rather than described only in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our work. We believe the suggested clarifications will strengthen the presentation of the theoretical foundations of vOPD. We address the major comments point-by-point below and plan to incorporate the necessary revisions.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'the OPD value function admits a closed form as the per-token negative reverse KL divergence' is load-bearing for the unbiasedness argument. Under the stated objective, the true value function satisfies the Bellman relation V(s_t) = -KL(s_t) + E[V(s_{t+1}) | s_t] (assuming maximization of -J). Using only the instantaneous term as the baseline is not the value function unless future expected KL terms are identically zero or the derivation explicitly treats the return as myopic; either case requires a detailed derivation showing why the gradient estimator remains unbiased and why variance is reduced.

    Authors: We thank the referee for this observation. Our derivation shows that the control variate baseline corresponds to the negative reverse KL, which arises as the expected per-token reward when sampling actions from the teacher distribution rather than the student. This choice ensures the baseline is state-dependent only and thus preserves unbiasedness of the policy gradient estimator. Although the standard value function under a myopic reward of -KL would satisfy the Bellman equation the referee cites, our setup uses the pointwise log-ratio as the reward signal for the single-sample estimator. We will include a detailed step-by-step derivation in the revised manuscript that demonstrates the unbiasedness and variance reduction properties explicitly, addressing the myopic vs. non-myopic aspects. revision: yes

  2. Referee: [Abstract, method description] The paper states that subtracting the (detached) baseline 'keeps the gradient unbiased while reducing variance.' Because the per-token KL is a function of the state s_t only (not the sampled action a_t), any state-dependent baseline preserves unbiasedness in principle. However, if the baseline is set exactly to the instantaneous KL term that appears in the estimator, the resulting multiplier becomes zero for that component, which would nullify the contribution of the current token; the manuscript must clarify the precise form of the return estimator and the subtraction to avoid this degeneracy.

    Authors: We appreciate the need for this clarification. In our method, the single-sample estimator uses the pointwise reward r(a_t, s_t) = log(π_θ(a_t|s_t)) - log(π_teacher(a_t|s_t)), and the baseline is the negative reverse KL, which equals the expectation of r under the teacher distribution: b(s_t) = E_{a ~ π_teacher} [r(a, s_t)]. The adjusted multiplier is therefore r(a_t, s_t) - b(s_t), which is nonzero in general since the sampled action a_t is drawn from the student policy, not the teacher. This does not lead to degeneracy. We will update the method section and abstract to explicitly define the return estimator, the baseline, and the subtraction operation to make this clear. revision: yes
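A small numeric sketch of the response's point, using exactly the conventions stated above (r as the student-minus-teacher log-ratio at the sampled token, b(s_t) as the expectation of r under the teacher distribution); the toy distributions are invented for illustration only.

```python
import torch

# Toy per-token distributions over a 4-token vocabulary (invented for illustration).
student_p = torch.tensor([0.70, 0.15, 0.10, 0.05])
teacher_p = torch.tensor([0.40, 0.30, 0.20, 0.10])

log_ratio = student_p.log() - teacher_p.log()          # r(a, s_t) for every action a

# Baseline as defined in the response: expectation of r under the teacher distribution.
b = (teacher_p * log_ratio).sum()

# Sample the action from the *student* policy, as in on-policy distillation.
a_t = torch.multinomial(student_p, num_samples=1).item()
r = log_ratio[a_t]

# The adjusted multiplier is generally nonzero, since a_t is not drawn from the teacher.
print(f"r = {r:.4f}, b = {b:.4f}, r - b = {r - b:.4f}")
```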

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard RL variance reduction to existing OPD terms.

full rationale

The paper identifies the per-token reverse KL (already computed in the distillation forward pass) as the value function for a control-variate baseline and invokes canonical RL theory for unbiased gradient reduction. This identification is presented as a direct mathematical consequence rather than a redefinition of inputs as outputs. No self-citation chains, fitted parameters renamed as predictions, ansatz smuggling, or uniqueness theorems are load-bearing. The top-k baseline approximation is an efficiency step that does not alter the core derivation. The central claim remains independent of its own fitted values and is self-contained against external RL benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL assumptions about unbiased control variates and the domain-specific claim that the OPD objective admits an exact closed-form value function from the forward pass.

axioms (2)
  • domain assumption Casting on-policy distillation as a policy-gradient reinforcement learning problem permits the direct application of control-variate variance reduction.
    This reframing is the foundational step that enables the baseline technique.
  • domain assumption The value function for the OPD objective has a closed-form expression equal to the per-token negative reverse KL divergence between student and teacher.
    This is asserted to be computable from the existing forward pass without extra cost.

pith-pipeline@v0.9.0 · 5557 in / 1455 out tokens · 56658 ms · 2026-05-11T03:23:22.153351+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 26 internal anchors

  1. [1]

    GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2306.13649

  2. [2]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Annual Meeting of the Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.662/

  3. [3]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025. URL https://arxiv.org/abs/2506.13585

  4. [4]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence.

  5. [5]

    URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

  6. [6]

    Sciknoweval: Evaluating multi-level scientific knowledge of large language models

    Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, and Keyan Ding. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024. URL https://arxiv.org/abs/2406.09098

  7. [7]

    Variance reduction techniques for gradient estimates in reinforcement learning

    Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004. URL https://www.jmlr.org/papers/volume5/greensmith04a/greensmith04a.pdf

  8. [8]

    MiniLLM: Knowledge Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2306.08543

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. URL https://arxiv.org/abs/2501.12948

  10. [10]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URL https://arxiv.org/abs/2103.03874

  11. [11]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.09685

  12. [12]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026. URL https://arxiv.org/abs/2601.20802

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024. URL https://arxiv.org/abs/2412.16720

  14. [14]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026. URL https://arxiv.org/abs/2603.07079

  15. [15]

    The art of scaling reinforcement learning compute for llms

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025. URL https://arxiv.org/abs/2510.13786

  16. [16]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms? arXiv preprint arXiv:2603.24472, 2026. URL https://arxiv.org/abs/2603.24472

  17. [17]

    Scaling reasoning efficiently via relaxed on-policy distillation

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026. URL https://arxiv.org/abs/2603.11137

  18. [18]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023. URL https://arxiv.org/pdf/2309.06180

  19. [19]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024. URL https://arxiv.org/abs/2411.15124

  20. [20]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2206.14858

  21. [22]

    URL https://arxiv.org/abs/2604.13016

  22. [23]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2305.20050

  23. [24]

    Part I: Tricks or traps? A deep dive into RL for LLM reasoning

    Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, and Bo Zheng. Tricks or traps? A deep dive into RL for LLM reasoning. In International Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2508.08221

  24. [25]

    On-Policy Distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. URL https://thinkingmachines.ai/blog/on-policy-distillation

  25. [26]

    American Invitational Mathematics Examination, 2026

    Mathematical Association of America. American Invitational Mathematics Examination, 2026. URL https://maa.org/maa-invitational-competitions/

  26. [27]

    American Mathematics Competitions, 2026

    Mathematical Association of America. American Mathematics Competitions, 2026. URL https://maa.org/student-programs/amc/

  27. [28]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016. URL https://arxiv.org/abs/1602.01783

  28. [29]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025. URL https://arxiv.org/abs/2512.13961

  29. [30]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper/2022/hash/b1efde53be364a73914f588...

  30. [31]

    Unlocking on-policy distillation for any model family, 2025

    Carlos Miguel Patiño, Kashif Rasul, Quentin Gallouédec, Ben Burtenshaw, Sergio Paniego, Vaibhav Srivastav, Thibaud Frere, Ed Beeching, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Unlocking on-policy distillation for any model family, 2025. URL https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation

  31. [32]

    Sequence level training with recurrent neural networks

    Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1511.06732

  32. [33]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In Conference on Language Modeling, 2024. URL https://arxiv.org/abs/2311.12022

  33. [34]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016. URL https://arxiv.org/abs/1506.02438

  34. [35]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347

  35. [36]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

  36. [37]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026. URL https://arxiv.org/abs/2601.19897

  37. [38]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

  38. [39]

    On a few pitfalls in KL divergence gradient estimation for RL

    Yunhao Tang and Rémi Munos. On a few pitfalls in kl divergence gradient estimation for rl. arXiv preprint arXiv:2506.09477, 2025. URL https://arxiv.org/abs/2506.09477

  39. [40]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025. URL https://arxiv.org/abs/2501.12599

  40. [41]

    TRL: Transformers Reinforcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  41. [42]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2201.11903

  42. [43]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992. URL https://link.springer.com/content/pdf/10.1007/BF00992696.pdf

  43. [44]

    Single-Stream Policy Optimization

    Zhongwen Xu and Zihan Ding. Single-stream policy optimization. In International Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2509.13232

  44. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

  45. [46]

    Nemotron-Cascade 2: Post-training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

    Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, et al. Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation. arXiv preprint arXiv:2603.19220, 2026. URL https://arxiv.org/abs/2603.19220

  46. [47]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  47. [48]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026. URL https://arxiv.org/abs/2602.15763

  48. [49]

    Fast and effective on-policy distillation from reasoning prefixes

    Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes. arXiv preprint arXiv:2602.15260, 2026. URL https://arxiv.org/abs/2602.15260

  49. [50]

    The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

    Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, and Chien-Sheng Wu. The illusion of certainty: Decoupling capability and calibration in on-policy distillation. arXiv preprint arXiv:2604.16830, 2026. URL https://arxiv.org/abs/2604.16830

  50. [51]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025. URL https://arxiv.org/abs/2507.18071