Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Anhao Zhao; Haoran Xin; Junlong Tong; Wenjie Li; Xiaoyu Shen; Yingqi Fan

arxiv: 2605.16826 · v1 · pith:LNW3CB62new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-19 20:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: LLM distillation, knowledge distillation, KL divergence, on-policy distillation, off-policy distillation, SFT, offline RL, math reasoning

The pith

Decoupling prefix source from token-level KL direction reveals four distinct LLM distillation objectives that unify SFT, DAgger, offline RL, and OPD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard off-policy and on-policy distillation methods for large language models unintentionally tie together two separate design choices: the source of response prefixes and the direction of token-level KL divergence. Breaking down the overall sequence-level KL between teacher and student distributions demonstrates that forward KL naturally pairs with teacher prefixes while reverse KL pairs with student prefixes. Removing this coupling produces four separate objectives, each matching a known training paradigm through explicit gradient identities. Experiments on math reasoning tasks identify clear tradeoffs among these choices and motivate two practical refinements, KL mixing and an entropy-gated length curriculum, that improve accuracy and efficiency.

Core claim

Decomposing sequence-level KL divergence over autoregressive response distributions shows that off-policy distillation pairs teacher prefixes with token-level forward KL while on-policy distillation pairs student prefixes with token-level reverse KL. This implicit coupling is unnecessary; separating the two axes yields four valid objectives whose gradients recover SFT-style cross-entropy matching, DAgger-style on-policy correction, offline-RL-style dense reward signals, and on-policy distillation. Controlled experiments confirm that KL direction controls an accuracy-entropy tradeoff, prefix source controls a quality-compute tradeoff, and training horizon controls an accuracy-stability tradeoff.

What carries the argument

Decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes over autoregressive teacher and student distributions.
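
As a sketch of that decomposition, write p_T for the teacher, q_θ for the student, x for the prompt, and s_t = (x, y_{<t}) for the response prefix at step t. The fixed response length n, the prefix distributions d_t^T and d_t^θ (prefixes produced by teacher and student rollouts), and the labels a and b are notation assumed here for illustration and may differ from the paper's. The chain rule for KL over autoregressive factorizations gives the two matched pairings, and the decoupled family follows by choosing the prefix source and the token-level direction separately:

```latex
\begin{align}
\mathrm{KL}\big(p_T \,\|\, q_\theta\big)(x)
  &= \sum_{t=1}^{n} \mathbb{E}_{s_t \sim d_t^{T}}
     \Big[\mathrm{KL}\big(p_T(\cdot \mid s_t) \,\|\, q_\theta(\cdot \mid s_t)\big)\Big], \\
\mathrm{KL}\big(q_\theta \,\|\, p_T\big)(x)
  &= \sum_{t=1}^{n} \mathbb{E}_{s_t \sim d_t^{\theta}}
     \Big[\mathrm{KL}\big(q_\theta(\cdot \mid s_t) \,\|\, p_T(\cdot \mid s_t)\big)\Big], \\
\mathcal{L}_{a,b}(\theta)
  &= \sum_{t=1}^{n} \mathbb{E}_{s_t \sim d_t^{a}}
     \big[D_b(s_t)\big],
  \qquad a \in \{T, \theta\},\; b \in \{\mathrm{fwd}, \mathrm{rev}\}, \\
D_{\mathrm{fwd}}(s_t) &= \mathrm{KL}\big(p_T(\cdot \mid s_t) \,\|\, q_\theta(\cdot \mid s_t)\big),
\qquad
D_{\mathrm{rev}}(s_t) = \mathrm{KL}\big(q_\theta(\cdot \mid s_t) \,\|\, p_T(\cdot \mid s_t)\big).
\end{align}
```

Only the pairs (a, b) = (T, fwd) and (θ, rev) coincide with a sequence-level KL; the two crossed pairs are separate objectives.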

If this is right

Forward KL with teacher prefixes recovers SFT-style cross-entropy matching to teacher soft targets.
Reverse KL with student prefixes recovers an RL-style policy-gradient update using the teacher-student log-ratio as a dense reward.
KL direction induces a measurable accuracy-entropy tradeoff in reasoning tasks.
Prefix source induces a measurable quality-compute tradeoff during data generation.
Training length induces an accuracy-stability tradeoff that can be mitigated by curriculum design.
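
The pairing between prefix source and KL direction can be checked numerically on a toy model. The sketch below assumes a two-token vocabulary, length-two responses, and made-up next-token tables (`TEACHER`, `STUDENT`); none of these come from the paper. It computes each of the four objectives exactly by enumeration and compares the matched ones with the sequence-level KL.

```python
import itertools
import math

VOCAB = (0, 1)
LENGTH = 2

# Hypothetical next-token tables: prefix tuple -> probabilities over VOCAB.
TEACHER = {(): [0.7, 0.3], (0,): [0.9, 0.1], (1,): [0.2, 0.8]}
STUDENT = {(): [0.5, 0.5], (0,): [0.6, 0.4], (1,): [0.5, 0.5]}


def seq_prob(model, seq):
    """Probability of a token sequence (or prefix) under an autoregressive model."""
    prob = 1.0
    for t, tok in enumerate(seq):
        prob *= model[tuple(seq[:t])][tok]
    return prob


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_kl(p_model, q_model):
    """Exact sequence-level KL(p || q), enumerating every full response."""
    total = 0.0
    for seq in itertools.product(VOCAB, repeat=LENGTH):
        p = seq_prob(p_model, seq)
        q = seq_prob(q_model, seq)
        total += p * math.log(p / q)
    return total


def objective(prefix_source, direction):
    """Sum over steps of the expected token-level KL along the chosen prefixes."""
    rollout = TEACHER if prefix_source == "teacher" else STUDENT
    total = 0.0
    for t in range(LENGTH):
        for prefix in itertools.product(VOCAB, repeat=t):
            weight = seq_prob(rollout, prefix)  # probability of reaching this prefix
            if direction == "forward":
                token_kl = kl(TEACHER[prefix], STUDENT[prefix])
            else:
                token_kl = kl(STUDENT[prefix], TEACHER[prefix])
            total += weight * token_kl
    return total


if __name__ == "__main__":
    for source in ("teacher", "student"):
        for direction in ("forward", "reverse"):
            print(source, direction, round(objective(source, direction), 4))
    print("sequence forward KL", round(sequence_kl(TEACHER, STUDENT), 4))
    print("sequence reverse KL", round(sequence_kl(STUDENT, TEACHER), 4))
```

On these tables the teacher-prefix forward objective matches the sequence-level forward KL and the student-prefix reverse objective matches the sequence-level reverse KL, while the two crossed objectives match neither.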

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be applied to other sequence-level divergences to generate additional distillation variants beyond the four objectives.
KL mixing weights could be scheduled dynamically based on current response entropy rather than fixed ratios.
The entropy-gated curriculum may generalize to non-reasoning generation tasks where length inflation harms latency.
Initializations from the four objectives could be tested for their effect on sample efficiency in subsequent reinforcement learning stages.
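
The entropy-scheduled mixing idea above could look roughly like the following. This is a hypothetical sketch, not the paper's KL mixing rule: it assumes the mix is a convex combination of token-level forward and reverse KL, and the function names, the linear schedule, and the `low`/`high` thresholds are all invented for illustration. The schedule shifts weight toward forward KL as student entropy falls, in line with the abstract's remark that forward-KL weight helps prevent entropy collapse.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_weight(student_probs, low=0.2, high=1.0):
    """Hypothetical schedule: more forward-KL weight as student entropy drops.

    Returns 1.0 at or below `low` nats, 0.0 at or above `high` nats, and
    interpolates linearly in between. The thresholds are illustrative only.
    """
    h = entropy(student_probs)
    frac = (high - h) / (high - low)
    return min(1.0, max(0.0, frac))


def mixed_kl(teacher_probs, student_probs, alpha):
    """alpha * KL(teacher || student) + (1 - alpha) * KL(student || teacher)."""
    fwd = kl(teacher_probs, student_probs)
    rev = kl(student_probs, teacher_probs)
    return alpha * fwd + (1.0 - alpha) * rev


if __name__ == "__main__":
    teacher = [0.6, 0.3, 0.1]
    for student in ([0.98, 0.01, 0.01], [0.7, 0.2, 0.1], [0.4, 0.3, 0.3]):
        alpha = forward_weight(student)
        print(student, round(alpha, 3), round(mixed_kl(teacher, student, alpha), 4))
```

A fixed `alpha` recovers a constant mixing ratio; setting it to 1.0 or 0.0 recovers pure forward or pure reverse token-level KL.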

Load-bearing premise

The decomposition of sequence-level KL divergence into independent prefix-source and token-level KL-direction axes is valid and produces four distinct, usable objectives without hidden inconsistencies or additional constraints.

What would settle it

A controlled run in which one of the four decoupled objectives produces training dynamics or final performance that cannot be recovered from the claimed gradient identities for forward or reverse KL.

Figures

Figures reproduced from arXiv: 2605.16826 by Anhao Zhao, Haoran Xin, Junlong Tong, Wenjie Li, Xiaoyu Shen, Yingqi Fan.

**Figure 1.** Overview of our decoupled distillation framework. Conventional objectives couple prefix source with KL direction; decoupling these axes yields four objectives that correspond to classical training regimes. Off-policy distillation aligns the student’s token-level output distribution with the teacher’s along teacher-generated prefixes. In practice, this is often implemented as supervised fine-tuning (SFT) on…

**Figure 2.** Distillation training dynamics with Qwen3-4B as teacher and Qwen3-0.6B as student.

**Figure 3.** RL training dynamics after Qwen3-4B-teacher distillation warmup. Top/bottom rows: …

**Figure 4.** Distillation training dynamics with Qwen3-8B as teacher and Qwen3-0.6B as student.

**Figure 5.** RL dynamics after Qwen3-8B-teacher warmup. Top/bottom rows: 128/4096-token …

**Figure 7.** Training dynamics of KL mixing on MATH500. Columns, left to …

Original abstract

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
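
The abstract names an entropy-gated length curriculum, but this page does not spell out the gating rule. The following is a minimal sketch of one plausible reading, not the authors' method: the response-length cap grows only while the student's mean token entropy stays above a floor, and holds otherwise. The 128 and 4096 token endpoints echo the horizons named in the Figure 5 caption; the entropy floor, the doubling factor, and both function names are hypothetical.

```python
def next_length_cap(current_cap, mean_entropy, entropy_floor=0.3,
                    growth=2, max_cap=4096):
    """Hypothetical gate: grow the cap only while entropy stays above a floor."""
    if mean_entropy < entropy_floor:
        return current_cap  # hold the horizon: entropy looks collapsed
    return min(max_cap, current_cap * growth)


def run_schedule(entropy_trace, start_cap=128):
    """Return the length cap used at each stage, given per-stage mean entropies."""
    caps = []
    cap = start_cap
    for mean_entropy in entropy_trace:
        caps.append(cap)  # cap in force during this stage
        cap = next_length_cap(cap, mean_entropy)
    return caps


if __name__ == "__main__":
    # Placeholder entropy readings, one per training stage (not measured data).
    trace = [0.9, 0.8, 0.2, 0.7, 0.6, 0.6, 0.6]
    print(run_schedule(trace))
```

With the placeholder trace above, the cap pauses at 512 for one stage when entropy dips below the floor, then resumes doubling until it reaches 4096.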

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper cleanly separates prefix source from token KL direction to link four distillation objectives via gradient identities, with useful experiments on math reasoning, though the mismatched combinations need tighter checks on what they actually optimize.

The paper's main contribution is a decomposition that separates the choice of prefix source from the direction of the token-level KL. This split turns the usual off-policy and on-policy distillation into four separate objectives, each linked by gradient identities to SFT, DAgger, offline RL, and OPD. The authors do a good job establishing those gradient identities from the autoregressive structure. Forward KL with teacher prefixes reduces to cross-entropy matching with soft targets, while reverse KL with student prefixes gives a dense log-ratio reward for a policy-gradient style update.

Their controlled experiments on math reasoning tasks compare the four objectives as standalone methods and as initializations for later RL. This leads to identifying three tradeoffs and proposing KL mixing to balance long-sequence training plus an entropy-gated curriculum. The curriculum improves Avg@k by 3.6 points and Pass@k by up to 5.8 points while cutting average response length by a factor of about three.

The math part follows from standard properties of sequence KL in autoregressive models, so it holds up without circularity. The experiments appear to have enough controls to make the tradeoff claims credible, and the practical suggestions show measurable benefits.

The softer area is the treatment of the mismatched combinations. When prefix source and KL direction do not align, the resulting objective does not equal the sequence-level forward or reverse KL. The paper presents all four as valid and usable, but if the full derivations do not include importance sampling or stability analysis for the crossed cases, readers might question whether gradients remain unbiased. This is not a load-bearing flaw for the whole paper, but it is worth clarifying.

This is for researchers in LLM distillation and post-training who are looking for a more organized way to select objectives, especially when preparing models for RL stages. A reader interested in the design space would find the framework helpful. I recommend sending it for peer review. The unifying perspective and the empirical results are substantive enough to merit referee feedback.

Referee Report

2 major / 2 minor

Summary. The paper claims that sequence-level KL divergence over autoregressive distributions decomposes into orthogonal choices of prefix source (teacher vs. student) and token-level KL direction (forward vs. reverse). This decomposition reveals that standard off-policy distillation and on-policy distillation (OPD) implicitly couple these axes, while decoupling them produces four distinct objectives. Gradient-level identities are derived showing that forward KL recovers SFT-style cross-entropy with soft targets and reverse KL recovers an RL-style policy gradient with dense log-ratio rewards, thereby unifying SFT, DAgger, offline RL-style distillation, and OPD. A controlled empirical study on math reasoning evaluates the four objectives standalone and as RL initializations, identifies accuracy-entropy, quality-compute, and accuracy-stability tradeoffs, and proposes KL mixing plus an entropy-gated length curriculum that improves Avg@k and Pass@k while reducing response length.

Significance. If the gradient identities and validity of the four decoupled objectives hold without hidden inconsistencies, the work supplies a principled taxonomy that connects distillation to RL and supplies practical levers (KL mixing, entropy-gated curriculum) for balancing accuracy, diversity, and compute in reasoning models. The controlled study and proposed methods are concrete strengths that could guide future post-training design.

major comments (2)

[§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.
[Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.

minor comments (2)

[§3] Notation for the four objectives should be introduced with a single summary table or diagram early in §3 to make the prefix/KL-direction axes immediately legible.
[§3] The abstract states 'we establish gradient-level identities'; the main text should include the precise gradient expressions (with expectations and baselines) rather than only high-level descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments help strengthen the theoretical clarity around the decomposition and improve the reproducibility of the experimental results. We address each major comment point by point below.

Referee: [§3] §3 (decomposition and gradient identities): The matched cases (teacher prefixes + forward token KL; student prefixes + reverse token KL) correctly recover sequence-level forward and reverse KL. However, the decoupled case of teacher prefixes with reverse token KL yields E[prefix ~ teacher] KL(student token || teacher token), which equals neither sequence-level KL nor its reverse. The manuscript must explicitly derive whether this objective's gradients are unbiased estimators of any target or whether importance-sampling corrections are required; without this, the claim that all four combinations are 'valid objectives' rests on an unverified assumption.

Authors: We acknowledge that the teacher-prefix + reverse-token-KL combination does not recover the full sequence-level reverse KL. This objective is nevertheless well-defined and distinct, corresponding to an off-policy reverse-KL variant that uses teacher-generated prefixes while applying token-level reverse KL. In the revised manuscript we will add an explicit gradient derivation in §3 for all four combinations. For the teacher-prefix + reverse case the gradient reduces to an unbiased estimator of the expected dense log-ratio reward under the fixed teacher prefix distribution; no additional importance-sampling correction is required beyond the standard on-policy token sampling. This derivation confirms that each of the four objectives possesses a valid, optimizable gradient and supports their status as distinct targets.
revision: yes

Referee: [Experimental section] Experimental section and appendix (controlled study): The abstract and results claim three tradeoffs and improvements from KL mixing and the entropy-gated curriculum (3.6–5.8 points on Avg@k/Pass@k, ~3× length reduction). Full details on data exclusion rules, exact hyperparameter grids, and statistical significance tests are needed to verify that the observed differences are attributable to the decoupled objectives rather than confounding factors in the math-reasoning setup.

Authors: We agree that expanded experimental details are required for full reproducibility and to rule out confounds. In the revised version we will augment both the main experimental section and the appendix with: (i) the precise data exclusion rules applied to the math-reasoning benchmarks, (ii) the complete hyperparameter grid (learning rates, batch sizes, KL coefficients, entropy thresholds, and curriculum schedules) together with the final selected values, and (iii) statistical significance results including standard errors over multiple random seeds and paired tests (e.g., bootstrap confidence intervals or Wilcoxon signed-rank) on the reported Avg@k and Pass@k gains. These additions will allow readers to attribute the observed accuracy, length, and stability improvements directly to the proposed objectives and methods.
revision: yes
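
One of the paired tests mentioned in the response, a bootstrap confidence interval, can be sketched in a few lines. This is an illustrative percentile bootstrap over per-problem paired differences, assuming each problem yields one score per method; the function name, defaults, and the example scores are placeholders, not the authors' evaluation code or data.

```python
import random


def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000,
                        confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean paired difference (treatment - baseline)."""
    if len(baseline) != len(treatment):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)  # seeded for reproducibility
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1.0 - confidence) / 2.0 * n_resamples)
    hi_idx = int((1.0 + confidence) / 2.0 * n_resamples) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Placeholder per-problem accuracies, not results from the paper.
    fixed_horizon = [0.25, 0.50, 0.00, 0.75, 0.50, 0.25, 1.00, 0.50]
    curriculum = [0.50, 0.50, 0.25, 0.75, 0.75, 0.25, 1.00, 0.75]
    mean_diff, low, high = paired_bootstrap_ci(fixed_horizon, curriculum)
    print(mean_diff, low, high)
```

An interval that excludes zero would suggest the gain is not explained by problem-level sampling noise alone; seed-to-seed variance would still need the multi-seed runs the authors describe.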

Circularity Check

0 steps flagged

KL decomposition and gradient identities follow from standard autoregressive properties

The paper derives its four objectives by applying the chain rule to sequence-level KL divergence over autoregressive token distributions, yielding the known pairings (teacher prefixes with forward token KL; student prefixes with reverse token KL) plus the two decoupled combinations. These identities are standard consequences of the definition of KL for product distributions and do not rely on fitted parameters, self-referential definitions, or load-bearing self-citations. The connections to SFT-style cross-entropy and RL-style policy gradients are obtained by direct expansion of the resulting objectives, which remain mathematically independent of the target results. The empirical study on math reasoning evaluates the objectives as standalone methods and RL initializations without circular reduction to inputs. No step reduces by construction to its own outputs or to unverified prior work by the same authors.
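
The "direct expansion" behind the two gradient identities can be written out at the token level. The sketch below holds the prefix s fixed, so no gradient flows through the prefix distribution, and uses p_T for the teacher, q_θ for the student, and r for the log-ratio reward; this notation is assumed here for illustration and may differ from the paper's.

```latex
\begin{align}
\nabla_\theta\, \mathrm{KL}\big(p_T(\cdot \mid s) \,\|\, q_\theta(\cdot \mid s)\big)
  &= -\sum_{y} p_T(y \mid s)\, \nabla_\theta \log q_\theta(y \mid s)
   = -\,\mathbb{E}_{y \sim p_T(\cdot \mid s)}\big[\nabla_\theta \log q_\theta(y \mid s)\big], \\
\nabla_\theta\, \mathrm{KL}\big(q_\theta(\cdot \mid s) \,\|\, p_T(\cdot \mid s)\big)
  &= \sum_{y} \nabla_\theta q_\theta(y \mid s)\, \log\frac{q_\theta(y \mid s)}{p_T(y \mid s)}
   + \underbrace{\sum_{y} \nabla_\theta q_\theta(y \mid s)}_{=\,0} \\
  &= -\,\mathbb{E}_{y \sim q_\theta(\cdot \mid s)}\big[r(s, y)\, \nabla_\theta \log q_\theta(y \mid s)\big],
  \qquad r(s, y) = \log\frac{p_T(y \mid s)}{q_\theta(y \mid s)}.
\end{align}
```

The first line is the gradient of a cross-entropy loss against the teacher's soft targets. The last line is a REINFORCE-style policy gradient in which the teacher-student log-ratio acts as a per-token reward.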

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of decomposing sequence-level KL over autoregressive distributions and on the resulting gradient identities; these are treated as standard domain assumptions rather than new postulates.

axioms (1)

domain assumption: Sequence-level KL over autoregressive response distributions decomposes into a choice of prefix source paired with either forward or reverse token-level KL.
This decomposition is invoked to identify the implicit coupling in prevailing methods and to derive the four objectives.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL... Their Cartesian product yields four token-level distillation objectives"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: $\mathrm{KL}(p_T \,\|\, q_\theta)(x) = \sum_t \mathbb{E}_{s_t \sim d_t^T}\big[\mathrm{KL}(p_T(\cdot \mid s_t) \,\|\, q_\theta(\cdot \mid s_t))\big]$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 19 internal anchors

[1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations (ICLR), 2024.
[2] AMC2023. American mathematics competitions, 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_Problems_and_Solutions
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[4] Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting, 2025. URL https://arxiv.org/abs/2510.18874
[5] Xinghao Chen, Zhijing Sun, Guo Wenjin, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, and Xiaoyu Shen. Unveiling the key factors for distilling chain-of-thought reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics... doi:10.18653/v1/2025.findings-acl.782, 2025.
[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[7] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[8] DeepSeek-AI. Deepseek-v4 technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical report.
[9] Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026.
[10] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
[11] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations (ICLR), 2024.
[12] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ... OpenThoughts: Data Recipes for Reasoning Models.
[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[14] Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahismail, Haowen Dong, Anirudh Patel, and Bryan Roth. Liger Kernel: Efficient triton kernels for LLM training. arXiv preprint arXiv:2410.10989, 2024.
[15] Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026.
[16] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.
[17] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation, 2016. URL https://arxiv.org/abs/1606.07947
[18] Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026.
[19] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning, 2021. URL https://arxiv.org/abs/2110.06169
[20] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning, 2020. URL https://arxiv.org/abs/2006.04779
[21] KwaiKAT Team. Kat-coder-v2 technical report, 2026. URL https://arxiv.org/abs/2603.27703
[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[23] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[24] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026.
[25] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[26] LLM-Core Xiaomi. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.
[27] Kevin Lu and Thinking Machines Lab. On-policy distillation. https://thinkingmachines.ai/blog/on-policy-distillation/, 2025.
[28] Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527, 2026.
[29] Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog.
[30] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
[31] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[32]

Gordon, and Drew Bagnell

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2011. 11

work page 2011
[33]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Sky-t1: Train your own o1 preview model within $450

NovaSky Team. Sky-t1: Train your own o1 preview model within $450. https://novasky- ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09

work page 2025
[35]

Qwen3.5-omni technical report, 2026

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

work page 2026
[36]

Hy-mt1.5 technical report, 2025

Tencent Hunyuan Team. Hy-mt1.5 technical report, 2025. URL https://arxiv.org/abs/ 2512.24092

work page arXiv 2025
[37]

Hy-embodied-0.5: Embodied foundation models for real-world agents,

Tencent Robotics X, HY Vision Team, :, Xumin Yu, Zuyan Liu, Ziyi Wang, He Zhang, Yong- ming Rao, Fangfu Liu, Yani Zhang, Ruowen Zhao, Oran Wang, Yves Liang, Haitao Lin, Minghui Wang, Yubo Dong, Kevin Cheng, Bolin Ni, Rui Huang, Han Hu, Zhengyou Zhang, Linus, and Shunyu Yao. Hy-embodied-0.5: Embodied foundation models for real-world agents,

work page
[38]

URLhttps://arxiv.org/abs/2604.07430

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Nemotron-cascade 2: Post-training LLMs with cascade RL and multi-domain on-policy distillation.arXiv preprint arXiv:2603.19220, 2026

Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexan- der Bukharin, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade 2: Post-training LLMs with cascade RL and multi-domain on-policy distillation.arXiv preprint arXiv:2603.1...

work page arXiv 2026
[40]

The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye, Caiming Xiong, and Chien-Sheng Wu. The illusion of certainty: Decoupling capability and calibration in on-policy distillation.arXiv preprint arXiv:2604.16830, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024

[42]

Zhipu AI and Tsinghua University. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

Appendix A Related Work

Off-policy distillation for LLMs. Knowledge distillation was originally developed for model compression [13]. In modern LLM post-training, off-policy distillation has increasingly become ...

