arxiv: 2606.05152 · v2 · pith:ZAPMIWALnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI · cs.CL

Reinforcement Learning from Rich Feedback with Distributional DAgger

Pith reviewed 2026-06-28 06:42 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: reinforcement learning, imitation learning, DAgger, rich feedback, distributional methods, policy improvement, reasoning models, cross-entropy objective

The pith

Forward cross-entropy from distributional DAgger guarantees monotonic policy improvement when training on rich feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a distributional variant of DAgger allows the use of rich feedback signals such as execution traces and expert corrections in place of single-bit rewards. The resulting forward cross-entropy objective ensures that each policy update improves performance and supplies regret bounds, whereas reverse KL and Jensen-Shannon objectives used in prior self-distillation methods can increase probability on worse actions. The same objective also optimizes a lower bound on teacher-weighted likelihood of success. These properties hold when the learner has local access to an expert distribution over states visited by the current policy. Experiments show gains over standard RLVR and self-distillation baselines on scientific reasoning, coding, and hard mathematics tasks.

Core claim

The forward cross-entropy objective obtained from distributional DAgger admits monotonic policy improvement and regret guarantees, conducts sequence-level credit assignment by propagating future expert-student disagreement backward, and optimizes a lower bound on teacher-weighted likelihood of success; in contrast, reverse KL and Jensen-Shannon objectives used in prior RL with self-distillation fail to guarantee monotonic improvement even when the expert has higher reward.
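
The contrast in this claim can be made concrete on a toy three-action bandit. The sketch below is an illustration only, not the paper's implementation. The two policies mirror a three-action example excerpted from the paper's appendix. The reward vector (1, 1/2, 0), the step size, and the exact form of both first-order natural-gradient steps are assumptions made here for the sake of a runnable example.

```python
import math

# Toy bandit. Rewards and step size are illustrative assumptions.
rewards = [1.0, 0.5, 0.0]
student = [0.05, 0.45, 0.50]   # current policy
teacher = [0.30, 0.05, 0.65]   # expert policy with higher expected reward


def value(policy):
    """Expected reward of a policy over the three actions."""
    return sum(p * r for p, r in zip(policy, rewards))


def reverse_kl_step(pi, pi_t, eta):
    """First-order natural-gradient step on KL(pi || pi_t) (assumed form)."""
    log_ratio = [math.log(p / q) for p, q in zip(pi, pi_t)]
    kl = sum(p * lr for p, lr in zip(pi, log_ratio))
    return [p - eta * p * (lr - kl) for p, lr in zip(pi, log_ratio)]


def forward_ce_step(pi, pi_t, eta):
    """First-order natural-gradient step on -E_{a ~ pi_t}[log pi(a)].

    For a tabular softmax policy this reduces to mixing toward the teacher.
    """
    return [p + eta * (q - p) for p, q in zip(pi, pi_t)]


eta = 0.1
after_rkl = reverse_kl_step(student, teacher, eta)
after_fce = forward_ce_step(student, teacher, eta)

print("student value:     ", value(student))
print("teacher value:     ", value(teacher))
print("after reverse KL:  ", value(after_rkl))
print("after forward CE:  ", value(after_fce))
```

In this toy setting the reverse-KL step moves mass onto the zero-reward action and lowers expected reward even though the teacher is better than the student, while the forward cross-entropy step moves the value a fraction `eta` of the way toward the teacher's value. Whether the same behaviour holds in the paper's sequential setting depends on the theorems the page summarizes, which are not reproduced here.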

What carries the argument

The forward cross-entropy objective induced by distributional DAgger, which uses local access to an expert distribution on states visited by the current policy.
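
One way to write such an objective is sketched below. The notation is assumed here for illustration: horizon $H$, student policy $\pi_{\theta_i}$ at iteration $i$, expert $\pi_T$, and $d_t^{\pi_{\theta_i}}$ for the distribution over states the student visits at step $t$. It should not be read as the paper's exact statement.

```latex
% Per-state forward cross-entropy against the expert (assumed notation)
\ell_{s_t}(\pi) \;=\; -\,\mathbb{E}_{a \sim \pi_T(\cdot \mid s_t)}\big[\log \pi(a \mid s_t)\big]

% DAgger-style loss at iteration i: expert queried on student-visited states
\mathcal{L}_i(\pi) \;=\; \sum_{t=1}^{H} \mathbb{E}_{s_t \sim d_t^{\pi_{\theta_i}}}\big[\ell_{s_t}(\pi)\big]

% The excess loss over the expert is a forward KL divergence
\ell_{s_t}(\pi) - \ell_{s_t}(\pi_T) \;=\; D_{\mathrm{KL}}\big(\pi_T(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t)\big)
```

The last identity is the standard decomposition of cross-entropy into entropy plus KL divergence. It shows why minimizing the forward cross-entropy on student-visited states is the same as minimizing a forward KL to the expert on those states.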

If this is right

  • The objective works with a black-box expert and does not require full expert trajectories.
  • Sequence-level gradients propagate future disagreement back to earlier decisions for richer credit assignment.
  • The objective yields improved Pass@N compared with binary-reward RLVR.
  • Empirical gains appear across scientific reasoning, coding, and hard mathematical problem solving.
  • Prior reverse KL or Jensen-Shannon self-distillation methods lack monotonic improvement guarantees.
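
For readers who want to check a Pass@N claim on their own runs, a minimal sketch of the commonly used unbiased pass@k estimator follows. The page does not say how the paper computes Pass@N, so treating it as this estimator is an assumption. The function name and arguments are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct).

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 correct out of 10 samples.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Averaging this quantity over problems gives a benchmark-level score. Comparing it at several values of `k` is one way to see whether a training objective preserves sample diversity.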

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The monotonic improvement property could be tested directly by measuring policy value after each update on held-out tasks.
  • The regret bounds may imply reduced sample complexity when rich feedback is available instead of only final-answer correctness.
  • The method could be applied to other imitation-learning settings that supply distributional feedback on visited states.

Load-bearing premise

The learner has local access to an expert distribution on states visited by the current policy.

What would settle it

An experiment in which the forward cross-entropy updates produce a policy whose expected reward is lower than the previous policy or whose regret exceeds the claimed bound.
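
A check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: `values` is a hypothetical list of policy-value estimates (for example, mean held-out reward) recorded after each update, and `tol` is a tolerance the experimenter chooses to absorb evaluation noise.

```python
def first_violation(values, tol=0.0):
    """Return the index of the first update whose value drops by more than tol.

    Returns None when the sequence is non-decreasing up to the tolerance.
    """
    for i in range(1, len(values)):
        if values[i] < values[i - 1] - tol:
            return i
    return None


# Hypothetical value estimates after successive updates.
history = [0.275, 0.280, 0.279, 0.300]
print(first_violation(history))             # strict check
print(first_violation(history, tol=0.005))  # allow small evaluation noise
```

A drop that survives a tolerance matched to the evaluation standard error would count as evidence against the monotonic improvement claim. A drop inside the noise band would not.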

Original abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient {conduct rich credit assignment by propagating} future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DistIL, a distributional variant of DAgger for reinforcement learning from rich feedback (execution traces, tool outputs, expert corrections, self-evaluations) in reasoning models. It proposes a forward cross-entropy objective that admits a black-box expert and claims to deliver monotonic policy improvement, regret guarantees, and optimization of a lower bound on teacher-weighted likelihood of success (improving Pass@N), while outperforming RLVR and reverse-KL/Jensen-Shannon self-distillation baselines on scientific reasoning, coding, and hard math tasks.

Significance. If the central assumption of local expert-distribution access can be realized from the listed feedback types, the work would supply a theoretically motivated alternative to binary-reward RLVR with explicit monotonicity and regret properties that prior self-distillation objectives lack; the empirical gains would then indicate practical value for richer credit assignment.

major comments (2)
  1. [Abstract] Abstract (paragraph on distributional DAgger): the monotonic policy improvement, regret guarantees, and lower-bound claim on teacher-weighted likelihood are all derived under the assumption that the learner has local access to an expert distribution over states visited by the current policy. No construction is supplied that produces this exact distribution from execution traces, tool outputs, expert corrections, or self-evaluations; without such a mapping the theoretical results do not apply to the domains stated in the abstract.
  2. [Abstract] Abstract: the manuscript asserts theoretical results on monotonic improvement and regret for forward cross-entropy yet supplies no proof sketches, theorem statements, or derivation outlines in the provided text, leaving the soundness of the contrast with reverse KL and Jensen-Shannon objectives unexamined.
minor comments (1)
  1. [Abstract] Abstract: the clause 'whose sequence-level gradient {conduct rich credit assignment by propagating}' contains an apparent placeholder or grammatical error and should be rewritten for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for pointing out areas where the presentation of the theoretical results could be improved. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on distributional DAgger): the monotonic policy improvement, regret guarantees, and lower-bound claim on teacher-weighted likelihood are all derived under the assumption that the learner has local access to an expert distribution over states visited by the current policy. No construction is supplied that produces this exact distribution from execution traces, tool outputs, expert corrections, or self-evaluations; without such a mapping the theoretical results do not apply to the domains stated in the abstract.

    Authors: The full manuscript details the construction of the expert distribution from rich feedback in Section 3. For instance, tool outputs and execution traces allow us to define expert actions at each state visited by the policy, and expert corrections provide direct samples from the expert distribution. Self-evaluations can be used to weight or select expert-like responses. We agree that the abstract would benefit from a brief clarification on this point and will revise it accordingly to explicitly note how the local expert distribution is obtained from the feedback types. revision: yes

  2. Referee: [Abstract] Abstract: the manuscript asserts theoretical results on monotonic improvement and regret for forward cross-entropy yet supplies no proof sketches, theorem statements, or derivation outlines in the provided text, leaving the soundness of the contrast with reverse KL and Jensen-Shannon objectives unexamined.

    Authors: We note that the full manuscript includes the theorem statements and proofs in Section 4 (Monotonic Improvement and Regret Analysis) and the appendix. The abstract summarizes these results. To make the theoretical claims more transparent, we will add references to the specific theorems in the abstract or include a short outline of the key derivation in the main body. The contrast with reverse KL and JS is examined in the proofs, where we show that those objectives do not guarantee monotonic improvement even when the expert has higher reward. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a distributional DAgger setup with local expert distribution access and derives a forward cross-entropy objective from it. It then shows (via standard analysis) that this objective admits monotonic policy improvement and regret bounds, while prior reverse-KL/JS self-distillation objectives do not. These are presented as mathematical properties of the chosen objective under the explicit assumption, not as quantities fitted to data or renamed from prior results. No self-citation chain, self-definitional loop, or fitted-input-called-prediction pattern appears in the derivation of the central claims. The assumption about realizing the expert distribution from rich feedback is stated separately and does not reduce the theorems to tautology. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling choice that rich feedback supplies an accessible expert distribution; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Rich feedback (execution traces, tool outputs, corrections) can be represented as samples from an expert distribution over actions at visited states.
    Invoked to justify the distributional DAgger setup and local expert access.
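
If the axiom holds, the forward cross-entropy at a visited state can be estimated from expert samples alone, without the expert's probabilities. The sketch below illustrates that point. The action names, the student probabilities, and the function name are hypothetical, and nothing here reflects how the paper actually builds its expert distribution.

```python
import math


def forward_ce_from_samples(student_probs, expert_actions):
    """Monte Carlo estimate of -E_{a ~ expert}[log student(a)] at one state.

    student_probs: dict mapping each action to the student's probability
    expert_actions: actions sampled from the (black-box) expert at this state
    """
    if not expert_actions:
        raise ValueError("need at least one expert sample")
    total = sum(math.log(student_probs[a]) for a in expert_actions)
    return -total / len(expert_actions)


# Hypothetical state: the student's action distribution and four expert samples.
student_probs = {"fix_bug": 0.2, "retry": 0.5, "give_up": 0.3}
expert_actions = ["fix_bug", "fix_bug", "retry", "fix_bug"]

loss = forward_ce_from_samples(student_probs, expert_actions)
print(loss)
```

Because the estimate only needs samples, it is compatible with a black-box expert. The open question raised in the referee report is whether execution traces or self-evaluations really yield samples from a single well-defined expert distribution.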

pith-pipeline@v0.9.1-grok · 5776 in / 1221 out tokens · 28174 ms · 2026-06-28T06:42:31.493283+00:00 · methodology

