A Brief Overview: On-Policy Self-Distillation In Large Language Models
Authors: Fangming Cui, Sunan Li, and Jiahong Li
Pith reviewed 2026-05-22 09:48 UTC · model grok-4.3
The pith
On-policy self-distillation lets one large language model serve as both teacher and student to align reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation.
What carries the argument
A single model plays two roles: the teacher role is given verified reasoning traces, while the student role sees only the problem statement. Training minimizes the per-token distributional divergence between the two roles on trajectories sampled from the student.
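One way to write this objective down is sketched below. The notation is assumed here for illustration and may not match the paper's own equations: x is the problem statement, r the verified reasoning trace, y a trajectory sampled from the student role, \pi_\theta the single shared model, and D a per-token divergence left unspecified.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x}\;
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \frac{1}{|y|} \sum_{t=1}^{|y|}
      D\!\left(
        \pi_\theta(\cdot \mid x, r, y_{<t})
        \,\middle\|\,
        \pi_\theta(\cdot \mid x, y_{<t})
      \right)
    \right]
```

The first argument of D is the teacher role (conditioned on the trace r), the second is the student role (problem only). Both are evaluated on the same student-sampled prefix y_{<t}, which is what makes the scheme on-policy. Whether gradients flow through the teacher term is not stated on this page, so that choice is left open in the sketch.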
Load-bearing premise
The argument assumes that verified reasoning traces are reliably available to the teacher role, and that minimizing divergence on student-sampled trajectories produces stable alignment without amplifying errors.
What would settle it
An ablation on reasoning tasks in which the teacher's access to verified traces is removed, or the traces are replaced with noisy ones. If the model then performs worse than baseline training methods, the dependence on trace quality is confirmed.
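The three conditions of such a test could be set up roughly as follows. This is a minimal sketch, not a procedure from the paper: the function name `teacher_context`, the condition labels, the `noise_rate` parameter, and the placeholder-based corruption are all hypothetical choices made for illustration.

```python
import random


def teacher_context(problem, trace_steps, condition, noise_rate=0.5, rng=None):
    """Build the teacher-role prompt under one ablation condition.

    condition is one of (hypothetical labels):
      "verified": the full verified trace is appended to the problem
      "noisy":    each step is replaced by a placeholder with prob. noise_rate
      "none":     the teacher sees only the problem, like the student
    """
    rng = rng or random.Random(0)
    if condition == "verified":
        steps = list(trace_steps)
    elif condition == "noisy":
        steps = [
            "[corrupted step]" if rng.random() < noise_rate else step
            for step in trace_steps
        ]
    elif condition == "none":
        steps = []
    else:
        raise ValueError(f"unknown condition: {condition}")
    return "\n".join([problem, *steps])


# Toy problem and trace, invented for this example.
problem = "Compute 12 * 7."
trace = ["12 * 7 = 12 * (5 + 2)", "= 60 + 24", "= 84"]

contexts = {
    name: teacher_context(problem, trace, name, rng=random.Random(0))
    for name in ("verified", "noisy", "none")
}
```

Training one model per condition and comparing reasoning accuracy against a baseline method would then be the actual experiment; that part depends on the training stack and is not sketched here.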
Original abstract
On-Policy Self-Distillation (OPSD) is a unified learning framework in which a single large language model acts simultaneously as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.
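The per-token divergence described in the abstract can be illustrated with a small pure-Python sketch. It assumes a generalized Jensen-Shannon divergence with mixing weight beta, which is one common choice and not necessarily the exact form used in the paper. The function names and the toy logits are hypothetical; in practice both sets of logits would come from the same model, once with the verified trace in context (teacher role) and once without it (student role), scored on the same student-sampled tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd_beta(p, q, beta=0.5):
    """Assumed form: beta * KL(p || m) + (1 - beta) * KL(q || m),
    where m = beta * p + (1 - beta) * q and 0 < beta < 1."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


def per_token_loss(teacher_logits, student_logits, beta=0.5):
    """Mean divergence over the positions of one student-sampled trajectory."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("both roles must score the same positions")
    divs = [
        jsd_beta(softmax(t), softmax(s), beta)
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(divs) / len(divs)


# Toy logits over a 3-token vocabulary at 2 positions (invented values).
teacher_logits = [[0.5, 2.0, -1.0], [0.1, 0.2, 0.3]]  # trace in context
student_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]  # problem only
loss = per_token_loss(teacher_logits, student_logits)
```

In this toy case the two roles disagree only at the first position, so the loss is positive and comes entirely from that token. A real implementation would compute the same quantity on batched tensors in an autodiff framework.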
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a brief overview of On-Policy Self-Distillation (OPSD) for large language models. It defines OPSD as a unified framework in which a single LLM simultaneously serves as teacher (with privileged access to verified reasoning traces) and student (observing only the problem statement). Training minimizes per-token distributional divergence between the two roles on trajectories sampled from the student policy. The paper claims this eliminates external teachers, leverages ground-truth solutions directly, resolves off-policy distribution mismatch, and reduces GPU memory consumption by 40%-60% relative to standard On-Policy Distillation (OPD). The discussion covers conceptual foundations, methodological innovations, and design principles aimed at beginners.
Significance. If the described benefits hold, OPSD could provide a practical route to memory-efficient, self-contained alignment of LLMs without separate teacher models. The claimed 40-60% memory reduction would be a concrete engineering advantage for scaling distillation. However, because the manuscript supplies only descriptive synthesis and no new derivations, experiments, or stability analysis, its significance is limited to potential utility as an introductory summary rather than a substantive advance.
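To see what a memory saving of this kind would depend on, a back-of-envelope accounting of weight-related state can help. The sketch below is an illustration under stated assumptions, not a reconstruction of the paper's measurement: it assumes mixed-precision Adam training at roughly 16 bytes per trainable parameter, a frozen teacher held at 2 bytes per parameter, and an OPSD teacher role that adds no extra weights. Activations, KV cache, and sharding are ignored, and the function names and parameter counts are hypothetical.

```python
def training_bytes(trainable_params, frozen_params,
                   bytes_trainable=16, bytes_frozen=2):
    """Weight-related memory only (assumed per-parameter costs).

    16 bytes/param: half-precision weights and gradients plus
    full-precision master weights and two Adam moments.
    2 bytes/param: frozen half-precision weights used for inference only.
    """
    return trainable_params * bytes_trainable + frozen_params * bytes_frozen


def weight_state_saving(student_params, teacher_params):
    """Fractional saving from dropping a separate frozen teacher."""
    with_teacher = training_bytes(student_params, teacher_params)
    shared_model = training_bytes(student_params, 0)
    return 1.0 - shared_model / with_teacher


# Hypothetical student size; teacher size expressed as a multiple of it.
student = 8e9
savings = {
    ratio: weight_state_saving(student, ratio * student)
    for ratio in (1, 4, 8, 12)
}
```

Under these assumptions the saving works out to r / (8 + r) for a teacher r times the size of the student, so it varies widely with the teacher-to-student size ratio. Because activation memory is not modeled, this arithmetic neither confirms nor refutes the quoted range; it only shows which baseline details the claim would need to state.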
major comments (2)
- [Abstract] The quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.
- [Abstract and framework description] The setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.
minor comments (1)
- The manuscript would benefit from explicit section headings or numbered subsections to improve readability for the intended beginner audience.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our brief overview manuscript. We appreciate the identification of areas where claims require better qualification given the paper's scope as a conceptual synthesis rather than a source of new empirical results. We address each major comment below and outline the planned revisions.
Point-by-point responses
- Referee: [Abstract] The quantitative claim that OPSD 'typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD)' is asserted without any experimental data, baselines, error bars, implementation details, or citations. This figure is load-bearing for the central practical advantage but rests on description alone.
  Authors: We agree that the manuscript, being an overview without new experiments, should not present the 40-60% figure as an unsubstantiated assertion. This range is intended to reflect reported outcomes from prior OPSD implementations in the literature that the paper synthesizes. In the revised version we will either insert citations to the specific studies documenting these memory reductions or rephrase the statement to indicate that such savings have been observed in existing OPSD work, thereby removing any implication that the figure is a new result of this overview.
  Revision: yes
- Referee: [Abstract and framework description] The setup assumes verified reasoning traces are reliably available to the teacher role and that per-token distributional divergence minimization on student-sampled trajectories produces stable alignment without error amplification. No derivation, stability analysis, or counter-example check is supplied to support why the student policy remains anchored rather than drifting; this assumption is load-bearing for the claim of reliable self-alignment.
  Authors: The referee is correct that the paper supplies no new derivations or formal stability analysis; this is consistent with its purpose as a beginner-oriented overview of existing design principles. The on-policy sampling combined with privileged access to verified traces is presented as the mechanism intended to limit distribution shift and error accumulation. We will revise the abstract and framework sections to state these assumptions more explicitly and to include a concise paragraph noting potential limitations, such as dependence on trace quality and the desirability of empirical checks for drift in particular domains, while pointing readers to the referenced empirical studies for further investigation.
  Revision: yes
Circularity Check
No circularity: purely descriptive overview without derivations or fitted predictions
full rationale
The paper is explicitly framed as a 'brief overview' and 'brief analysis of the conceptual foundations' of OPSD. It defines the framework in prose (single model as teacher/student with privileged traces, minimize per-token divergence on student trajectories) but supplies no equations, no parameter fitting, no predictions, and no self-citation chains that bear the central claim. The memory-reduction claim is stated as an empirical observation rather than a derived result. No load-bearing step reduces to its own inputs by construction; the content remains self-contained description.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself (eq. 6, JSD_β in eq. 7)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "single large language model acts simultaneously as both teacher and student... privileged access to verified reasoning traces"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.