pith. machine review for the scientific record.

arxiv: 2605.06597 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords self-distillation · large language models · unified framework · training stability · supervision reliability · model adaptation · contrastive learning · EMA stabilization

The pith

A unified framework makes self-distillation a reliable way to adapt large language models without stronger teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops UniSD to study self-distillation in autoregressive large language models where self-generated outputs create unstable supervision. It integrates several techniques to improve reliability, alignment, and stability during training. Experiments across different models and tasks show when and how these techniques contribute to better performance. The full combination leads to measurable gains over base models and prior approaches. This positions self-distillation as a practical method for efficient model adaptation.

Core claim

UniSD integrates multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping to address supervision reliability, representation alignment, and training stability in self-distillation for LLMs. Guided by analysis of component roles and interactions, the UniSDfull pipeline achieves the strongest results, with improvements of 5.4 points over the base model and 2.8 points over the strongest baseline across six benchmarks and six models from three families.

What carries the argument

The UniSD unified framework that combines complementary mechanisms for handling free-form self-generated trajectories in self-distillation.
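
To make the load-bearing machinery concrete, here is a minimal sketch, in PyTorch, of how three of the named components (an EMA teacher, per-token reliability weights from multi-teacher agreement, and divergence clipping) might combine into one distillation objective. This is not the authors' code; every function name, shape convention, and default value below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # EMA teacher stabilization: the teacher's weights trail the student's,
    # smoothing out noisy self-generated supervision.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def self_distill_loss(student_logits, teacher_logits, agreement_w, clip=5.0):
    # student_logits, teacher_logits: [batch, seq, vocab]
    # agreement_w: [batch, seq] reliability weights, e.g. the fraction of
    # auxiliary teachers that agree on each token (hypothetical construction).
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    # Token-level KL divergence from teacher to student.
    kl = (p_t * (p_t.clamp_min(1e-9).log() - log_p_s)).sum(dim=-1)
    # Divergence clipping: cap per-token KL so unreliable tokens cannot dominate.
    kl = kl.clamp(max=clip)
    # Agreement weighting: down-weight tokens where auxiliary teachers disagree.
    return (agreement_w * kl).sum() / agreement_w.sum().clamp_min(1e-9)
```

In a training loop, `ema_update` would run after each optimizer step and `self_distill_loss` would replace or accompany the task loss; the token-level contrastive and feature-matching terms the paper also lists would enter as further objectives on hidden states.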

If this is right

  • Self-distillation can outperform static imitation when stabilization and alignment components are included.
  • The choice of components determines gains on different tasks and model families.
  • Combining the mechanisms yields the highest overall performance improvements.
  • Analysis reveals the conditions under which self-distillation provides benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting this approach could allow smaller teams to adapt large models using only their own compute resources.
  • Similar unification strategies might apply to other areas where self-supervised signals are noisy.
  • Further work could test if these gains hold when scaling to models with billions more parameters.
  • The insights on component interactions could guide design of future distillation methods.

Load-bearing premise

The individual mechanisms address instability in self-generated supervision and their benefits combine additively without hidden selection effects in the experiments.
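
A cheap way to probe the additivity half of this premise, once per-component ablations are published, is to compare the sum of single-component gains against the full-pipeline gain. A toy sketch follows; the per-component numbers are placeholders, not the paper's results, and only the 5.4-point full gain is taken from the abstract.

```python
# Hypothetical single-component gains (points over the base model); replace
# with the paper's ablation table to run the actual check.
single_gains = {"agreement": 2.1, "ema": 1.8, "contrastive": 0.9,
                "feature_match": 0.7, "clipping": 0.5}
full_gain = 5.4  # reported gain of the full pipeline over the base model

additive_prediction = sum(single_gains.values())
interaction = full_gain - additive_prediction
print(f"sum of single-component gains: {additive_prediction:.1f}")
print(f"full-pipeline gain:            {full_gain:.1f}")
print(f"interaction (full minus sum):  {interaction:+.1f}")
# A large negative interaction would point to redundant components; a large
# positive one to complementarity beyond simple additivity.
```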

What would settle it

Testing the UniSDfull pipeline on a new benchmark suite or model family where it fails to exceed the strongest baseline by a similar margin would indicate the gains are not general.

Figures

Figures reproduced from arXiv: 2605.06597 by B. Aditya Prakash, Haoxin Liu, Jindong Wang, Josiah Hester, Lucheng Fu, Srijan Kumar, Yijia Xiao, Yinyi Luo, Yiqiao Jin, Yiyang Wang.

Figure 1: Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement…

Figure 2: UniSD is a unified framework for systematically studying self-distillation in autoregressive LLMs. It integrates multiple complementary objectives: Multi-Teacher Agreement, EMA Teacher, Token-Level Contrastive Learning, Feature Matching, and Divergence Clipping. The modular design enables controlled analysis of each component and is extensible to additional strategies.

Figure 3: Left: Gains over the raw Qwen2.5 model [30] across four size variants on ScienceQA (in-domain) and GPQA (OOD). UniSD∗ reaches the largest gain (+7.06) on Qwen2.5-3B. Right: Base-distribution retention perplexity across the same Qwen2.5 size variants.

Figure 4: Comparison of token-level agreement across three auxiliary teacher construction strategies. Sensitivity analysis shows that more contexts do not necessarily improve performance.

Figure 5: Left: Training time vs. accuracy. Middle: Component effectiveness analysis; the full framework UniSD∗ outperforms all individual components, with EMA and multi-teacher agreement providing the strongest single-component gains. Right: Training loss curve on Qwen2.5-7B.

Figure 6: Distribution of base-scored perplexity and …

Figure 7: Gains over the original model across Qwen2.5, Llama-3.1, and Gemma-3 on ScienceQA (SQA), …

Figure 8: Left: Training time comparison of UniSD variants on ScienceQA. Right: Retention perplexity comparison across UniSD variants. These variants preserve high throughput (2.32–3.22M tokens/GPU-hour), showing that adding representation, contrastive, or temporal stabilization incurs only modest overhead. Agreement-based variants require 0.16–0.18 kWh per million tokens and also increase peak memory by roughly 13–…

Figure 9: Sensitivity to the number of contexts k and the agreement weight γ. Adding more contexts does not consistently improve accuracy.

Figure 10: Sensitivity to the number of contexts k and the agreement weight γ. Adding more contexts does not consistently improve accuracy.

Figure 11: Comparison of sequence-level agreement across three auxiliary-context strategies: random, …

Figure 12: Per-dataset gains of UniSD variants over the raw Qwen2.5-7B model. Asterisks (…
read the original abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes UniSD, a unified self-distillation framework for large language models that integrates mechanisms including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping to improve supervision reliability, representation alignment, and training stability in autoregressive LLMs using self-generated trajectories. Through experiments across six benchmarks and six models from three families, it analyzes component effectiveness and interactions, and presents UniSDfull—an integrated pipeline—as achieving the strongest performance with +5.4 points over the base model and +2.8 over the strongest baseline.

Significance. If the gains hold after addressing assembly concerns, the work would be significant for demonstrating self-distillation as a practical, steerable method for efficient LLM adaptation without external teachers. Credit is due to the broad evaluation spanning multiple model families and tasks, which supports claims about generalizability and when self-distillation outperforms static imitation.

major comments (1)
  1. [Abstract] The construction of UniSDfull 'guided by these insights' from component studies performed on the same six benchmarks risks post-hoc selection bias; without pre-specification of the exact combination of the five mechanisms or reporting of all enumerated subsets, the reported +5.4 and +2.8 point gains cannot be confidently attributed to reliable complementarity rather than selection on the evaluation data.
minor comments (2)
  1. The reported performance metrics lack error bars, standard deviations, or details on the number of runs, making it difficult to assess the statistical reliability of the improvements (a minimal reporting sketch follows this list).
  2. Baseline definitions and implementation details (e.g., how static imitation and other self-distillation methods are instantiated) should be expanded for reproducibility.
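
On the first minor point, the requested reporting is inexpensive to add. A minimal sketch, with hypothetical per-seed accuracies standing in for the repeated runs the paper would report:

```python
import statistics as stats

# Hypothetical accuracies from five seeds for one (model, benchmark) cell;
# the paper's own repeated-run results would go here.
runs = [71.2, 70.5, 71.9, 70.8, 71.4]

mean = stats.mean(runs)
sd = stats.stdev(runs)               # sample standard deviation
sem = sd / len(runs) ** 0.5          # standard error of the mean
ci95 = 1.96 * sem                    # normal-approximation 95% interval
print(f"{mean:.1f} ± {sd:.1f} (n={len(runs)}, 95% CI ±{ci95:.1f})")
```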

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback and for highlighting the potential for post-hoc selection bias in the construction of UniSDfull. We address this concern directly below and commit to revisions that increase transparency around the component selection process.

read point-by-point responses
  1. Referee: [Abstract] The construction of UniSDfull 'guided by these insights' from component studies performed on the same six benchmarks risks post-hoc selection bias; without pre-specification of the exact combination of the five mechanisms or reporting of all enumerated subsets, the reported +5.4 and +2.8 point gains cannot be confidently attributed to reliable complementarity rather than selection on the evaluation data.

    Authors: We agree that the current presentation leaves room for this interpretation. The component ablations were performed to identify which mechanisms provide complementary benefits across tasks and models, and UniSDfull was assembled from those showing consistent positive interactions rather than from exhaustive enumeration of all 2^5 subsets. To strengthen the claim, the revised manuscript will (1) explicitly state the selection criteria (average improvement across all six benchmarks and six models), (2) add a table reporting performance for the key partial combinations explored during development, and (3) include a brief discussion of the risk of evaluation-set overfitting with the mitigation steps taken (multi-model, multi-task evaluation). These additions will allow readers to judge the robustness of the reported gains independently. revision: yes
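
The selection criterion promised in the rebuttal (average improvement across all six benchmarks and six models) is mechanical enough to report exhaustively. A sketch of what that report could look like; the function names and the dict-based interface are assumptions, not the authors' tooling.

```python
from itertools import combinations

COMPONENTS = ("agreement", "ema", "contrastive", "feature_match", "clipping")

def all_subsets():
    # Yield every one of the 2^5 = 32 component combinations, including the
    # empty set (the base model) and the full set (UniSDfull).
    for k in range(len(COMPONENTS) + 1):
        yield from combinations(COMPONENTS, k)

def rank_by_avg_gain(gain_table):
    # gain_table: dict mapping frozenset of components -> mean improvement over
    # the base model, averaged across the six benchmarks and six models.
    return sorted(all_subsets(),
                  key=lambda s: gain_table.get(frozenset(s), float("-inf")),
                  reverse=True)
```

Publishing the full ranking produced by `rank_by_avg_gain`, rather than only the winning subset, is what would let readers rule out selection on the evaluation data.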

Circularity Check

0 steps flagged

No significant circularity in empirical framework

full rationale

The paper presents an empirical study proposing UniSD as a unified self-distillation framework for LLMs, integrating mechanisms like multi-teacher agreement and EMA stabilization, then evaluating performance across six benchmarks and six models. No mathematical derivation chain, equations, or predictions exist that reduce by construction to fitted inputs or self-definitions. The assembly of UniSDfull is described as guided by component insights from experiments, which is standard empirical practice rather than a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a manner that creates circularity per the defined patterns. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the framework builds on standard self-distillation and contrastive learning ideas without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5557 in / 1077 out tokens · 32498 ms · 2026-05-08T09:52:45.380637+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 22 canonical work pages · 15 internal anchors

  1. [1]

    Visual instruction tuning. NeurIPS, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023

  2. [2]

    Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023

  3. [3]

    Visual program distillation: Distilling tools and programmatic reasoning into vision-language models

    Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. In CVPR, pages 9590–9601, 2024

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024

  5. [5]

    Simpo: Simple preference optimization with a reference-free reward. NeurIPS, 37:124198–124235, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. NeurIPS, 37:124198–124235, 2024

  6. [6]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388, 2025

  7. [7]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015

  8. [8]

    Gpt4all: An ecosystem of open source compressed language models

    Yuvanesh Anand, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Benjamin Schmidt, Brandon Duderstadt, and Andriy Mulyar. Gpt4all: An ecosystem of open source compressed language models. In NLP-OSS Workshop, pages 59–64, 2023

  9. [9]

    Agentark: Distilling multi-agent intelligence into a single llm agent

    Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, and Jindong Wang. Agentark: Distilling multi-agent intelligence into a single llm agent. arXiv:2602.03955, 2026

  10. [10]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv:2601.19897, 2026

  11. [11]

    A survey on knowledge distillation of large language models

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv:2402.13116, 2024

  12. [12]

    Companioncast: A multi-agent conversational ai framework with spatial audio for social co-viewing experiences. ACM CHI 2026 Workshop on Human-Agent Collaboration, 2026

    Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, and Josiah Hester. Companioncast: A multi-agent conversational ai framework with spatial audio for social co-viewing experiences. ACM CHI 2026 Workshop on Human-Agent Collaboration, 2026

  13. [13]

    Harnessing the wisdom of the inner crowd. Trends in cognitive sciences, 18(10):504–506, 2014

    Stefan M Herzog and Ralph Hertwig. Harnessing the wisdom of the inner crowd. Trends in cognitive sciences, 18(10):504–506, 2014

  14. [14]

    Instruction induction: From few examples to natural language task descriptions

    Or Honovich, Uri Shaham, Samuel Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. In ACL, pages 1935–1952, 2023

  15. [15]

    Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995

    George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995

  16. [16]

    Ppdb: The paraphrase database

    Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. Ppdb: The paraphrase database. In NAACL, pages 758–764, 2013

  17. [17]

    Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

    John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In EMNLP, pages 119–126, 2020

  18. [18]

    Less is more: Task-aware layer-wise distillation for language model compression

    Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. Less is more: Task-aware layer-wise distillation for language model compression. In ICML, pages 20852–20867. PMLR, 2023

  19. [19]

    Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In ACL, pages 2140–2151, 2021

  20. [20]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024

  21. [21]

    Embarrassingly simple self-distillation improves code generation. arXiv:2604.01193, 2026

    Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. Embarrassingly simple self-distillation improves code generation. arXiv:2604.01193, 2026

  22. [22]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv:2601.18734, 2026

  23. [23]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022

  24. [24]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In COLM, 2024

  25. [25]

    Explain yourself! leveraging language models for commonsense reasoning

    Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. In ACL, pages 4932–4942, 2019

  26. [26]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In NAACL, pages 4149–4158, 2019

  27. [27]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv:2108.07732, 2021

  28. [28]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021

  29. [29]

    Toolalpaca: Generalized tool learning for language models with 3000 simulated cases

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv:2306.05301, 2023

  30. [30]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  31. [31]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv:2407.21783, 2024

  32. [32]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv:2503.19786, 2025

  33. [33]

    How context affects language models’ factual predictions

    Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. In AKBC, 2020

  34. [34]

    Lost in the middle: How language models use long contexts. TACL, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. TACL, 12:157–173, 2024

  35. [35]

    Federated continual learning via knowledge fusion: A survey. TKDE, 36(8):3832–3850, 2024

    Xin Yang, Hao Yu, Xin Gao, Hao Wang, Junbo Zhang, and Tianrui Li. Federated continual learning via knowledge fusion: A survey. TKDE, 36(8):3832–3850, 2024

  36. [36]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

  37. [37]

    A comprehensive survey of continual learning: Theory, method and application. TPAMI, 46(8):5362–5383, 2024

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. TPAMI, 46(8):5362–5383, 2024

  38. [38]

    Mascot: Towards multi-agent socio-collaborative companion systems. arXiv:2601.14230, 2026

    Yiyang Wang, Yiqiao Jin, Alex Cabral, and Josiah Hester. Mascot: Towards multi-agent socio-collaborative companion systems. arXiv:2601.14230, 2026

  39. [39]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv:2604.00626, 2026

  40. [40]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In ICLR, 2024

  41. [41]

    Distillm: Towards streamlined distillation for large language models

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models. In ICML, 2024

  42. [42]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv:2603.07079, 2026

  43. [43]

    Vla-opd: Bridging offline sft and online rl for vision-language-action models via on-policy distillation. arXiv:2603.26666, 2026

    Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, and Haoang Li. Vla-opd: Bridging offline sft and online rl for vision-language-action models via on-policy distillation. arXiv:2603.26666, 2026

  44. [44]

    SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. Scope: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting. arXiv:2604.10688, 2026

  45. [45]

    Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv:2604.08527, 2026

  46. [46]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv:2601.20802, 2026

  47. [47]

    Energy and policy considerations for deep learning in nlp

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In ACL, pages 3645–3650, 2019

  48. [48]

    Green ai. Communications of the ACM, 63(12):54–63, 2020

    Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63, 2020

  49. [49]

    Carbon Emissions and Large Neural Network Training

    David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv:2104.10350, 2021

  50. [50]

    CodeCarbon: mlco2/codecarbon v2.4.1, May 2024

    Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, et al. CodeCarbon: mlco2/codecarbon v2.4.1, May 2024. Software

  51. [51]

    Quantifying the Carbon Emissions of Machine Learning

    Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv:1910.09700, 2019

  52. [52]

    Pue: a comprehensive examination of the metric. White paper, 49:52, 2012

    Victor Avelar, Dan Azevedo, Alan French, and Emerson Network Power. Pue: a comprehensive examination of the metric. White paper, 49:52, 2012

  53. [53]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021

  54. [54]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  55. [55]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, pages 611–626, 2023

  56. [56]

    Beyond magic words: Sharpness-aware prompt evolving for robust large language models with tare

    Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, et al. Beyond magic words: Sharpness-aware prompt evolving for robust large language models with tare. In ICLR, 2026