UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

B. Aditya Prakash; Haoxin Liu; Jindong Wang; Josiah Hester; Lucheng Fu; Srijan Kumar; Yijia Xiao; Yinyi Luo; Yiqiao Jin; Yiyang Wang

arxiv: 2605.06597 · v2 · pith:7XEA25FQnew · submitted 2026-05-07 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-22 09:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG

keywords self-distillation · large language models · model adaptation · knowledge distillation · training stability · autoregressive models · benchmark evaluation

The pith

UniSD unifies multiple self-distillation mechanisms so large language models can improve using only their own outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UniSD as a single framework that pulls together several techniques to make self-distillation work reliably in autoregressive LLMs. Existing approaches test isolated choices, but UniSD examines how multi-teacher agreement, EMA stabilization, token-level contrastive learning, feature matching, and divergence clipping interact to fix unreliable supervision and unstable training. Experiments across six benchmarks and six models from three families show when self-distillation beats static imitation and which parts drive the gains. The integrated UniSDfull version delivers the largest lift, reaching 5.4 points above the base model and 2.8 points above the best prior method. This matters because it turns self-generated data into a steerable way to adapt LLMs without needing stronger external teachers.

Core claim

UniSD integrates multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping to address supervision reliability, representation alignment, and training stability in self-distillation for autoregressive LLMs. Across six benchmarks and six models, the framework reveals when self-distillation improves over static imitation, identifies the components that drive gains, and shows how those components interact across tasks. The combined UniSDfull pipeline produces the strongest results, improving over the base model by 5.4 points and over the strongest baseline by 2.8 points.

What carries the argument

UniSD, a unified framework that combines multi-teacher agreement for reliable supervision, EMA stabilization for training consistency, token-level contrastive learning, feature matching for representation alignment, and divergence clipping.
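Two of the listed mechanisms have simple generic forms that can be sketched in a few lines. The Python below is an illustrative sketch only, not the paper's implementation: the function names, the `decay` and `clip` values, and the choice of KL divergence as the clipped quantity are all assumptions made for the example, since this summary does not give the paper's exact formulas.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student.

    Generic exponential moving average: t <- decay * t + (1 - decay) * s.
    `decay` is an assumed value, not one taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def clipped_divergence(p, q, clip=1.0):
    """Cap the per-token divergence so one outlier token cannot dominate.

    Clipping KL at a fixed threshold is an assumed form of divergence clipping.
    """
    return min(kl_divergence(p, q), clip)

# Toy usage with made-up numbers.
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
capped = clipped_divergence([1.0, 0.0], [0.01, 0.99], clip=1.0)
```

With a decay close to 1 the teacher changes slowly, which is the usual reason an EMA teacher is expected to give steadier targets than the raw student.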

If this is right

Self-distillation outperforms static imitation once complementary mechanisms jointly handle supervision quality, alignment, and stability.
Gains from individual components vary by task, so the full combination yields the most consistent improvements across benchmarks.
Self-distillation becomes a practical route for LLM adaptation that avoids dependence on stronger external teachers.
Component-interaction insights can guide construction of integrated pipelines that outperform any single technique.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unification approach may help stabilize self-improvement loops in non-autoregressive or multimodal models.
The framework could reduce the need for curated external data in domain-adaptation settings where only the model's own generations are available.
If the mechanisms prove robust, similar modular combinations might accelerate efficient fine-tuning under strict compute limits.

Load-bearing premise

That the listed mechanisms can be combined without creating new instabilities or task-specific biases that erase the reported gains on the chosen benchmarks.

What would settle it

Running UniSDfull on a new held-out benchmark or larger model family where the average gain over the strongest baseline drops below 1 point would falsify the claim of consistent superiority.
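The check described above is plain arithmetic and can be written down directly. The following is a minimal sketch under stated assumptions: the helper names, the dictionary layout, and the example scores are hypothetical, and only the 1-point threshold comes from the text.

```python
from statistics import mean

def average_gain(method_scores, baseline_scores):
    """Mean per-benchmark difference (method minus baseline) on shared benchmarks."""
    shared = sorted(set(method_scores) & set(baseline_scores))
    if not shared:
        raise ValueError("no shared benchmarks to compare")
    return mean(method_scores[name] - baseline_scores[name] for name in shared)

def claim_survives(method_scores, baseline_scores, threshold=1.0):
    """True if the average gain stays at or above the threshold (1 point here)."""
    return average_gain(method_scores, baseline_scores) >= threshold

# Made-up scores for illustration only; these are not results from the paper.
unisd_full = {"bench_a": 70.0, "bench_b": 60.0}
strongest_baseline = {"bench_a": 68.0, "bench_b": 59.0}
survives = claim_survives(unisd_full, strongest_baseline)
```

Averaging only over benchmarks present in both dictionaries avoids silently comparing different task sets.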

Figures

Figures reproduced from arXiv: 2605.06597 by B. Aditya Prakash, Haoxin Liu, Jindong Wang, Josiah Hester, Lucheng Fu, Srijan Kumar, Yijia Xiao, Yinyi Luo, Yiqiao Jin, Yiyang Wang.

**Figure 1.** Overview of UniSD, a unified framework for self-distillation in LLMs. UniSD integrates agreement,

**Figure 2.** UniSD is a Unified framework for systematically studying Self-Distillation in autoregressive LLMs. It integrates multiple complementary objectives: Multi-Teacher Agreement, EMA Teacher, Token-Level Contrastive Learning, Feature Matching, and Divergence Clipping. The modular design enables controlled analysis of each component and is extensible to additional strategies. • Guided by these insights, we constr…

**Figure 3.** Left. Gains over the raw Qwen2.5 model [30] across four size variants on ScienceQA (in-domain) and GPQA (OOD). UniSD∗ reaches the largest gain (+7.06) on Qwen2.5-3B. Right. Base-distribution retention perplexity across the same Qwen2.5 size variants. contains diverse reasoning paths, program implementations, or formats. In contrast, on-policy baselines provide a stronger starting point. SDFT improves ToolA…

**Figure 4.** Comparison of token-level agreement across 3 auxiliary teacher construction strategies. Sensitivity analysis shows that more contexts do not necessarily improve performance. The sensitivity analyses in Appendix Figures 9 and 10 show that performance changes non-monotonically with K. The best setting depends on both task and granularity: sequence-level agreement peaks at K = 3 on ScienceQA with γ = 0.01 (8…

**Figure 5.** Left: Training time vs. accuracy. Middle: Component effectiveness analysis. The full framework UniSD∗ outperforms all individual components. EMA and multi-teacher agreement provide the strongest single-component gains. Right: Training loss curve on Qwen2.5-7B. Agreement granularity changes supervision robustness. Token- and sequence-level agreement offer different trade-offs across auxiliary teacher constr…

**Figure 6.** Distribution of base-scored perplexity and

**Figure 7.** Gains over the original model across Qwen2.5, Llama-3.1, and Gemma-3 on ScienceQA (SQA),

**Figure 8.** Left. Training time comparison of UniSD variants on ScienceQA. Right. Retention perplexity comparison across UniSD variants. These variants preserve high throughput (2.32–3.22M tokens/GPU-hour), showing that adding representation, contrastive, or temporal stabilization incurs only modest overhead. Agreement-based variants require 0.16–0.18 kWh per million tokens and also increase peak memory by roughly 13–…

**Figure 9.** Sensitivity to the number of contexts k and the agreement weight γ. Adding more contexts does not consistently improve accuracy.

| Category | Dataset | Train | Test | License |
|---|---|---|---|---|
| Scientific Reasoning | ScienceQA | 12,726 | 4,241 | CC BY-NC-SA 4.0 |
| Scientific Reasoning | GPQA | – | 448 | CC BY 4.0 / MIT |
| Coding | MBPP | 120 | 257 | CC BY 4.0 |
| Coding | HumanEval | – | 164 | MIT |
| Commonsense QA | CoS-E | 9,741 | 1,221 | BSD-3-Clause |
| Tool Usage | ToolAlpaca | 4,046 | 68 | Apache-2.0 |

**Figure 10.** Sensitivity to the number of contexts k and the agreement weight γ. Adding more contexts does not consistently improve accuracy. E Ethical Considerations: Self-distillation inherits the limitations of the underlying base model, including potential factual errors, social biases, and unsafe behaviors. Although UniSD uses reliability weighting, divergence clipping, and stabilization to reduce the reinforcemen…

**Figure 11.** Comparison of sequence-level agreement across three auxiliary-context strategies: random,

**Figure 12.** Per-dataset gains of UniSD variants over the raw Qwen2.5-7B model. Asterisks (

Original abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
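To make the idea of multi-teacher agreement concrete, here is one hedged way token-level agreement could be turned into a reliability weight. This is a sketch under assumptions, not the paper's formulation: the majority-vote rule, the function names, and the use of top-1 token ids are all illustrative choices, and the agreement weight γ mentioned in the figure captions is not modeled.

```python
from collections import Counter

def token_agreement(teacher_token_ids):
    """Per-position fraction of teachers that pick the majority token.

    `teacher_token_ids` is a list of K equal-length sequences of top-1 token
    ids, one per teacher view. Majority voting is an assumed agreement rule.
    """
    k = len(teacher_token_ids)
    length = len(teacher_token_ids[0])
    if any(len(seq) != length for seq in teacher_token_ids):
        raise ValueError("all teacher sequences must have the same length")
    weights = []
    for pos in range(length):
        votes = Counter(seq[pos] for seq in teacher_token_ids)
        weights.append(votes.most_common(1)[0][1] / k)
    return weights

def agreement_weighted_loss(token_losses, weights):
    """Average per-token losses, down-weighting positions where teachers disagree."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(token_losses, weights)) / total

# Three hypothetical teacher views over a three-token continuation.
views = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
weights = token_agreement(views)
loss = agreement_weighted_loss([0.2, 0.4, 0.9], weights)
```

In this toy case the third position, where every view proposes a different token, contributes least to the loss, which is the intuition behind treating agreement as a proxy for supervision reliability.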

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UniSD unifies several self-distillation mechanisms for LLMs and reports modest benchmark gains that are worth checking in full but rest on empirical claims needing more verification.

The punchline is that this paper unifies several self-distillation techniques into one framework for LLMs and shows they can deliver better performance than using them separately or than static imitation. The gains are +5.4 over the base model and +2.8 over the top baseline across the tested setups.

The paper does well by taking isolated design choices from earlier work and testing them together. It looks at multi-teacher agreement for reliable supervision, EMA for stabilization, token-level contrastive learning and feature matching for alignment, and divergence clipping for stability. By running experiments on six benchmarks with six models from three different families, it reveals how these components interact and which ones matter most for different tasks. Building UniSDfull as the integrated version based on those findings is a logical outcome. This kind of methodical combination helps clarify the roles of each part in a way that scattered studies did not. The results point to self-distillation as a steerable way to adapt models without external teachers, which is relevant for resource-limited scenarios.

Where it could be stronger is in the verification of the results. The abstract gives the headline numbers, but full details on how baselines were implemented, whether error bars are reported, and rules for data handling are needed to be confident the gains are solid. The idea that these mechanisms can be combined without creating new instabilities or biases is the key assumption, and it would be good to see more evidence that this holds across the board.

This paper is for people in the LLM adaptation area who are looking for ways to improve models using their own outputs. A reader interested in practical methods for self-improvement would get value from the component analysis and the cross-task insights.

I would recommend sending it for peer review. The empirical coverage is wide enough that referees can provide useful feedback on the setups and whether the unification holds up under closer inspection.

Referee Report

0 major / 3 minor

Summary. The paper proposes UniSD, a unified self-distillation framework for autoregressive LLMs that integrates complementary mechanisms—multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping—to address supervision reliability, representation alignment, and training stability. It systematically evaluates these components across six benchmarks and six models from three families, identifies their interactions and when self-distillation outperforms static imitation, and constructs an integrated UniSDfull pipeline that reports +5.4 point gains over the base model and +2.8 points over the strongest baseline.

Significance. If the reported gains hold under full experimental scrutiny, the work would be significant for providing a systematic, component-level analysis of self-distillation rather than isolated design choices. The empirical demonstration that an integrated pipeline can deliver consistent improvements without external teachers, together with insights on component interactions, offers a practical contribution to efficient LLM adaptation. The multi-model, multi-benchmark scope strengthens generalizability claims.

minor comments (3)

[Abstract] Abstract: the claim of 'strongest overall performance' and specific point gains (+5.4 / +2.8) would be more informative if the exact six benchmarks and six models (including their sizes and families) were named rather than summarized.
[Method] The description of how the five mechanisms are combined into UniSDfull (e.g., weighting, scheduling, or conditional activation) is only sketched at a high level; a dedicated subsection or algorithm box would clarify reproducibility.
[Experiments] Experimental results: while aggregate gains are reported, the manuscript would benefit from per-task and per-model breakdowns with error bars or statistical significance tests to support the cross-task interaction claims.
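The second comment asks how the mechanisms are weighted or conditionally activated. As a purely hypothetical illustration of what such a specification could look like (the term names, coefficients, and the weighted-sum form are assumptions, not details from the manuscript), a combined objective with per-component toggles might be expressed as:

```python
def combined_loss(terms, weights, enabled=None):
    """Weighted sum of named loss terms, restricted to the enabled components.

    `terms` maps a component name to its loss value, `weights` maps a name to
    its coefficient (default 1.0), and `enabled` is the set of active
    components (None means all). All names here are hypothetical.
    """
    active = set(terms) if enabled is None else set(enabled)
    unknown = active - set(terms)
    if unknown:
        raise KeyError(f"no loss value for components: {sorted(unknown)}")
    return sum(weights.get(name, 1.0) * terms[name] for name in sorted(active))

# Made-up loss values and coefficients for illustration.
terms = {"distill": 2.0, "contrastive": 1.0, "feature": 4.0}
weights = {"distill": 1.0, "contrastive": 0.5, "feature": 0.25}
full = combined_loss(terms, weights)
distill_only = combined_loss(terms, weights, enabled={"distill"})
```

Writing the combination this way makes single-component ablations a matter of changing `enabled`, which is the kind of explicit recipe the comment requests.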

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its significance, and the recommendation for minor revision. We are pleased that the unified framework, component analysis, and empirical gains across models and benchmarks were viewed favorably.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes an empirical framework (UniSD) that combines listed mechanisms and reports measured performance gains (+5.4 over base model, +2.8 over strongest baseline) on external benchmarks and models. No derivation chain, equations, or predictions are shown that reduce by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. Results are externally falsifiable against independent test sets rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard machine-learning assumptions about distillation benefits and stabilization techniques rather than introducing new free parameters or invented entities in the abstract.

axioms (1)

domain assumption: Self-generated trajectories can supply useful supervision for autoregressive LLMs when reliability, alignment, and stability issues are addressed by the listed mechanisms.
Invoked to justify the design of UniSD and the expectation of performance gains.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...
A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC 2026-05 unverdicted novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 19 internal anchors

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023
[2] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023
[3] Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. In CVPR, pages 9590–9601, 2024
[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024
[5] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. NeurIPS, 37:124198–124235, 2024
[6] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388, 2025
[7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015
[8] Yuvanesh Anand, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Benjamin Schmidt, Brandon Duderstadt, and Andriy Mulyar. Gpt4all: An ecosystem of open source compressed language models. In NLP-OSS Workshop, pages 59–64, 2023
[9] Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, and Jindong Wang. Agentark: Distilling multi-agent intelligence into a single llm agent. arXiv:2602.03955, 2026
[10] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv:2601.19897, 2026
[11] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv:2402.13116, 2024
[12] Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, and Josiah Hester. Companioncast: A multi-agent conversational ai framework with spatial audio for social co-viewing experiences. ACM CHI 2026 Workshop on Human-Agent Collaboration, 2026
[13] Stefan M Herzog and Ralph Hertwig. Harnessing the wisdom of the inner crowd. Trends in cognitive sciences, 18(10):504–506, 2014
[14] Or Honovich, Uri Shaham, Samuel Bowman, and Omer Levy. Instruction induction: From few examples to natural language task descriptions. In ACL, pages 1935–1952, 2023
[15] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995
[16] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. Ppdb: The paraphrase database. In NAACL, pages 758–764, 2013
[17] John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. In EMNLP, pages 119–126, 2020
[18] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, and Tuo Zhao. Less is more: Task-aware layer-wise distillation for language model compression. In ICML, pages 20852–20867. PMLR, 2023
[19] Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In ACL, pages 2140–2151, 2021
[20] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024
[21] Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang. Embarrassingly simple self-distillation improves code generation. arXiv:2604.01193, 2026
[22] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv:2601.18734, 2026
[23] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022
[24] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In COLM, 2024
[25] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! leveraging language models for commonsense reasoning. In ACL, pages 4932–4942, 2019
[26] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In NAACL, pages 4149–4158, 2019
[27] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv:2108.07732, 2021
[28] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021
[29] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv:2306.05301, 2023
[30] Qwen Team. Qwen2.5: A party of foundation models, September 2024
[31] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv:2407.21783, 2024
[32] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv:2503.19786, 2025
[33] Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. How context affects language models’ factual predictions. In AKBC, 2020
[34] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. TACL, 12:157–173, 2024
[35] Xin Yang, Hao Yu, Xin Gao, Hao Wang, Junbo Zhang, and Tianrui Li. Federated continual learning via knowledge fusion: A survey. TKDE, 36(8):3832–3850, 2024
[36] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989
[37] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. TPAMI, 46(8):5362–5383, 2024
[38] Yiyang Wang, Yiqiao Jin, Alex Cabral, and Josiah Hester. Mascot: Towards multi-agent socio-collaborative companion systems. arXiv:2601.14230, 2026
[39] Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv:2604.00626, 2026
[40] Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv:2602.04942, 2026
[41] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In ICLR, 2024
[42] Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models. In ICML, 2024
[43] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv:2603.07079, 2026
[44] Zhide Zhong, Haodong Yan, Junfeng Li, Junjie He, Tianran Zhang, and Haoang Li. Vla-opd: Bridging offline sft and online rl for vision-language-action models via on-policy distillation. arXiv:2603.26666, 2026
[45] Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. Scope: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting. arXiv:2604.10688, 2026
[46] Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv:2604.08527, 2026
[47] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv:2601.20802, 2026
[48] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. In ACL, pages 3645–3650, 2019
[49] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63, 2020
[50] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv:2104.10350, 2021
[51]

CodeCarbon: mlco2/codecarbon v2.4.1, May 2024

Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, et al. CodeCarbon: mlco2/codecarbon v2.4.1, May 2024. Software

[52]

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv:1910.09700, 2019

[53]

Victor Avelar, Dan Azevedo, Alan French, and Emerson Network Power. PUE: a comprehensive examination of the metric. White paper, 49:52, 2012

[54]

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2021

[55]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

[56]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In SOSP, pages 611–626, 2023

[57]

Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, et al. Beyond magic words: Sharpness-aware prompt evolving for robust large language models with TARE. In ICLR, 2026

Table 2: Comparison of teacher-forced conditional perplexity on gold completions (§3.5). Lower values indi...

