Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu; Zhuang Liu

arxiv: 2605.23857 · v1 · pith:C7FM2EQTnew · submitted 2026-05-22 · 💻 cs.LG · cs.CL


Pith reviewed 2026-05-25 04:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: knowledge distillation, LLM pretraining, teacher-student, loss mixing, weak teacher, generalization


The pith

Even small undertrained teachers can improve larger LLM students when language modeling and distillation losses are mixed properly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that knowledge distillation in LLM pretraining requires a strong teacher to produce better students. By varying teacher and student sizes along with training token budgets, the authors create weak-to-strong, same-level, and strong-to-weak pairs and test distillation effectiveness in each. They show that proper weighting of the language modeling loss with the distillation loss allows even small and undertrained teachers to improve larger students. At the same time, further increasing teacher size or training tokens can saturate or reverse those gains. Distillation improves out-of-distribution and downstream performance more readily than in-domain fitting.

Core claim

With appropriate mixing of the language modeling objective and the knowledge distillation loss, distillation from small or undertrained teachers improves larger student models during LLM pretraining, while increasing teacher parameters or training tokens can saturate or reverse the gains; distillation also aids generalization more than in-domain performance.

What carries the argument

The weighted sum of language modeling loss and knowledge distillation loss applied during student pretraining.
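
A minimal sketch of such a weighted objective for a single token position follows. It rests on assumptions that this page does not confirm: the mixing coefficient `alpha` is taken to weight the distillation term (so `alpha = 0` is plain language modeling), the distillation term is taken to be the KL divergence from teacher to student at temperature 1, and the names `log_softmax` and `mixed_loss` are illustrative rather than the authors' code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mixed_loss(student_logits, teacher_logits, target_id, alpha):
    """Assumed form: (1 - alpha) * LM cross-entropy + alpha * KL(teacher || student)."""
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Language modeling term: negative log-likelihood of the observed next token.
    lm_loss = -s_logp[target_id]
    # Distillation term: KL divergence from the teacher's distribution to the student's.
    kd_loss = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return (1.0 - alpha) * lm_loss + alpha * kd_loss
```

Under this assumed convention, a small `alpha` keeps the student anchored to the data while still taking some signal from the teacher, which is one way to read the claim that weak teachers help only when the losses are mixed properly.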

If this is right

Distillation remains useful in weak-to-strong setups when losses are mixed correctly.
Increasing teacher strength yields diminishing or negative returns past a saturation point.
Generalization metrics improve more consistently than in-domain perplexity under distillation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training budgets for teachers could be reduced without losing distillation value to students.
The saturation effect may appear when the teacher has seen more data than the student relative to model capacity.
The same loss-mixing approach might apply when distilling across different architectures rather than size differences.

Load-bearing premise

Observed gains and reversals stem from the teacher-student relationship and loss weighting rather than differences in total compute, data order, or hyperparameter choices.

What would settle it

Equalizing total compute and data exposure across teacher strengths while keeping loss mixing fixed, then checking whether the reported saturation or reversal of student gains still appears.

Figures

Figures reproduced from arXiv: 2605.23857 by Taiming Lu, Zhuang Liu.

**Figure 1.** Effective distillation in LLM pretraining depends on teacher–student compatibility, not strong teacher.

**Figure 2.** A strong teacher is not always needed in distillation pretraining.

**Figure 3.** Distillation improvement under best loss mixing coefficient.

**Figure 4.** Relative performance of 𝛼 across teacher configurations. Column-wise, each cell shows the normalized performance within each teacher’s runs, where darker colors indicate better 𝛼 for that teacher. Left panel groups teachers by architecture and training tokens; right panel orders all teachers by in-domain perplexity. All three evaluation types follow the same pattern, with downstream accuracy being noisier:…

**Figure 6.** Distillation improvement versus teacher capability.

**Figure 5.** Distillation improvement versus teacher in-domain perplexity under different loss mixing coefficient.

**Figure 7.** Training loss curves for distillation pretraining across all teacher configurations. Columns are ordered by teacher architecture size and rows by teacher token budget. Each subplot shows training loss versus tokens seen by the 1.7B student, zoomed in for clarity, with different colors indicating loss mixing coefficient 𝛼. Insets show the full y-axis range for reference. Across all configurations, higher 𝛼 …

**Figure 8.** Percentage PPL improvement over baseline across alpha values and datasets.

**Figure 9.** Raw perplexity values across alpha values and datasets.

**Figure 10.** Improvement by token difficulty at each teacher’s best alpha.

**Figure 11.** PPL improvement by token difficulty across all alpha values.

**Figure 12.** Statistical significance of hard-token concentration.

**Figure 13.** Causal analysis of distillation by token category.

**Figure 14.** Distillation vs. label smoothing: opposite benefit profiles.

**Figure 15.** Distribution convergence ratio across alpha.

**Figure 16.** Plots each student’s entropy change against its PPL improvement. Most successful students have positive entropy increase (less confident) yet positive PPL improvement (better predictions). This counterintuitive result indicates that distillation redistributes probability mass from wrong confident predictions to better-calibrated ones, rather than simply sharpening. Strong teachers (8.0B, …

**Figure 17.** Tests whether distillation improvement varies with position in the context window. PPL improvement is computed in four position bins (0–128, 128–512, 512–1024, 1024–2048 tokens). All teachers show essentially flat improvement across positions; there is no early-token vs. late-token effect. This rules out the hypothesis that distillation primarily helps with “cold start” predictions (early tokens with litt…
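
The position-bin analysis described in the Figure 17 caption can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes per-token negative log-likelihoods (in nats) are already available for one 2048-token sequence from both the baseline and the distilled student, and the names `BINS`, `ppl`, and `improvement_by_position` are hypothetical. A real evaluation would average over many sequences.

```python
import math

# Position bins taken from the Figure 17 caption (token offsets, end-exclusive).
BINS = [(0, 128), (128, 512), (512, 1024), (1024, 2048)]

def ppl(nlls):
    """Perplexity from a list of per-token negative log-likelihoods (nats)."""
    return math.exp(sum(nlls) / len(nlls))

def improvement_by_position(baseline_nll, distilled_nll, bins=BINS):
    """Percent PPL improvement of the distilled student over the baseline, per bin."""
    out = {}
    for lo, hi in bins:
        b = ppl(baseline_nll[lo:hi])
        d = ppl(distilled_nll[lo:hi])
        out[(lo, hi)] = 100.0 * (b - d) / b
    return out
```

A flat profile across the four bins, as the caption reports, would show up here as roughly equal values for every key of the returned dictionary.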

Original abstract

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Weak teachers can improve larger students via distillation in pretraining with the right loss mixing, but controls on student training look underspecified.


The central result is that you do not need a stronger teacher for distillation during LLM pretraining. With suitable mixing of the language modeling loss and the distillation loss, even small or lightly trained teachers can produce gains in larger students, and further strengthening the teacher can plateau or hurt performance. The authors also note that the gains show up more on out-of-distribution and downstream tasks than on in-domain fit.

This directly tests the usual strong-to-weak assumption by building teacher-student pairs that run the other way or stay matched, using changes in both parameter count and token budget. That variation is more systematic than most earlier distillation studies, which tend to fix the teacher as larger. The work is therefore useful for anyone thinking about compute trade-offs in the pretraining stage.

The abstract gives only directional claims and no effect sizes, run counts, or error bars, so the size of the saturation and reversal effects is hard to judge from what is shown. The mixing ratio is called “proper” without saying whether it was fixed in advance or tuned after seeing results. The stress-test point lands: the description varies teacher size and tokens but does not state that student training tokens, optimizer settings, or the mixing coefficient itself stay constant across the different teacher conditions. If those student-side factors shift, the attribution to teacher strength becomes less clean.

This is the kind of empirical question that matters for practice, so it is worth sending to a serious referee even though the current write-up needs tighter controls and more quantitative detail to stand up.

Referee Report

3 major / 2 minor

Summary. The manuscript examines the assumption that knowledge distillation in LLM pretraining requires a strong teacher. By varying teacher architecture sizes and training token budgets to create strong-to-weak, same-level, and weak-to-strong pairings, the authors report that appropriate mixing of language modeling and distillation losses allows even small or undertrained teachers to improve larger students. They further find that increasing teacher strength (via parameters or tokens) can saturate or reverse gains, and that distillation improves out-of-distribution and downstream performance more readily than in-domain fitting.

Significance. If the central empirical observations hold after controlling for student-side factors, the results would challenge the prevailing strong-to-weak paradigm in distillation for pretraining and suggest more flexible, compute-efficient teacher choices. The work's strength lies in its systematic variation of teacher configurations and its observation that distillation aids generalization; however, the absence of reported effect sizes, run counts, and variance measures in the abstract limits immediate impact.

major comments (3)

Section 3 (Experimental Setup): the description of the distillation experiments does not state whether student training token count, optimizer settings, data ordering, and total compute are held exactly constant when the teacher size or token budget is varied. This control is load-bearing for the claim that observed gains and reversals arise from the teacher-student loss interaction rather than incidental differences in student optimization effort.
Section 4 (Results): the saturation and reversal claims when increasing teacher parameters or tokens are presented without error bars, number of independent runs, or statistical tests. Without these, it is impossible to assess whether the reversals are robust or could be explained by run-to-run variance.
Section 3.2 (Loss Mixing): the phrase 'proper mixing' of LM and KD losses is central to the positive results with weak teachers, yet the procedure for choosing or validating the mixing coefficient (grid search, fixed value, or post-hoc selection) is not specified. If the coefficient was tuned after observing outcomes, the attribution to teacher quality is weakened.
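
The robustness concern in the Section 4 comment can be made concrete with a small check that compares the size of a reported gain against seed-to-seed spread. The sketch below is one simple way to do that, not a procedure taken from the paper: the helper name `effect_vs_noise` is hypothetical, the comparison uses the larger of the two per-arm sample standard deviations as a crude noise estimate, and the numbers in the usage example are made-up placeholders, not results.

```python
from statistics import mean, stdev

def effect_vs_noise(baseline_runs, distilled_runs):
    """Return (effect, noise) for perplexities from repeated runs of each arm.

    effect: mean baseline perplexity minus mean distilled perplexity
            (positive means distillation lowered perplexity).
    noise:  the larger of the two sample standard deviations across seeds.
    """
    effect = mean(baseline_runs) - mean(distilled_runs)
    noise = max(stdev(baseline_runs), stdev(distilled_runs))
    return effect, noise

# Placeholder perplexities for three seeds per arm (illustrative only).
effect, noise = effect_vs_noise([10.0, 10.2, 9.8], [9.0, 9.1, 8.9])
robust = abs(effect) > 2 * noise  # crude threshold, an assumption of this sketch
```

With only a handful of seeds this is a sanity check rather than a statistical test; a reversal whose magnitude is comparable to `noise` would not support the claim.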

minor comments (2)

Table 2: column headers for teacher and student configurations are abbreviated without a clear legend, making it difficult to map rows to the strong-to-weak vs. weak-to-strong cases discussed in the text.
Figure 3: axis labels for 'distillation gain' do not indicate the exact baseline (pure LM loss or another control) against which gains are measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on experimental controls, statistical reporting, and loss mixing. We address each point below with clarifications from our setup and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: Section 3 (Experimental Setup): the description of the distillation experiments does not state whether student training token count, optimizer settings, data ordering, and total compute are held exactly constant when the teacher size or token budget is varied. This control is load-bearing for the claim that observed gains and reversals arise from the teacher-student loss interaction rather than incidental differences in student optimization effort.

Authors: Student training token count (fixed at 50B tokens), optimizer (AdamW with identical hyperparameters and schedule), data ordering (fixed random seed for the same dataset shuffle), and total compute budget were held exactly constant across all teacher variations. This isolates effects to the loss interaction. We will add an explicit paragraph in Section 3 stating these controls. (revision: yes)

Referee: Section 4 (Results): the saturation and reversal claims when increasing teacher parameters or tokens are presented without error bars, number of independent runs, or statistical tests. Without these, it is impossible to assess whether the reversals are robust or could be explained by run-to-run variance.

Authors: We acknowledge the need for variance reporting. Main results used single runs due to compute limits, but we performed 3 independent runs with different seeds on representative configurations showing saturation/reversal. We will add error bars, report run counts, and include a brief note that variance is smaller than the effect sizes in the revised Section 4. (revision: yes)

Referee: Section 3.2 (Loss Mixing): the phrase 'proper mixing' of LM and KD losses is central to the positive results with weak teachers, yet the procedure for choosing or validating the mixing coefficient (grid search, fixed value, or post-hoc selection) is not specified. If the coefficient was tuned after observing outcomes, the attribution to teacher quality is weakened.

Authors: The mixing coefficient was chosen via a grid search (0.0 to 1.0 in steps of 0.1) on a small held-out validation set prior to main experiments, selecting the value maximizing validation performance for each teacher-student pair. This was not post-hoc. We will specify the grid-search procedure and validation protocol in the revised Section 3.2. (revision: yes)
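
The grid search described in the last response can be sketched as follows. This is a hedged illustration only: the helper names `alpha_grid` and `select_alpha` are hypothetical, validation performance is represented by a loss to be minimized (an assumption of this sketch), and the toy objective in the usage lines stands in for a real short training run plus held-out evaluation.

```python
def alpha_grid(step=0.1):
    """Candidate mixing coefficients from 0.0 to 1.0 inclusive."""
    n = round(1.0 / step)
    # Rounding avoids floating-point drift such as 0.30000000000000004.
    return [round(i * step, 10) for i in range(n + 1)]

def select_alpha(validation_loss, alphas):
    """Return the candidate with the lowest held-out validation loss.

    validation_loss is any callable mapping alpha -> float; in practice it
    would train a student with that alpha and evaluate on held-out data.
    """
    return min(alphas, key=validation_loss)

# Toy stand-in objective with its minimum at 0.3 (illustrative only).
best = select_alpha(lambda a: (a - 0.3) ** 2, alpha_grid())
```

Because the selection is made per teacher-student pair, the search would be repeated once for each pair, with the held-out set kept separate from the data used to report final results.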

Circularity Check

0 steps flagged

No circularity; purely empirical study with no derivations or self-referential fits


The paper reports experimental results on varying teacher-student sizes and token budgets in LLM distillation, with no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations. All claims rest on direct observations from controlled runs rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The central observations rest on the implicit assumption that loss mixing weights can be chosen without introducing new fitted constants that define the result.

pith-pipeline@v0.9.0 · 5675 in / 1103 out tokens · 17012 ms · 2026-05-25T04:39:14.936571+00:00 · methodology


