pith. sign in

arxiv: 2605.23857 · v1 · pith:C7FM2EQTnew · submitted 2026-05-22 · 💻 cs.LG · cs.CL

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Pith reviewed 2026-05-25 04:39 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords knowledge distillationLLM pretrainingteacher-studentloss mixingweak teachergeneralization
0
0 comments X

The pith

Even small undertrained teachers can improve larger LLM students when language modeling and distillation losses are mixed properly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that knowledge distillation in LLM pretraining requires a strong teacher to produce better students. By varying teacher and student sizes along with training token budgets, the authors create weak-to-strong, same-level, and strong-to-weak pairs and test distillation effectiveness in each. They show that proper weighting of the language modeling loss with the distillation loss allows even small and undertrained teachers to improve larger students. At the same time, further increasing teacher size or training tokens can saturate or reverse those gains. Distillation improves out-of-distribution and downstream performance more readily than in-domain fitting.

Core claim

With appropriate mixing of the language modeling objective and the knowledge distillation loss, distillation from small or undertrained teachers improves larger student models during LLM pretraining, while increasing teacher parameters or training tokens can saturate or reverse the gains; distillation also aids generalization more than in-domain performance.

What carries the argument

The weighted sum of language modeling loss and knowledge distillation loss applied during student pretraining.

If this is right

  • Distillation remains useful in weak-to-strong setups when losses are mixed correctly.
  • Increasing teacher strength yields diminishing or negative returns past a saturation point.
  • Generalization metrics improve more consistently than in-domain perplexity under distillation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training budgets for teachers could be reduced without losing distillation value to students.
  • The saturation effect may appear when the teacher has seen more data than the student relative to model capacity.
  • The same loss-mixing approach might apply when distilling across different architectures rather than size differences.

Load-bearing premise

Observed gains and reversals stem from the teacher-student relationship and loss weighting rather than differences in total compute, data order, or hyperparameter choices.

What would settle it

Equalizing total compute and data exposure across teacher strengths while keeping loss mixing fixed, then checking whether the reported saturation or reversal of student gains still appears.

Figures

Figures reproduced from arXiv: 2605.23857 by Taiming Lu, Zhuang Liu.

Figure 1
Figure 1. Figure 1: Effective distillation in LLM pretraining depends on teacher–student compatibility, not strong teacher. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A strong teacher is not always needed in distillation pretraining. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distillation improvement under best loss mixing coefficient [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Relative performance of 𝛼 across teacher configurations. Column-wise, each cell shows the normalized performance within each teacher’s runs, where darker colors indicate better 𝛼 for that teacher. Left panel groups teachers by architecture and training tokens; right panel orders all teachers by in-domain perplexity. All three evaluation types follow the same pattern, with downstream accuracy being noisier:… view at source ↗
Figure 6
Figure 6. Figure 6: Distillation improvement versus teacher capability [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distillation improvement versus teacher in-domain perplexity under different loss mixing coefficient [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training loss curves for distillation pretraining across all teacher configurations. Columns are ordered by teacher architecture size and rows by teacher token budget. Each subplot shows training loss versus tokens seen by the 1.7B student, zoomed in for clarity, with different colors indicating loss mixing coefficient 𝛼. Insets show the full y-axis range for reference. Across all configurations, higher 𝛼 … view at source ↗
Figure 8
Figure 8. Figure 8: Percentage PPL improvement over baseline across alpha values and datasets. [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Raw perplexity values across alpha values and datasets. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Improvement by token difficulty at each teacher’s best alpha. [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: PPL improvement by token difficulty across all alpha values. [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Statistical significance of hard-token concentration. [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Causal analysis of distillation by token category. [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Distillation vs. label smoothing: opposite benefit profiles. [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Distribution convergence ratio across alpha. [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: tests this by plotting each student’s entropy change against its PPL improvement. Most successful students have positive entropy increase (less confident) yet positive PPL improvement (better predictions). This counterintuitive result indicates that distillation redistributes probability mass from wrong confident predictions to better-calibrated ones, rather than simply sharpening. Strong teachers (8.0B, … view at source ↗
Figure 17
Figure 17. Figure 17: tests whether distillation improvement varies with position in the context window. PPL improvement is computed in four position bins (0–128, 128–512, 512–1024, 1024–2048 tokens). All teachers show essentially flat improvement across positions, there is no early-token vs. late-token effect. This rules out the hypothesis that distillation primarily helps with “cold start” predictions (early tokens with litt… view at source ↗
read the original abstract

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript examines the assumption that knowledge distillation in LLM pretraining requires a strong teacher. By varying teacher architecture sizes and training token budgets to create strong-to-weak, same-level, and weak-to-strong pairings, the authors report that appropriate mixing of language modeling and distillation losses allows even small or undertrained teachers to improve larger students. They further find that increasing teacher strength (via parameters or tokens) can saturate or reverse gains, and that distillation improves out-of-distribution and downstream performance more readily than in-domain fitting.

Significance. If the central empirical observations hold after controlling for student-side factors, the results would challenge the prevailing strong-to-weak paradigm in distillation for pretraining and suggest more flexible, compute-efficient teacher choices. The work's strength lies in its systematic variation of teacher configurations and its observation that distillation aids generalization; however, the absence of reported effect sizes, run counts, and variance measures in the abstract limits immediate impact.

major comments (3)
  1. [Section 3] Section 3 (Experimental Setup): the description of the distillation experiments does not state whether student training token count, optimizer settings, data ordering, and total compute are held exactly constant when the teacher size or token budget is varied. This control is load-bearing for the claim that observed gains and reversals arise from the teacher-student loss interaction rather than incidental differences in student optimization effort.
  2. [Section 4] Section 4 (Results): the saturation and reversal claims when increasing teacher parameters or tokens are presented without error bars, number of independent runs, or statistical tests. Without these, it is impossible to assess whether the reversals are robust or could be explained by run-to-run variance.
  3. [Section 3.2] Section 3.2 (Loss Mixing): the phrase 'proper mixing' of LM and KD losses is central to the positive results with weak teachers, yet the procedure for choosing or validating the mixing coefficient (grid search, fixed value, or post-hoc selection) is not specified. If the coefficient was tuned after observing outcomes, the attribution to teacher quality is weakened.
minor comments (2)
  1. [Table 2] Table 2: column headers for teacher and student configurations are abbreviated without a clear legend, making it difficult to map rows to the strong-to-weak vs. weak-to-strong cases discussed in the text.
  2. [Figure 3] Figure 3: axis labels for 'distillation gain' do not indicate the exact baseline (pure LM loss or another control) against which gains are measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on experimental controls, statistical reporting, and loss mixing. We address each point below with clarifications from our setup and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Experimental Setup): the description of the distillation experiments does not state whether student training token count, optimizer settings, data ordering, and total compute are held exactly constant when the teacher size or token budget is varied. This control is load-bearing for the claim that observed gains and reversals arise from the teacher-student loss interaction rather than incidental differences in student optimization effort.

    Authors: Student training token count (fixed at 50B tokens), optimizer (AdamW with identical hyperparameters and schedule), data ordering (fixed random seed for the same dataset shuffle), and total compute budget were held exactly constant across all teacher variations. This isolates effects to the loss interaction. We will add an explicit paragraph in Section 3 stating these controls. revision: yes

  2. Referee: [Section 4] Section 4 (Results): the saturation and reversal claims when increasing teacher parameters or tokens are presented without error bars, number of independent runs, or statistical tests. Without these, it is impossible to assess whether the reversals are robust or could be explained by run-to-run variance.

    Authors: We acknowledge the need for variance reporting. Main results used single runs due to compute limits, but we performed 3 independent runs with different seeds on representative configurations showing saturation/reversal. We will add error bars, report run counts, and include a brief note that variance is smaller than the effect sizes in the revised Section 4. revision: yes

  3. Referee: [Section 3.2] Section 3.2 (Loss Mixing): the phrase 'proper mixing' of LM and KD losses is central to the positive results with weak teachers, yet the procedure for choosing or validating the mixing coefficient (grid search, fixed value, or post-hoc selection) is not specified. If the coefficient was tuned after observing outcomes, the attribution to teacher quality is weakened.

    Authors: The mixing coefficient was chosen via a grid search (0.0 to 1.0 in steps of 0.1) on a small held-out validation set prior to main experiments, selecting the value maximizing validation performance for each teacher-student pair. This was not post-hoc. We will specify the grid-search procedure and validation protocol in the revised Section 3.2. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical study with no derivations or self-referential fits

full rationale

The paper reports experimental results on varying teacher-student sizes and token budgets in LLM distillation, with no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations. All claims rest on direct observations from controlled runs rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The central observations rest on the implicit assumption that loss mixing weights can be chosen without introducing new fitted constants that define the result.

pith-pipeline@v0.9.0 · 5675 in / 1103 out tokens · 17012 ms · 2026-05-25T04:39:14.936571+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 12 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    AI@Meta. The llama 3 herd of models.arXiv:2407.21783,

  2. [2]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv:1803.05457,

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv:2110.14168,

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv:2501.12948,

  6. [6]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv:1503.02531,

  7. [7]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Hongqiu Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search.arXiv:1909.09436,

  8. [8]

    Openelm: An efficient language model family with open training and inference framework.arXiv:2404.14619,

    Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, and Mohammad Rastegari. Openelm: An efficient language model family with open training and inference framework.arXiv:2404.14619,

  9. [9]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Phi3@Microsoft. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219,

  10. [10]

    Qwen2.5 Technical Report

    Qwen. Qwen2.5 technical report.arXiv:2412.15115,

  11. [11]

    Qwen3 Technical Report

    Qwen. Qwen3 technical report.arXiv:2505.09388,

  12. [12]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv:1910.01108,

  13. [13]

    Chi, and Sagar Jain

    Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H. Chi, and Sagar Jain. Understanding and improving knowledge distillation.arXiv:2002.03532,

  14. [14]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model.arXiv:2401.02385,

  15. [15]

    dark knowledge,

    is a widely used technique to transfer knowledge from a teacher model to a student model by training the student to match the teacher’s predictive distribution. The key insight is that the teacher’s soft predictions contain “dark knowledge,” information about the relative similarities between classes that is not present in hard labels (Neyshabur et al., 2...

  16. [16]

    in vision, KD has since been extended in many directions, including distilling intermediate representations (Romero et al., 2015), attention maps (Zagoruyko and Komodakis, 2017), and relational information between samples (Park et al., 2019). Various works have also studied the theoretical foundations of why distillation works, attributing its success to ...

  17. [17]

    Moreover, KD is not monotonic in teacher strength: for autoregressive LMs, stronger teachers can sometimes degrade student performance (Zhong et al., 2024; Busbridge et al., 2025)

    can qualitatively change distillation outcomes. Moreover, KD is not monotonic in teacher strength: for autoregressive LMs, stronger teachers can sometimes degrade student performance (Zhong et al., 2024; Busbridge et al., 2025). Prior work has shown that stronger teachers can sometimes degrade performance, but has not systematically characterized when thi...

  18. [18]

    In LLMs, while distillation remains a standard compression approach, strict strong-to-weak supervision may be unrealistic in the long run (Burns et al., 2024)

    first showed that same-architecture distillation can still improve the student, with subsequent work clarifying why this effect holds (Zhang et al., 2019; Cho and Hariharan, 2019; Mobahi et al., 2020). In LLMs, while distillation remains a standard compression approach, strict strong-to-weak supervision may be unrealistic in the long run (Burns et al., 20...

  19. [19]

    Despite these developments, systematic studies of how the scale of different components governs logit-level distillation in LLM pretraining remain limited

    and DeepSeek-R1 (DeepSeek-AI, 2025). Despite these developments, systematic studies of how the scale of different components governs logit-level distillation in LLM pretraining remain limited. Busbridge et al. (2025) studies distillation scaling laws, but constrains teachers to a fixed budget distribution, while Peng et al. (2025) studies how student size...

  20. [20]

    Results show percentage improvement over stan- dard pretraining 300B baseline

    Distillation improvement at 300B training to- kens with𝛼= 0.2.Student is 1.7B trained on 300B tokens. Results show percentage improvement over stan- dard pretraining 300B baseline. G.1 Generalization to Different Architecture Families Our main experiments use the Llama3 architecture family. We now test whether our findings transfer to a different architec...

  21. [21]

    cold start

    Entropy change vs. PPL improvement.Each point is one distilled student. Most successful students show increased entropy (less confident) yet improved PPL (more accurate), indicating distillation redistributes probability rather than sharpening predictions. Strong teachers achieve high improvement with modest entropy change; the 0.5B teacher at high alpha ...