pith. sign in

arxiv: 2606.00306 · v1 · pith:R53NXR6Inew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Rethinking the Role of Temperature in Large Language Model Distillation

Pith reviewed 2026-06-28 23:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM distillationtemperature scalingforward KLreverse KLknowledge transferinstruction following
0
0 comments X

The pith

Increasing temperature in LLM distillation makes forward KL outperform reverse KL on instruction tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the standard choice of reverse KL over forward KL for distilling large language models, noting that prior comparisons often fixed temperature at one. Raising temperature softens the teacher outputs and adds information from non-dominant tokens to forward KL gradients, while it mostly rescales the already dominant signals in reverse KL. This asymmetry means forward KL pulls ahead at higher temperatures across instruction-following benchmarks. The same temperature adjustment also lifts results for a wider set of distillation losses, so basic KL methods can match more elaborate recent techniques.

Core claim

Temperature scaling in LLM distillation produces an asymmetric effect: it substantially enriches forward KL with non-dominant token signals while mainly rescaling gradients for reverse KL. As a result, although reverse KL outperforms forward KL at temperature one, forward KL surpasses reverse KL at higher temperatures across instruction-following benchmarks. The benefit of temperature extends beyond these two divergences to a broader family of distillation objectives.

What carries the argument

Temperature scaling of the teacher distribution inside forward and reverse KL distillation losses, which creates an asymmetric enrichment of gradient signals from non-dominant tokens.

If this is right

  • FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks.
  • Temperature scaling improves performance for a broader family of distillation objectives beyond the two KL variants.
  • Simple KL-based distillation methods reach competitive results against recent state-of-the-art approaches once temperature is raised.
  • The performance gain stems from temperature enriching non-dominant token signals more in FKL than in RKL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Temperature may need to be tuned jointly with the choice of divergence rather than held fixed at one.
  • The same temperature adjustment could be tested in distillation settings outside language models to check for similar asymmetry.
  • Practitioners could add temperature sweeps as a standard step when selecting or combining distillation losses.

Load-bearing premise

The observed asymmetry between forward and reverse KL under temperature scaling holds across model sizes, datasets, and training setups without being driven by other unmentioned choices.

What would settle it

Re-running the distillation experiments on a new collection of model sizes or datasets and finding that reverse KL still outperforms forward KL at high temperatures would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.00306 by Hoang-Chau Luong, Lingwei Chen.

Figure 1
Figure 1. Figure 1: Effect of temperature on FKL and RKL across model scales. Temperature consistently improves FKL [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Temperature-sensitivity heatmap on GPT-2 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $\tau$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that temperature scaling τ in LLM distillation has an asymmetric effect on forward KL (FKL) versus reverse KL (RKL): τ enriches FKL with non-dominant token signals while primarily rescaling RKL gradients. Consequently, although RKL outperforms FKL at τ=1, FKL surpasses RKL at higher τ across instruction-following benchmarks. The work further asserts that temperature improves a broader family of distillation objectives, allowing simple KL-based methods to compete with recent SOTA approaches.

Significance. If the empirical reversal and gradient asymmetry hold under controlled conditions, the result would challenge the prevailing preference for RKL in distillation and underscore temperature as a first-class hyperparameter. The mechanistic analysis of how τ affects each divergence provides a concrete explanation that, if supported by the experiments, elevates the contribution beyond a simple benchmark comparison.

major comments (2)
  1. [Experimental protocol / §4] The central empirical claim—that FKL overtakes RKL at high τ—rests on the assumption that temperature is the sole variable altered between comparisons. The manuscript must explicitly document (e.g., in the experimental protocol section) that optimizer, learning-rate schedule, number of epochs, data order, and initialization were identical across all FKL/RKL × τ settings; without such controls the observed reversal could arise from differential convergence or regularization effects rather than the claimed asymmetry.
  2. [Gradient analysis / §3] The abstract states that temperature “substantially enriches FKL with non-dominant token signals” while “mainly rescales RKL gradients,” yet no quantitative measure (e.g., gradient norm ratios, token-probability histograms, or an equation defining the enrichment) is referenced. A load-bearing claim of this form requires an explicit derivation or table (e.g., Eq. (X) or Table Y) showing the asymmetry before the benchmark results can be interpreted as evidence for it.
minor comments (2)
  1. [Results / §5] Clarify whether statistical significance (e.g., paired t-tests or bootstrap intervals) was computed for the benchmark reversals; reporting only mean scores leaves open the possibility that differences lie within run-to-run variance.
  2. [Broader objectives / §6] The claim that temperature improves “a broader family of distillation objectives” should be accompanied by the precise list of objectives tested and the corresponding performance deltas, ideally in a single summary table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We address each major comment below.

read point-by-point responses
  1. Referee: [Experimental protocol / §4] The central empirical claim—that FKL overtakes RKL at high τ—rests on the assumption that temperature is the sole variable altered between comparisons. The manuscript must explicitly document (e.g., in the experimental protocol section) that optimizer, learning-rate schedule, number of epochs, data order, and initialization were identical across all FKL/RKL × τ settings; without such controls the observed reversal could arise from differential convergence or regularization effects rather than the claimed asymmetry.

    Authors: We agree that explicit documentation strengthens the interpretation of the results. In the revised manuscript we will expand §4 to state that the optimizer, learning-rate schedule, number of epochs, data order, and initialization were held fixed across all FKL/RKL × τ combinations. revision: yes

  2. Referee: [Gradient analysis / §3] The abstract states that temperature “substantially enriches FKL with non-dominant token signals” while “mainly rescales RKL gradients,” yet no quantitative measure (e.g., gradient norm ratios, token-probability histograms, or an equation defining the enrichment) is referenced. A load-bearing claim of this form requires an explicit derivation or table (e.g., Eq. (X) or Table Y) showing the asymmetry before the benchmark results can be interpreted as evidence for it.

    Authors: Section 3 already contains the derivation showing the asymmetric effect of τ on the two divergences. To make the claim more directly verifiable we will add an explicit table (new Table Y) reporting gradient-norm ratios and token-probability histograms for representative τ values, with a cross-reference from the abstract and §3. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

full rationale

The paper's central claims rest on empirical benchmark comparisons of FKL vs RKL under temperature scaling, plus analysis of how temperature affects token signals and gradients in the loss. No derivation chain is presented that reduces by construction to fitted inputs, self-citations, or ansatzes; the asymmetry result is demonstrated experimentally rather than forced by definition or prior self-work. This matches the default expectation for an empirical study with no load-bearing mathematical reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is abstract-only so ledger is necessarily incomplete; the work relies on standard definitions of forward and reverse KL and on the assumption that benchmark scores reflect genuine knowledge transfer differences.

free parameters (1)
  • temperature τ
    Central experimental variable whose scaling is shown to produce the reported asymmetry; value is chosen rather than derived.
axioms (1)
  • standard math KL divergence definitions and gradient forms behave as described under temperature scaling
    Invoked when stating the asymmetric enrichment of non-dominant tokens in FKL versus rescaling in RKL.

pith-pipeline@v0.9.1-grok · 5707 in / 1207 out tokens · 19461 ms · 2026-06-28T23:24:27.103104+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    Rethinking kullback-leibler divergence in knowledge distillation for large language models , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  2. [2]

    Findings of the Association for Computational Linguistics: NAACL 2024 , pages=

    Weight-inherited distillation for task-agnostic bert compression , author=. Findings of the Association for Computational Linguistics: NAACL 2024 , pages=

  3. [3]

    Journal of Machine Learning Research , volume=

    Greedification operators for policy optimization: Investigating forward and reverse kl divergences , author=. Journal of Machine Learning Research , volume=

  4. [4]

    Proceedings of the 2016 conference on empirical methods in natural language processing , pages=

    Sequence-level knowledge distillation , author=. Proceedings of the 2016 conference on empirical methods in natural language processing , pages=

  5. [5]

    Advances in Neural Information Processing Systems , volume=

    Token-scaled logit distillation for ternary weight generative language models , author=. Advances in Neural Information Processing Systems , volume=

  6. [6]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    Promptkd: Distilling student-friendly knowledge for generative language models via prompt tuning , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  7. [7]

    The twelfth international conference on learning representations , year=

    On-policy distillation of language models: Learning from self-generated mistakes , author=. The twelfth international conference on learning representations , year=

  8. [8]

    International Conference on Learning Representations , year=

    Minillm: Knowledge distillation of large language models , author=. International Conference on Learning Representations , year=

  9. [9]

    Distilling the Knowledge in a Neural Network

    Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=

  10. [10]

    Computer Vision and Image Understanding , volume=

    Self-knowledge distillation via dropout , author=. Computer Vision and Image Understanding , volume=. 2023 , publisher=

  11. [11]

    arXiv preprint arXiv:2212.12965 , year=

    Bd-kd: Balancing the divergences for online knowledge distillation , author=. arXiv preprint arXiv:2212.12965 , year=

  12. [12]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    F-divergence minimization for sequence-level knowledge distillation , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  13. [13]

    Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , pages=

    Decoupled knowledge distillation , author=. Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , pages=

  14. [14]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  15. [15]

    LLaMA: Open and Efficient Foundation Language Models

    Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

  16. [16]

    TinyLlama: An Open-Source Small Language Model

    Tinyllama: An open-source small language model , author=. arXiv preprint arXiv:2401.02385 , year=

  17. [17]

    Forty-first International Conference on Machine Learning , year=

    DistiLLM: Towards Streamlined Distillation for Large Language Models , author=. Forty-first International Conference on Machine Learning , year=

  18. [18]

    2025 , volume =

    Wang, Guanghui and Yang, Zhiyong and Wang, Zitai and Wang, Shi and Xu, Qianqian and Huang, Qingming , booktitle =. 2025 , volume =

  19. [19]

    OPT: Open Pre-trained Transformer Language Models

    Opt: Open pre-trained transformer language models , author=. arXiv preprint arXiv:2205.01068 , year=

  20. [20]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  21. [21]

    Distillation of Large Language Models via Concrete Score Matching

    Distillation of Large Language Models via Concrete Score Matching , author=. arXiv preprint arXiv:2509.25837 , year=

  22. [22]

    2009 , publisher=

    Probabilistic graphical models: principles and techniques , author=. 2009 , publisher=

  23. [23]

    arXiv preprint arXiv:2309.16240 , year=

    Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints , author=. arXiv preprint arXiv:2309.16240 , year=

  24. [24]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Scaled decoupled distillation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Decoupled kullback-leibler divergence loss , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    Mike Conover and Matt Hayes and Ankit Mathur and Jianwei Xie and Jun Wan and Sam Shah and Ali Ghodsi and Patrick Wendell and Matei Zaharia and Reynold Xin , title =

  27. [27]

    Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

    Self-instruct: Aligning language models with self-generated instructions , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=

  28. [28]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , year =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  29. [29]

    Proceedings of the 2022 conference on empirical methods in natural language processing , pages=

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks , author=. Proceedings of the 2022 conference on empirical methods in natural language processing , pages=

  30. [30]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Unnatural instructions: Tuning language models with (almost) no human labor , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  31. [31]

    Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

    A diversity-promoting objective function for neural conversation models , author=. Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies , pages=

  32. [32]

    Advances in neural information processing systems , volume=

    Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=

  33. [33]

    The 41st international ACM SIGIR conference on research & development in information retrieval , pages=

    Texygen: A benchmarking platform for text generation models , author=. The 41st international ACM SIGIR conference on research & development in information retrieval , pages=

  34. [34]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    On the efficacy of knowledge distillation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  35. [35]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Improved knowledge distillation via teacher assistant , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  36. [36]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Multi-level logit distillation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  37. [37]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Logit standardization in knowledge distillation , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  38. [38]

    Yuxian Gu and Hao Zhou and Fandong Meng and Jie Zhou and Minlie Huang , booktitle=. Mini. 2025 , url=

  39. [39]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Curriculum temperature for knowledge distillation , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  40. [40]

    2020 , eprint=

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. 2020 , eprint=

  41. [41]

    Findings of the association for computational linguistics: EMNLP 2020 , pages=

    Tinybert: Distilling bert for natural language understanding , author=. Findings of the association for computational linguistics: EMNLP 2020 , pages=

  42. [42]

    A Survey of On-Policy Distillation for Large Language Models

    A Survey of On-Policy Distillation for Large Language Models , author=. arXiv preprint arXiv:2604.00626 , year=

  43. [43]

    Entropy-Aware On-Policy Distillation of Language Models

    Entropy-Aware On-Policy Distillation of Language Models , author=. arXiv preprint arXiv:2603.07079 , year=

  44. [44]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Llm-oriented token-adaptive knowledge distillation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  45. [45]

    arXiv preprint arXiv:2604.00223 , year=

    Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation , author=. arXiv preprint arXiv:2604.00223 , year=

  46. [46]

    Jongwoo Ko and Tianyi Chen and Sungnyun Kim and Tianyu Ding and Luming Liang and Ilya Zharkov and Se-Young Yun , booktitle=. Disti. 2025 , url=

  47. [47]

    arXiv preprint arXiv:2002.03532 , year=

    Understanding and improving knowledge distillation , author=. arXiv preprint arXiv:2002.03532 , year=

  48. [48]

    In Proceedings of ICLR , year =

    Adriana Romero and Nicolas Ballas and Samira Ebrahimi Kahou and Antoine Chassang and Carlo Gatta and Yoshua Bengio , title =. In Proceedings of ICLR , year =

  49. [49]

    International journal of computer vision , volume=

    Knowledge distillation: A survey , author=. International journal of computer vision , volume=. 2021 , publisher=

  50. [50]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Knowledge distillation: A good teacher is patient and consistent , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  51. [51]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Knowledge distillation with refined logits , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  52. [52]

    Consistently Informative Soft-Label Temperature for Knowledge Distillation

    Consistently Informative Soft-Label Temperature for Knowledge Distillation , author=. arXiv preprint arXiv:2605.20357 , year=