Knowledge Distillation Must Account for What It Loses
Pith reviewed 2026-05-08 03:27 UTC · model grok-4.3
The pith
Distillation often lets students match teacher task scores while losing the capabilities that make those scores reliable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that current evaluation assumes retained task scores imply retained teacher capabilities, and that reframing distillation as a lossy projection shows students can match selected observables without preserving the capabilities that make teacher behavior reliable. Existing studies already contain concrete, recurring, measurable off-metric losses that go unaccounted for when only retention is reported.
What carries the argument
Reframing knowledge distillation as a lossy projection, together with a taxonomy of off-metric distillation losses and the proposed Distillation Loss Statement that reports preserved elements, lost elements, and acceptable remaining losses.
If this is right
- Evaluations will need to check preservation of specific teacher capabilities beyond headline task metrics.
- Different deployment scenarios will require distinct preservation targets rather than uniform score matching.
- A Distillation Loss Statement will document what was kept, what was lost, and the justification for remaining losses.
- Studies will shift from reporting only retained performance to also quantifying and accepting off-metric losses.
Where Pith is reading between the lines
- Benchmark suites could add capability probes that are independent of the original training task to expose hidden losses.
- In regulated domains such as healthcare or autonomous systems, the statement could become part of model release documentation.
- The same logic may apply to other compression methods like pruning or quantization where performance metrics can mask behavioral drift.
Load-bearing premise
That retained task scores do not reliably indicate preserved teacher capabilities, and that off-metric losses are concrete enough to be identified and measured in practice.
What would settle it
A controlled distillation experiment in which students match the teacher's task scores while also showing no measurable differences on capability tests for robustness, calibration, or out-of-distribution behavior would weaken the claim.
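As a toy sketch (not from the paper) of what such a capability test could look like: two synthetic "models" are constructed to have identical task accuracy while only one remains calibrated, and expected calibration error (ECE) separates them where the headline score cannot. All distributions and numbers here are invented for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence, then
    average |bin accuracy - bin confidence| weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
n = 20_000

# Both "models" are constructed to answer 80% of items correctly,
# so the headline task score is identical by design.
correct = rng.random(n) < 0.8

# Hypothetical teacher: confidence tracks the true 80% accuracy.
teacher_conf = np.clip(rng.normal(0.80, 0.05, n), 0.0, 1.0)
# Hypothetical student: same accuracy, systematically overconfident.
student_conf = np.clip(rng.normal(0.95, 0.03, n), 0.0, 1.0)

print(f"task accuracy (both): {correct.mean():.3f}")
print(f"teacher ECE: {expected_calibration_error(teacher_conf, correct):.3f}")
print(f"student ECE: {expected_calibration_error(student_conf, correct):.3f}")
```

A null result on probes like this one, across robustness and OOD tests as well, is the kind of evidence that would count against the paper's position.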
read the original abstract
This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. This matters because distillation is increasingly used to turn large teacher models into deployable students, yet headline metrics can obscure losses in the capabilities that make teacher behavior reliable. Conceptually, we show that current evaluation often assumes retained task scores imply retained teacher capabilities. Reframing distillation as a lossy projection exposes this flaw: students may match selected teacher observables without preserving the capabilities that make them reliable. We then synthesize existing evidence into a taxonomy of off-metric distillation losses, showing that such losses are concrete, recurring, and measurable, yet often unaccounted for when studies report what students retain rather than what they lose. To make the position actionable, we propose scenario-specific preservation targets and a Distillation Loss Statement that reports what was preserved, what was lost, and why the remaining losses are acceptable. The goal is not lossless distillation, but accountable distillation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. It reframes distillation as a lossy projection to expose the flaw in assuming retained task scores imply retained capabilities. It then synthesizes existing evidence into a taxonomy of off-metric distillation losses, showing they are concrete, recurring, and measurable yet often unaccounted for, and proposes scenario-specific preservation targets along with a Distillation Loss Statement that reports what was preserved, what was lost, and why remaining losses are acceptable.
Significance. If the position holds, this work could shift evaluation norms in distillation research toward more accountable reporting of capability losses, particularly for deployed student models in reliability-sensitive settings. The synthesis of prior evidence into a structured taxonomy and the introduction of concrete tools (preservation targets and the Distillation Loss Statement) provide a practical framework that builds directly on existing literature without introducing new parameters or ungrounded entities.
major comments (2)
- [Proposal for preservation targets and Distillation Loss Statement] The proposal for scenario-specific preservation targets and the Distillation Loss Statement is central to the claim of actionability. The manuscript does not supply a template, example format, or worked illustration of the Statement (e.g., what fields it would contain or how it would be populated for a concrete distillation scenario), which is load-bearing for readers to assess its feasibility.
- [Taxonomy of off-metric distillation losses] The taxonomy of off-metric distillation losses asserts that such losses 'are concrete, recurring, and measurable, yet often unaccounted for.' Because this synthesis underpins the reframing and the call for change, the manuscript should include at least one specific citation or brief summary per category that demonstrates an observed loss in prior work that was omitted from standard task-score reporting.
minor comments (1)
- [Abstract] The abstract introduces the term 'Distillation Loss Statement' without a one-sentence definition or parenthetical gloss; a brief clarification on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our position paper and the recommendation for minor revision. We address each major comment below, agreeing to incorporate concrete additions that strengthen the actionability and evidentiary support of our proposals without altering the core arguments.
read point-by-point responses
- Referee: [Proposal for preservation targets and Distillation Loss Statement] The proposal for scenario-specific preservation targets and the Distillation Loss Statement is central to the claim of actionability. The manuscript does not supply a template, example format, or worked illustration of the Statement (e.g., what fields it would contain or how it would be populated for a concrete distillation scenario), which is load-bearing for readers to assess its feasibility.
Authors: We agree that an explicit template and worked example are necessary to demonstrate feasibility. In the revised manuscript, we will add a new subsection providing a clear template for the Distillation Loss Statement with fields including Scenario Description, Preservation Targets, Measured Losses (with methods), and Justification for Acceptability of Remaining Losses. We will populate this template with a worked illustration drawn from a standard distillation scenario in the literature (e.g., distilling a vision transformer for image classification), showing how the fields would be completed based on patterns from existing studies. This addition will be placed in the section on making the position actionable. revision: yes
- Referee: [Taxonomy of off-metric distillation losses] The taxonomy of off-metric distillation losses asserts that such losses 'are concrete, recurring, and measurable, yet often unaccounted for.' Because this synthesis underpins the reframing and the call for change, the manuscript should include at least one specific citation or brief summary per category that demonstrates an observed loss in prior work that was omitted from standard task-score reporting.
Authors: We acknowledge the value of grounding each taxonomy category with specific evidence. We will revise the taxonomy section to include, for every loss category, at least one citation to prior work accompanied by a brief summary of the observed off-metric loss that was not reported via standard task scores. These citations and summaries will be selected from the existing literature synthesized in the paper, ensuring the additions remain within the scope of a position paper and do not require new experiments. This will directly support the claim that such losses are recurring yet unaccounted for. revision: yes
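For concreteness, the Statement template promised in the first response could be sketched as a simple structured record. The field names follow the rebuttal's list; the scenario and all populated values below are hypothetical, not drawn from the paper.

```python
from dataclasses import dataclass, asdict

@dataclass
class DistillationLossStatement:
    """Sketch of the proposed Statement; field names mirror the
    rebuttal's proposal, values below are invented for illustration."""
    scenario_description: str
    preservation_targets: list[str]
    measured_losses: dict[str, str]  # capability -> method and observed loss
    justification: str

# Hypothetical worked example for a vision-transformer distillation scenario.
statement = DistillationLossStatement(
    scenario_description="ViT teacher distilled into a small on-device image classifier",
    preservation_targets=[
        "top-1 accuracy within 1 point of teacher",
        "ECE within 0.02 of teacher on held-out data",
    ],
    measured_losses={
        "calibration": "ECE rose from 0.03 (teacher) to 0.09 (student); reliability diagrams",
        "OOD robustness": "accuracy under common corruptions dropped 6 points vs. teacher",
    },
    justification="Downstream decisions gate on a confidence threshold re-tuned for the student",
)
print(asdict(statement)["measured_losses"]["OOD robustness"])
```

The point of such a record is accountability: every off-metric loss either appears under measured_losses with a method attached, or its absence is itself reportable.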
Circularity Check
No significant circularity detected
full rationale
This is a position paper that reframes knowledge distillation conceptually as a lossy projection and synthesizes existing literature into a taxonomy of off-metric losses, without introducing equations, derivations, fitted parameters, or quantitative predictions. The central claims rest on references to prior external evidence rather than internal self-citations, self-definitions, or renamings that reduce to the paper's own inputs by construction. No load-bearing step equates a claimed result to a fitted input or prior author work in a circular manner; the argument remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Current distillation evaluation assumes retained task scores imply retained teacher capabilities
invented entities (1)
- Distillation Loss Statement (no independent evidence)
Reference graph
Works this paper leans on
- [1] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531
- [2] Gou, J., Yu, B., Maybank, S. J., and Tao, D. (2021). Knowledge Distillation: A Survey. International Journal of Computer Vision, 129(6):1789–1819
- [3] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv preprint arXiv:1910.01108
- [4]
- [5] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948
- [6] Kang, M., Jeong, J., Lee, S., Cho, J., and Hwang, S. J. (2025). Distilling LLM Agent into Small Models with Retrieval and Code Tools. OpenReview, NeurIPS 2025 Conference (Spotlight)
- [7] Stanton, S., Izacard, G., and Roux, N. L. (2021). Does Knowledge Distillation Really Work? In Advances in Neural Information Processing Systems
- [8] Ojha, U., Li, Y., Sundara Rajan, A., Liang, Y., and Lee, Y. J. (2023). What Knowledge Gets Distilled in Knowledge Distillation? In Advances in Neural Information Processing Systems, 36:11037–11048
- [9]
- [10] Hebbalaguppe, R., Baranwal, M., Prakash, J., Madan, N., Anand, K., and Arora, C. (2024). Understanding Calibration Transfer in Knowledge Distillation. OpenReview, ICLR 2024 withdrawn submission
- [11] Stacey, J. and Rei, M. (2024). Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2239–2258
- [12] Jagielski, M., Nasr, M., Lee, K., Choquette-Choo, C. A., Carlini, N., and Tramèr, F. (2023). Students Parrot Their Teachers: Membership Inference on Model Distillation. In Advances in Neural Information Processing Systems, 36
- [13] Zhang, Z., Shamsabadi, A. S., Lu, H., Cai, Y., and Haddadi, H. (2025). Membership and Memorization in LLM Knowledge Distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20074–20084
- [14] Chen, Y., Benton, J., Radhakrishnan, A., Uesato, J., Denison, C., Schulman, J., Somani, A., Hase, P., Wagner, M., Roger, F., Mikulik, V., Bowman, S. R., Leike, J., Kaplan, J., and Perez, E. (2025). Reasoning Models Don't Always Say What They Think. arXiv preprint arXiv:2505.05410
- [15] Song, M. and Zheng, M. (2026). A Survey of On-Policy Distillation for Large Language Models. arXiv preprint arXiv:2604.00626
- [16] Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. (2024). AI Models Collapse When Trained on Recursively Generated Data. Nature, 631:755–759
- [17] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. (2019). Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229
- [18] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12):86–92
- [19] Zhao, D., Andrews, J. T. A., Papakyriakopoulos, O., and Xiang, A. (2024). Position: Measure Dataset Diversity, Don't Just Claim It. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 60644–60673
- [20] Tramèr, F., Kamath, G., and Carlini, N. (2024). Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 48453–48467
- [21]
- [22] Mohammadshahi, A. and Ioannou, Y. (2025). What Is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias. Transactions on Machine Learning Research
- [23] Zhang, M., Liu, D., Zhang, K., Franco, J., and Liu, H. (2026). Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety. arXiv preprint arXiv:2602.11157
- [24] Menon, A. K., Rawat, A. S., Reddi, S. J., Kim, S., and Kumar, S. (2021). A Statistical Perspective on Distillation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7632–7642
- [25] Gu, Y., Dong, L., Wei, F., and Huang, M. (2024). MiniLLM: Knowledge Distillation of Large Language Models. In International Conference on Learning Representations
- [26] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. In International Conference on Learning Representations
- [27] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems
- [28] Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. (2021). Teacher's Pet: Understanding and Mitigating Biases in Distillation. arXiv preprint arXiv:2106.10494
- [29] Borkar, J., Chadha, K., Mireshghallah, N., Zhang, Y., Veliche, I.-E., Mitra, A., Smith, D. A., Xu, Z., and Garcia-Olano, D. (2026). Memorization Dynamics in Knowledge Distillation for Language Models. arXiv preprint arXiv:2601.15394
- [30] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330
- [31] Fan, H., Jiang, Z., Lei, J., and Zhang, M. (2024). Revisit the Essence of Distilling Knowledge Through Calibration. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 12882–12894
- [32] Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. (2024). A Survey of Confidence Estimation and Calibration in Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577–6595
- [33] Kapoor, S., Gruver, N., Roberts, M., Collins, K., Pal, A., Bhatt, U., Weller, A., Dooley, S., Goldblum, M., and Wilson, A. G. (2024). Large Language Models Must Be Taught to Know What They Don't Know. In Advances in Neural Information Processing Systems, 37
- [34] Wen, B., Yao, J., Feng, S., Xu, C., Tsvetkov, Y., Howe, B., and Wang, L. L. (2025). Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics, 13
- [35] Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y., Ratner, A., Krishna, R., Lee, C.-Y., and Pfister, T. (2023). Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017
- [36]
- [37] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. In Advances in Neural Information Processing Systems, 36:74952–74965
- [38] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 33
- [39] Jia, P., Xu, D., Li, X., Du, Z., Li, X., Wang, Y., Wang, Y., Liu, Q., Wang, M., Guo, H., Tang, R., and Zhao, X. (2025). Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4242–4256
- [40] Huang, L., Feng, X., Ma, W., Gu, Y., Zhong, W., Feng, X., Yu, W., Peng, W., Tang, D., Tu, D., and Qin, B. (2024). Learning Fine-Grained Grounded Citations for Attributed Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14095–14113
- [41]
- [42] Muhamed, A., Ribeiro, L. F. R., Dreyer, M., Smith, V., and Diab, M. T. (2026). RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6811–6856
- [43] Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Korbak, T., Sleight, H., Agrawal, R., Hughes, J., Pai, D. B., Gromov, A., Roberts, D., Yang, D., Donoho, D. L., and Koyejo, S. (2024). Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv preprint arXiv:2404.01413
- [44] Awal, M. A., Rochan, M., and Roy, C. K. (2025). A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher? arXiv preprint arXiv:2511.05476
- [45] Tabassi, E. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-1
- [46] Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model Compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541
- [47] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations
- [48] Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018). Born Again Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1607–1616
- [49] Phuong, M. and Lampert, C. (2019). Towards Understanding Knowledge Distillation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5142–5151
- [50] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, 35:24824–24837
- [51] Li, Y., Zhang, H., Cao, J., Ma, X., and Gao, J. (2023). Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2665–2679
- [52] Lanham, T., Garriga-Alonso, A., Cooper, A. F., Hill, K., Greenblatt, R., Noble, R., Birch, A., and others (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv preprint arXiv:2307.13702
- [53] Madsen, A., Chandar, S., and Reddy, S. (2024). Are Self-Explanations from Large Language Models Faithful? In Findings of the Association for Computational Linguistics: ACL 2024, pages 295–337
- [54] Cao, L. (2024). Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3628–3646
A Representative Reporting-Pattern Checklist
Table 4 is a representative 50-paper checklist used...
Work · evidence type · role in our argument
- [55] Retention and distribution: Introduces soft targets as information beyond hard labels
- [56] Method taxonomy: Distinguishes response-, feature-, and relation-based KD
- [57] Retention evidence: Shows successful compression and retained language-understanding performance
- [58] LLM KD background: Documents the diversity of modern LLM distillation settings
- [59] Reasoning distillation: Illustrates the contemporary importance of distilled reasoning students
- [60] Agent/tool distillation: Shows that tool behavior can become a distillation target
- [61] Distribution loss: Shows student predictive distributions may diverge from teachers
- [62] Distribution loss: Explains why teacher probability estimates can matter beyond accuracy
- [63] Generative distribution: Studies how distribution-matching choices affect LLM KD
- [64] Theory: Analyzes why KD can work without reducing success to score retention
- [65] Property transfer: Studies which off-task properties are inherited by students
- [66] Loss study: Directly studies information loss between teacher and student
- [67] Representation preservation: Uses intermediate hints, showing outputs alone may be insufficient
- [68] Relation preservation: Transfers attention and value relations, not only final outputs
- [69] Counterpoint: Shows students may outperform teachers on some metrics
- [70] Robustness loss: Shows adversarial robustness may fail to transfer under KD
- [71] OOD loss: Shows in-distribution gains do not guarantee target robustness
- [72] Subgroup behavior: Studies uneven group-wise effects of distillation
- [73] Fairness loss: Examines fairness and bias after knowledge transfer
- [74] Calibration metric: Establishes confidence calibration as distinct from accuracy
- [75] Calibration transfer: Studies whether calibration transfers through KD
- [76] Calibration as KD: Treats calibration as central to distilling knowledge
- [77] Uncertainty background: Surveys confidence estimation and calibration in LLMs
- [78] Uncertainty behavior: Argues models must learn what they do not know
- [79] Abstention: Surveys abstention as a distinct LLM capability
- [80] Rationale distillation: Shows rationales can improve small-model learning