On the Generalization of Knowledge Distillation: An Information-Theoretic View
Pith reviewed 2026-05-14 20:17 UTC · model grok-4.3
The pith
The student's generalization gap under knowledge distillation is bounded, relative to the teacher's gap, by the KL divergence between the teacher and student training kernels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
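A schematic way to read the two bounds (the abstract does not state exact constants or rates, so the forms below are assumptions for orientation, with gen(·) the generalization gap, D_KD the distillation divergence, and n the training-set size):

```latex
% Schematic only: the rates in n and the constants c_1, c_2 are placeholders,
% not the paper's stated dependence.
\[
\operatorname{gen}(\text{student}) \;\le\; \operatorname{gen}(\text{teacher})
  + c_1 \sqrt{\frac{D_{\mathrm{KD}}}{n}}
  \qquad \text{(algorithmic stability, sub-Gaussian loss)}
\]
\[
\operatorname{gen}(\text{student}) \;\ge\; \operatorname{gen}(\text{teacher})
  - c_2 \, \frac{D_{\mathrm{KD}}}{n}
  \qquad \text{(central condition, sharper dependence on } D_{\mathrm{KD}} \text{)}
\]
```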
What carries the argument
The distillation divergence: the Kullback-Leibler divergence between the stochastic kernels describing the teacher and student training processes.
If this is right
- The student's generalization gap is at most the teacher's gap plus a term controlled by the distillation divergence under stability.
- The student's gap is at least the teacher's gap minus a term that grows with the same divergence under the central condition.
- Locally flat teacher loss surfaces produce strictly tighter bounds on the student.
- In linear Gaussian models the divergence splits into explicit bias, variance, and rank costs that can guide architecture and temperature choices.
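As a rough illustration of the last point, here is a minimal numerical sketch, assuming the teacher and student kernels can be treated as Gaussians (the paper's actual linear-Gaussian decomposition may differ): the standard Gaussian KL formula splits into a mean-mismatch ("bias") term and a covariance ("variance") term, and a rank bottleneck in the student shows up as a blow-up of the covariance term.

```python
import numpy as np

def gaussian_kl_split(m_s, S_s, m_t, S_t):
    """KL( N(m_s, S_s) || N(m_t, S_t) ), returned as (bias_cost, variance_cost)."""
    d = len(m_t)
    S_t_inv = np.linalg.inv(S_t)
    diff = m_s - m_t
    bias = 0.5 * diff @ S_t_inv @ diff                       # mean-mismatch ("bias") cost
    var = 0.5 * (np.trace(S_t_inv @ S_s) - d
                 + np.log(np.linalg.det(S_t) / np.linalg.det(S_s)))  # covariance cost
    return bias, var

# Hypothetical example: a full-rank 4-d "teacher" kernel vs. a nearly rank-2
# "student" kernel. The small jitter keeps the KL finite; as it shrinks, the
# covariance term blows up -- a toy analogue of a rank-bottleneck cost.
rng = np.random.default_rng(0)
m_t, S_t = np.zeros(4), np.eye(4)
m_s = np.array([0.3, 0.0, 0.0, 0.0])
U = rng.standard_normal((4, 2))
S_s = U @ U.T + 1e-4 * np.eye(4)

bias, var = gaussian_kl_split(m_s, S_s, m_t, S_t)
print(f"bias cost = {bias:.3f}, variance/rank cost = {var:.3f}")
```

Tracking the two terms separately is the kind of diagnostic the decomposition suggests: the bias term reacts to systematic teacher-student mismatch, while the covariance term reacts to compression-induced rank loss.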
Where Pith is reading between the lines
- Teachers should be chosen or regularized for local flatness even when their accuracy is only average.
- The same kernel-divergence construction may apply to other transfer settings such as self-distillation or co-training.
- If the central condition holds approximately in deep networks, the lower bound supplies a practical target for divergence minimization during distillation.
- The linear-Gaussian decomposition suggests monitoring rank deficiency as a separate diagnostic when distilling compressed models.
Load-bearing premise
Teacher and student training can be treated as coupled stochastic processes whose transition kernels admit a well-defined KL divergence.
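One way to make this premise concrete, purely as an assumption for illustration (the abstract does not specify the dynamics): if both training processes are modeled as noisy gradient updates sharing isotropic Gaussian noise, the one-step kernels are Gaussians with a common covariance, absolute continuity holds automatically, and the per-step KL reduces to a scaled squared gradient difference.

```latex
% Illustrative Langevin-style updates (an assumption, not the paper's construction):
%   theta_{k+1} = theta_k - eta * grad L_i(theta_k) + sqrt(2*eta*sigma^2) * xi,  xi ~ N(0, I)
\[
D_{\mathrm{KL}}\!\left( P_{S}(\cdot \mid \theta_k) \,\middle\|\, P_{T}(\cdot \mid \theta_k) \right)
= \frac{\eta}{4\sigma^{2}} \left\lVert \nabla L_{S}(\theta_k) - \nabla L_{T}(\theta_k) \right\rVert^{2}
\]
```

Under this toy model the divergence is finite whenever both gradients are, and it vanishes exactly when the student's update direction matches the teacher's.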
What would settle it
A concrete training run on a fixed architecture and dataset in which the measured distillation divergence is small yet the student's generalization gap exceeds the teacher's gap by more than the upper bound allows, or falls below what the lower bound permits.
Original abstract
Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper models teacher and student training in knowledge distillation as coupled stochastic processes, introduces a distillation divergence defined as the KL divergence between the corresponding Markov kernels, and derives two generalization bounds for the student relative to the teacher's gap—an upper bound via algorithmic stability under a sub-Gaussian assumption and a lower bound under a central condition—plus a loss-sharpness-aware variant that tightens when the teacher is locally flat. A linear-Gaussian case study decomposes the divergence into bias, variance, and rank-bottleneck terms.
Significance. If the derivations are valid and the modeling assumptions hold beyond the linear-Gaussian setting, the framework supplies a principled information-theoretic account of distillation's generalization benefit together with an interpretable decomposition that could guide practical design choices. The explicit tightness regime for the sharpness-aware bound is a positive feature.
major comments (3)
- [Abstract] The upper bound is obtained by invoking algorithmic stability under a sub-Gaussian tail assumption on the loss; this assumption is verified only in the linear-Gaussian case study, where the loss is quadratic, yet the paper claims the bound for general distillation. In typical neural-network settings the loss is unbounded and gradient noise is heavier-tailed, so the stability constant can grow with width or depth and render the claimed dependence on the distillation divergence vacuous.
- [Abstract] The modeling premise that teacher and student training admit well-defined coupled stochastic kernels whose KL divergence is finite is load-bearing for both bounds; no conditions guaranteeing existence or finiteness of this KL are stated, and it is unclear whether the premise holds for standard non-convex distillation losses.
- [Abstract] The lower bound under the central condition and the sharpness-aware tightening inherit the same kernel-modeling premise; without a concrete verification or relaxation of the central condition in distillation settings, the sharper dependence on the distillation divergence remains conditional on an unverified hypothesis.
minor comments (2)
- The distillation divergence is introduced as a new named quantity; an explicit equation number and a short paragraph contrasting it with existing divergences (e.g., mutual information or f-divergences) would improve readability.
- The linear-Gaussian case study should include a brief statement of the precise assumptions under which the bias-variance-rank decomposition holds, so readers can judge its scope.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to clarify assumptions and strengthen the presentation.
point-by-point responses
- Referee: [Abstract] The upper bound is obtained by invoking algorithmic stability under a sub-Gaussian tail assumption on the loss; this assumption is verified only in the linear-Gaussian case study, where the loss is quadratic, yet the paper claims the bound for general distillation. In typical neural-network settings the loss is unbounded and gradient noise is heavier-tailed, so the stability constant can grow with width or depth and render the claimed dependence on the distillation divergence vacuous.
Authors: We agree that the sub-Gaussian assumption is verified explicitly only in the linear-Gaussian case study. The upper bound derivation in Section 3 is presented under this assumption, and the abstract already qualifies it as holding 'under a sub-Gaussian assumption'. We do not claim unconditional validity for general distillation. In the revised manuscript we will add a clarifying paragraph in the discussion section (Section 5) that explicitly addresses the scope: the bound's dependence on distillation divergence is informative when sub-Gaussianity holds (as in the quadratic-loss case), while noting that unbounded losses and heavier-tailed noise in typical neural-network training may render the stability constant large. This will prevent any overstatement of generality. revision: yes
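For context, the standard sub-Gaussian condition under discussion (a textbook definition, not a quotation from the paper) requires the centered loss to have a Gaussian-dominated moment generating function:

```latex
\[
\mathbb{E}\!\left[ \exp\!\bigl( \lambda \,(\ell(w, Z) - \mathbb{E}[\ell(w, Z)]) \bigr) \right]
\;\le\; \exp\!\left( \frac{\lambda^{2} \sigma^{2}}{2} \right)
\qquad \text{for all } \lambda \in \mathbb{R}
\]
```

This holds, for example, for bounded losses; unbounded losses with heavier tails satisfy at best weaker (e.g., sub-exponential) conditions, which is the gap the referee points at.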
- Referee: [Abstract] The modeling premise that teacher and student training admit well-defined coupled stochastic kernels whose KL divergence is finite is load-bearing for both bounds; no conditions guaranteeing existence or finiteness of this KL are stated, and it is unclear whether the premise holds for standard non-convex distillation losses.
Authors: We acknowledge that the manuscript does not state explicit conditions ensuring the coupled kernels are well-defined and that their KL divergence is finite. This modeling premise is introduced in Section 2 as the foundation for the framework. In the revision we will insert a new subsection (Section 2.1) that provides sufficient conditions for existence and finiteness, including absolute continuity of the kernels with respect to a common dominating measure and integrability requirements on the loss functions that guarantee finite relative entropy. For non-convex distillation losses we will add an honest discussion noting that the premise is assumed to hold along typical training trajectories under standard regularization, while acknowledging that a general rigorous guarantee remains open and depends on the specific dynamics. revision: yes
- Referee: [Abstract] The lower bound under the central condition and the sharpness-aware tightening inherit the same kernel-modeling premise; without a concrete verification or relaxation of the central condition in distillation settings, the sharper dependence on the distillation divergence remains conditional on an unverified hypothesis.
Authors: The lower bound and the sharpness-aware variant do rely on the kernel-modeling premise together with the central condition. We will revise the linear-Gaussian case study in Section 4 to include an explicit verification that the central condition holds under the stated covariance assumptions. We will also add a short remark in Section 3.2 discussing the central condition's role and noting that empirical checks or local relaxations around minima may be feasible in practice. The explicit tightness regime for the sharpness-aware bound (when the teacher's loss is locally flat) is already stated and will be emphasized as a concrete regime where the sharper dependence is realized. revision: partial
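For readers less familiar with it, one common form of the central condition from the fast-rates literature (the paper may use a variant) asks for a comparator w* and some eta > 0 such that

```latex
\[
\mathbb{E}_{Z}\!\left[ \exp\!\bigl( -\eta \,(\ell(w, Z) - \ell(w^{*}, Z)) \bigr) \right] \;\le\; 1
\qquad \text{for every } w \text{ in the hypothesis class}
\]
```

so verifying it in the linear-Gaussian case amounts to checking this inequality under the stated covariance assumptions.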
Circularity Check
No circularity: bounds derived from standard stability and central-condition results applied to a newly defined but non-referential divergence
full rationale
The paper defines distillation divergence as the KL between the coupled Markov kernels of teacher and student training processes. It then invokes the standard algorithmic-stability upper bound (under an external sub-Gaussian assumption on the loss) and the standard central-condition lower bound, both of which are independent of the paper's own definitions. The resulting expressions relate the student's generalization gap to the teacher's gap plus terms involving this divergence; the divergence appears as an input quantity rather than being recovered from the bound itself. The linear-Gaussian case study merely decomposes the already-defined divergence into bias/variance/rank terms and does not close any loop. No self-citations are used to justify uniqueness or to smuggle an ansatz, and no fitted parameter is relabeled as a prediction. The derivation chain therefore remains non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Sub-Gaussian assumption on the loss functions
- domain assumption: Central condition
invented entities (1)
- distillation divergence (no independent evidence)