arxiv: 2605.17778 · v1 · pith:ACVPXVLTnew · submitted 2026-05-18 · math.ST · cs.LG · stat.ME · stat.ML · stat.TH

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

Pith reviewed 2026-05-20 00:58 UTC · model grok-4.3

classification: math.ST, cs.LG, stat.ME, stat.ML, stat.TH
keywords: self-distillation, spiked covariance models, spectral shrinkage estimators, covariance estimation, high-dimensional statistics, optimal estimation, iterative shrinkage

The pith

s-step self-distillation achieves optimal performance among spectral shrinkage estimators for spiked covariance matrices with s spikes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes statistical foundations for self-distillation by examining it in spiked covariance models. It considers a broad class of spectral shrinkage estimators, which act on the eigenvalues of the sample covariance matrix. For a model with exactly s spikes, the authors prove that applying self-distillation precisely s times yields the estimator with the lowest risk within this class. The result matters because it supplies a theoretical reason for the empirical success of self-distillation in machine learning and links it directly to classical shrinkage estimation in statistics. The analysis extends to federated settings with distributed data, where the optimal local rule is again a form of self-distillation, though it differs from the centralized one.

Core claim

For spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, s steps are necessary for optimality since any (s-k)-step distilled estimator is strictly suboptimal for 1 ≤ k ≤ s. For the special subclass of isotropic covariances, optimally tuned Ridge regression performs best among spectral shrinkage estimators.

What carries the argument

Spectral shrinkage estimators, which apply a shrinkage function to the eigenvalues of the empirical covariance, with self-distillation serving as the iteration mechanism that reaches the optimal shrinkage rule after exactly s steps.
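
As a rough illustration of the idea, the sketch below writes a regression estimator as a function `g` applied to the eigenvalues of the sample covariance, then checks that ridge and one round of ridge self-distillation both fit that template. The blended-label form of self-distillation, the mixing weight `xi`, and every function name here are assumptions made for illustration; this page does not give the paper's exact definitions, which may differ.

```python
import numpy as np

def spectral_shrinkage(X, y, g):
    """Return V diag(g(lam)) V^T X^T y / n, where X^T X / n = V diag(lam) V^T."""
    n = X.shape[0]
    lam, V = np.linalg.eigh(X.T @ X / n)
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off
    return V @ (g(lam) * (V.T @ (X.T @ y / n)))

def ridge_rule(rho):
    # Ridge with penalty rho shrinks each eigen-direction by 1 / (lam + rho).
    return lambda lam: 1.0 / (lam + rho)

def one_step_sd_rule(rho, xi):
    # Assumed form: the student is a ridge fit on the blended labels
    # (1 - xi) * y + xi * X @ beta_teacher, which is again spectral.
    return lambda lam: (1.0 - xi) / (lam + rho) + xi * lam / (lam + rho) ** 2

rng = np.random.default_rng(0)
n, p, rho, xi = 200, 50, 0.5, 0.3
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Direct (non-spectral) computation of the teacher and the distilled student.
S = X.T @ X / n
teacher = np.linalg.solve(S + rho * np.eye(p), X.T @ y / n)
blended = (1.0 - xi) * y + xi * (X @ teacher)
student = np.linalg.solve(S + rho * np.eye(p), X.T @ blended / n)

beta_ridge = spectral_shrinkage(X, y, ridge_rule(rho))
beta_sd = spectral_shrinkage(X, y, one_step_sd_rule(rho, xi))
print(np.allclose(beta_ridge, teacher), np.allclose(beta_sd, student))
```

Under these assumptions, each further distillation round multiplies in another factor of `lam / (lam + rho)`, so the family of reachable shrinkage functions grows with the number of steps.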

If this is right

  • s-step self-distillation is necessary and sufficient for optimality in the class of spectral shrinkage estimators when there are s spikes.
  • Fewer distillation steps produce strictly worse estimators.
  • Optimally tuned ridge regression is the best spectral shrinkage estimator when the covariance is isotropic.
  • In a federated setting with multiple data centers, the best local shrinkage rule is a form of self-distillation different from the centralized optimal rule.
  • The framework connects self-distillation to classical shrinkage methods and explains its predictive improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This optimality suggests testing whether estimating the spike count s in real data and using that many distillation steps improves performance over fixed-step methods.
  • The results could extend to other high-dimensional estimation problems where iterated procedures might achieve optimality within restricted estimator classes.
  • Practitioners might consider whether the difference in federated versus centralized rules affects how self-distillation is implemented in distributed machine learning systems.
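
One simple way to explore the first suggestion is to count sample eigenvalues that sit clearly above the Marchenko-Pastur bulk edge. The sketch below is a heuristic under stated assumptions, not a procedure from the paper: it assumes mean-zero features, a known bulk variance `noise_var`, and a hypothetical safety factor `slack`; the function name is also illustrative.

```python
import numpy as np

def estimate_spike_count(X, noise_var=1.0, slack=1.1):
    """Count sample-covariance eigenvalues above slack * (Marchenko-Pastur upper edge)."""
    n, p = X.shape
    c = p / n
    edge = noise_var * (1.0 + np.sqrt(c)) ** 2  # upper edge of the noise bulk
    eigvals = np.linalg.eigvalsh(X.T @ X / n)
    return int(np.sum(eigvals > slack * edge))

# Synthetic check: two strong spikes along the first two coordinates.
rng = np.random.default_rng(1)
n, p = 2000, 500
scale = np.ones(p)
scale[:2] = [10.0, 6.0]  # population eigenvalues of the spiked directions
X = rng.standard_normal((n, p)) * np.sqrt(scale)
s_hat = estimate_spike_count(X)
print(s_hat)
```

Spikes close to the detection threshold would not separate from the bulk, so a count obtained this way is best read as a lower bound on s.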

Load-bearing premise

The data follow exactly a spiked covariance model with a known number s of spikes, and the optimality claims hold only within the restricted class of spectral shrinkage estimators.

What would settle it

Simulate data from a spiked covariance model with a known s, compute the estimation risk or prediction error for the s-step self-distilled estimator and for other spectral shrinkage estimators such as standard ridge or Ledoit-Wolf, and check whether the self-distilled version has the smallest risk; a smaller risk for any other estimator would falsify the optimality.
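
A minimal version of this experiment might look like the sketch below. It is a finite-sample illustration under several assumptions that do not come from the paper: a single spike aligned with the first coordinate, a Gaussian linear model, the blended-label form of one-step ridge self-distillation with mixing weight `xi`, and oracle tuning over a grid using the true coefficient vector. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate(n, p, spikes, sigma, rng):
    """Draw (X, y) from a linear model with feature covariance diag(spikes, 1, ..., 1)."""
    scale = np.ones(p)
    scale[: len(spikes)] = spikes
    beta = rng.standard_normal(p) / np.sqrt(p)
    X = rng.standard_normal((n, p)) * np.sqrt(scale)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, scale

def risk_grid(X, y, beta, scale, rhos, xis):
    """Excess prediction risk (b - beta)^T Sigma (b - beta) for every (rho, xi) pair."""
    n = X.shape[0]
    lam, V = np.linalg.eigh(X.T @ X / n)
    lam = np.clip(lam, 0.0, None)
    z = V.T @ (X.T @ y / n)
    risks = np.empty((len(rhos), len(xis)))
    for i, rho in enumerate(rhos):
        for j, xi in enumerate(xis):
            # Assumed one-step self-distilled ridge rule; xi = 0 recovers plain ridge.
            g = (1.0 - xi) / (lam + rho) + xi * lam / (lam + rho) ** 2
            d = V @ (g * z) - beta
            risks[i, j] = np.sum(scale * d * d)
    return risks

rng = np.random.default_rng(0)
rhos = np.logspace(-2, 1, 20)
xis = np.concatenate(([0.0], np.linspace(-2.0, 2.0, 21)))  # column 0 is plain ridge
reps = 5
avg = np.zeros((len(rhos), len(xis)))
for _ in range(reps):
    X, y, beta, scale = simulate(n=200, p=100, spikes=[5.0], sigma=1.0, rng=rng)
    avg += risk_grid(X, y, beta, scale, rhos, xis) / reps

ridge_best = avg[:, 0].min()  # best ridge penalty on the grid
sd_best = avg.min()           # best (rho, xi) pair on the grid
print(f"ridge: {ridge_best:.4f}  one-step SD: {sd_best:.4f}")
```

Because `xi = 0` is on the grid, the tuned self-distilled risk cannot exceed the tuned ridge risk here; the informative quantities are the size of the gap and how other shrinkage rules compare. A real test of the claim would use the paper's own s-step rule and its limiting risk formulas rather than this grid search.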

Figures

Figures reproduced from arXiv: 2605.17778 by Debarghya Mukherjee, Pragya Sur, Radu Lecoiu.

Figure 1. Entries of the vector $(b_0^{(K)}, b_1^{(K)}, b_2^{(K)})$ as a function of the number of local servers $K \in \{1, \dots, 40\}$, for a two-spike covariance with $\delta_1 = 2$, $\delta_2 = 3$, $\alpha_1 = 3$, $\alpha_2 = 2.5$, and parameters $c = 3$, $r = 5$, $\sigma_\varepsilon = 2$, $\sigma_0 = 1$. Note the variation in $K$, demonstrating the difference in the optimal local rule (3.9) and aggregation weights (3.8) as a function of the number of clients.

Figure 2. Sub-optimality of self-distillation and principal components regression in the isotropic …

Figure 3. Limiting prediction risk of optimally tuned self-distillation versus optimally tuned ridge …

Figure 4. Optimal self-distillation parameters as functions of …

Figure 5. Prediction risk of Ridge, optimal self-distillation (SD), and PCR as a function of the spike …

Figure 6. Prediction risk of Ridge, one-step Ridge self-distillation, Lasso, and one-step Lasso self …

Figure 7. Proof dependency. Arrows point from results to the results they prove.

Figure 8. Numerical check that $b_0^{(K)} \neq 0$ across a wide range of parameter configurations. (a) $b_0^{(K)}$ vs. spike scale $t$ and number of clients $K \in \{2, \dots, 9\}$. (b) $b_0^{(K)}$ vs. $t$ and noise level $\sigma \in (0.5, 20)$. (c) $b_0^{(K)}$ vs. $t$ and aspect ratio $c \in (0.5, 20)$, covering both the underparametrized ($c < 1$) and overparametrized ($c > 1$) regimes. (d) $b_0^{(K)}$ vs. $t$ and signal norm $r \in (\sqrt{\sum_j \alpha_j^2}, 20)$. (e) $b_0^{(K)}$ vs. …
Original abstract

Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops the statistical foundations of self-distillation in spiked covariance models by introducing and analyzing spectral shrinkage estimators. It claims that for spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators and outperforms standard estimators from statistics and machine learning. It further establishes that exactly s steps are necessary, as any (s-k)-step estimator is strictly suboptimal, that optimally tuned ridge regression is best among spectral shrinkage estimators for isotropic covariances, and that in a federated setting the optimal local rule again takes the form of self-distillation (though different from the centralized optimum).

Significance. If the asymptotic risk characterizations and optimality proofs hold, the work supplies a precise theoretical account of self-distillation within the classical spiked model, directly connecting modern machine-learning heuristics to random-matrix shrinkage theory. The explicit recursion for the distilled shrinkage function, the strict necessity result for s steps, and the federated extension are notable strengths that could guide both theory and practice in high-dimensional estimation.

minor comments (3)
  1. [Section 2] Section 2: the definition of the class of spectral shrinkage estimators would be clearer if the shrinkage function were introduced with an explicit functional form or integral representation before the recursion is stated.
  2. [Figure 1] Figure 1: the legend does not distinguish the curves for different numbers of distillation steps; adding labels or a table of risk values would improve readability.
  3. [Section 5] The federated aggregation rule in Section 5 is presented without an explicit comparison table to the centralized optimum; a side-by-side display of the two shrinkage functions would help readers see the difference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful and positive assessment of our manuscript. We are pleased that the referee finds the asymptotic risk characterizations, the optimality proofs, the necessity of exactly s steps, and the federated extension to be notable strengths, and we appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central derivation establishes optimality of s-step self-distillation within the class of spectral shrinkage estimators under an exact spiked covariance model by invoking standard random-matrix characterizations of spiked eigenvalues together with an explicit recursion for the shrinkage function. These steps are mathematically independent of the target optimality claim and do not reduce to self-definition, fitted-input renaming, or self-citation chains. The assumption of a known spike count s is stated explicitly as part of the model rather than derived from the estimators themselves, and the necessity of exactly s steps follows from direct risk comparisons that remain falsifiable outside the fitted values. No load-bearing ansatz is smuggled via citation, and the results are self-contained against external benchmarks in random matrix theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption of a spiked covariance model; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Data are generated from a spiked covariance model with exactly s spikes.
    This model class is the setting in which all optimality claims are derived.
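
For orientation, one standard way to write such a model is shown below. It is a generic form in the style of Johnstone's spiked model, with the symbols $\ell_j$ and $v_j$ chosen here for illustration; the paper's own parametrization (for example the $\delta_j$ and $\alpha_j$ appearing in the Figure 1 caption) may differ.

```latex
\Sigma \;=\; I_p + \sum_{j=1}^{s} (\ell_j - 1)\, v_j v_j^{\top},
\qquad \ell_1 \ge \dots \ge \ell_s > 1,
\qquad v_j^{\top} v_k = \mathbf{1}\{j = k\}.
```

In this form $\Sigma$ has $s$ eigenvalues $\ell_j$ sitting above a bulk of $p - s$ unit eigenvalues, and with no spikes it reduces to $I_p$, the isotropic case up to scale.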

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 4 internal anchors

  1. [1]

    Model compression

    Cristian Bucilu ˇa, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006

  2. [2]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  3. [3]

    Do deep nets really need to be deep?Advances in Neural Informa- tion Processing Systems, 27, 2014

    Lei J Ba and Rich Caruana. Do deep nets really need to be deep?Advances in Neural Informa- tion Processing Systems, 27, 2014

  4. [4]

    Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

    Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

  5. [5]

    Edgev-se: Self-reflective fine-tuning framework for edge-deployable vision-language models.Applied Sciences, 16(2):818, 2026

    Yoonmo Jeon, Seunghun Lee, and Woongsup Kim. Edgev-se: Self-reflective fine-tuning framework for edge-deployable vision-language models.Applied Sciences, 16(2):818, 2026

  6. [6]

    Born again neural networks

    Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anand- kumar. Born again neural networks. InInternational conference on machine learning, pages 1607–1616, 2018. 26

  7. [7]

    Self-distillation: Towards efficient and compact neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (8):4388–4403, 2021

    Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. Self-distillation: Towards efficient and compact neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (8):4388–4403, 2021

  8. [8]

    Learning from Noisy Labels with Distillation

    Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation.arXiv preprint arXiv:1703.02391, 2017

  9. [9]

    Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

  10. [10]

    Communication-efficient algorithms for statistical optimization.Advances in neural information processing systems, 25, 2012

    Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization.Advances in neural information processing systems, 25, 2012

  11. [11]

    Parallelized stochastic gradi- ent descent.Advances in neural information processing systems, 23, 2010

    Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradi- ent descent.Advances in neural information processing systems, 23, 2010

  12. [12]

    Distributed linear regression by averaging.The Annals of Statistics, 49(2):918 – 943, 2021

    Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging.The Annals of Statistics, 49(2):918 – 943, 2021

  13. [13]

    Federated Optimization: Distributed Machine Learning for On-Device Intelligence

    Jakub Koneˇ cn`y, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Feder- ated optimization: Distributed machine learning for on-device intelligence.arXiv preprint arXiv:1610.02527, 2016

  14. [14]

    Communication-efficient learning of deep networks from decentralized data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 ofProceedings of Machine Learning Research, pages 1273–1282, 2017

  15. [15]

    Federated multi- task learning.Advances in neural information processing systems, 30, 2017

    Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi- task learning.Advances in neural information processing systems, 30, 2017

  16. [16]

    Adaptive self-distillation for minimizing client drift in heterogeneous feder- ated learning.Transactions on Machine Learning Research, 2024

    M Yashwanth, Gaurav Kumar Nayak, Arya Singh, Yogesh Simmhan, and Anirban Chakraborty. Adaptive self-distillation for minimizing client drift in heterogeneous feder- ated learning.Transactions on Machine Learning Research, 2024

  17. [17]

    Learn- ing critically: Selective self-distillation in federated learning on non-iid data.IEEE Transac- tions on Big Data, 10(6):789–800, 2022

    Yuting He, Yiqiang Chen, XiaoDong Yang, Hanchao Yu, Yi-Hua Huang, and Yang Gu. Learn- ing critically: Selective self-distillation in federated learning on non-iid data.IEEE Transac- tions on Big Data, 10(6):789–800, 2022

  18. [18]

    Personalized feder- ated learning via backbone self-distillation

    Pengju Wang, Bochao Liu, Dan Zeng, Chenggang Yan, and Shiming Ge. Personalized feder- ated learning via backbone self-distillation. InProceedings of the 5th ACM International Confer- ence on Multimedia in Asia, pages 1–7, 2023

  19. [19]

    Federated distillation: A survey

    Lin Li, Jianping Gou, Baosheng Yu, Lan Du, and Zhang Yiand Dacheng Tao. Federated dis- tillation: A survey.arXiv preprint arXiv:2404.08564, 2024

  20. [20]

    On the distribution of the largest eigenvalue in principal components analysis.The Annals of statistics, 29(2):295–327, 2001

    Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis.The Annals of statistics, 29(2):295–327, 2001

  21. [21]

    Donoho, Arian Maleki, and Andrea Montanari

    David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing.Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009. 27

  22. [22]

    The lasso risk for gaussian matrices.IEEE Transactions on Information Theory, 58(4):1997–2017, 2012

    Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices.IEEE Transactions on Information Theory, 58(4):1997–2017, 2012

  23. [23]

    State evolution for general approximate message passing algorithms, with applications to spatial coupling.Information and Inference: A Journal of the IMA, 2(2):115–144, 2013

    Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling.Information and Inference: A Journal of the IMA, 2(2):115–144, 2013

  24. [24]

    Bickel, Chinghway Lim, and Bin Yu

    Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors.Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013

  25. [25]

    Statistical physics of inference: thresholds and algo- rithms.Advances in Physics, 65(5):453–552, 2016

    Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algo- rithms.Advances in Physics, 65(5):453–552, 2016

  26. [26]

    Noureddine El Karoui. On the impact of predictor geometry on the performance on high- dimensional ridge-regularized generalized robust regression estimators.Probability Theory and Related Fields, 170(1):95–175, 2018

  27. [27]

    Pragya Sur, Yuxin Chen, and Emmanuel J. Candès. The likelihood ratio test in high- dimensional logistic regression is asymptotically a rescaled chi-square.Probability Theory and Related Fields, 175(1):487–558, 2019

  28. [28]

    Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high- dimensional logistic regression.Proceedings of the National Academy of Sciences, 116(29):14516– 14525, 2019

  29. [29]

    Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

    Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

  30. [30]

    Approximate Message Passing algorithms for rotationally invariant matrices.The Annals of Statistics, 50(1):197 – 224, 2022

    Zhou Fan. Approximate Message Passing algorithms for rotationally invariant matrices.The Annals of Statistics, 50(1):197 – 224, 2022

  31. [31]

    Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J

    Oliver Y. Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J. Samworth. A unifying tutorial on approximate message passing.Foundations and Trends in Machine Learning, 15(4): 335–536, 05 2022

  32. [32]

    Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features.IEEE Transactions on Information Theory, 69(3):1932–1964, 2023

  33. [33]

    The lasso with general gaussian designs with applications to hypothesis testing.The Annals of Statistics, 51(5):2194–2220, 2023

    Michael Celentano, Andrea Montanari, and Yuting Wei. The lasso with general gaussian designs with applications to hypothesis testing.The Annals of Statistics, 51(5):2194–2220, 2023

  34. [34]

    A friendly tutorial on mean-field spin glass tech- niques for non-physicists.Foundations and Trends in Machine Learning, 17(1):1–173, 01 2024

    Andrea Montanari and Subhabrata Sen. A friendly tutorial on mean-field spin glass tech- niques for non-physicists.Foundations and Trends in Machine Learning, 17(1):1–173, 01 2024

  35. [35]

    Method-of-moments inference for glms and doubly robust functionals under proportional asymptotics.arXiv preprint arXiv:2408.06103, 2025

    Xingyu Chen, Lin Liu, and Rajarshi Mukherjee. Method-of-moments inference for glms and doubly robust functionals under proportional asymptotics.arXiv preprint arXiv:2408.06103, 2025

  36. [36]

    A new central limit the- orem for the augmented IPW estimator: Variance inflation, cross-fit covariance and beyond

    Kuanhao Jiang, Rajarshi Mukherjee, Subhabrata Sen, and Pragya Sur. A new central limit the- orem for the augmented IPW estimator: Variance inflation, cross-fit covariance and beyond. The Annals of Statistics, 53(2):647 – 675, 2025. 28

  37. [37]

    Characterizing finite-dimensional posterior marginals in high- dimensional GLMs via leave-one-out.arXiv preprint arXiv:2601.00091, 2025

    Manuel Sáenz and Pragya Sur. Characterizing finite-dimensional posterior marginals in high- dimensional GLMs via leave-one-out.arXiv preprint arXiv:2601.00091, 2025

  38. [38]

    Robinson

    Al Depope, Jakub Bajzik, Marco Mondelli, and Matthew R. Robinson. Joint modeling of whole-genome sequencing data for human height via approximate message passing.Cell Genomics, page 101162, 2026

  39. [39]

    Optimal and provable calibration in high-dimensional binary classi- fication: Angular calibration and platt scaling

    Yufan Li and Pragya Sur. Optimal and provable calibration in high-dimensional binary classi- fication: Angular calibration and platt scaling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  40. [40]

    Spectrum-aware debiasing: A modern inference framework with applications to principal components regression.The Annals of Statistics, 54(2):745 – 770, 2026

    Yufan Li and Pragya Sur. Spectrum-aware debiasing: A modern inference framework with applications to principal components regression.The Annals of Statistics, 54(2):745 – 770, 2026

  41. [41]

    Optimal unconstrained self-distillation in ridge regression: Strict improvements, precise asymptotics, and one-shot tuning.arXiv preprint arXiv:2602.17565, 2026

    Hien Dang, Pratik Patil, and Alessandro Rinaldo. Optimal unconstrained self-distillation in ridge regression: Strict improvements, precise asymptotics, and one-shot tuning.arXiv preprint arXiv:2602.17565, 2026

  42. [42]

    Asymptotic theory of iterated empirical risk minimization, with applications to active learning.arXiv preprint arXiv:2601.23031, 2026

    Hugo Cui and Yue M Lu. Asymptotic theory of iterated empirical risk minimization, with applications to active learning.arXiv preprint arXiv:2601.23031, 2026

  43. [43]

    Self-distillation amplifies regular- ization in hilbert space

    Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regular- ization in hilbert space. InAdvances in Neural Information Processing Systems, 2020

  44. [44]

    Yang, and Qiang Sun

    Mingqi Wu, Archer Y. Yang, and Qiang Sun. Why self-training helps and hurts: Denoising vs. signal forgetting.arXiv preprint arXiv:2602.14029, 2026

  45. [45]

    On the mechanisms of weak-to-strong generalization: A theoretical perspective

    Behrad Moniri and Hamed Hassani. On the mechanisms of weak-to-strong generalization: A theoretical perspective. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  46. [46]

    High-dimensional analysis of knowledge distillation: Weak-to-strong gener- alization and scaling laws

    Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-strong gener- alization and scaling laws. InThe Thirteenth International Conference on Learning Representa- tions, 2025

  47. [47]

    Scaling laws for learning with real and surrogate data

    Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  48. [48]

    Towards understanding ensemble, knowledge distilla- tion and self-distillation in deep learning

    Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distilla- tion and self-distillation in deep learning. InThe Eleventh International Conference on Learning Representations, 2023

  49. [49]

    Understanding self-distillation in the presence of label noise

    Rudrajit Das and Sujay Sanghavi. Understanding self-distillation in the presence of label noise. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7102–7140, 23–29 Jul 2023

  50. [50]

    Understanding the gains from repeated self-distillation.Advances in Neural Information Processing Systems, 37:7759–7796, 2024

    Divyansh Pareek, Simon S Du, and Sewoong Oh. Understanding the gains from repeated self-distillation.Advances in Neural Information Processing Systems, 37:7759–7796, 2024

  51. [51]

    Towards understanding knowledge distillation

    Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5142–5151, 09–15 Jun 2019. 29

  52. [52]

    The effect of optimal self-distillation in noisy gaussian mixture model

    Kaito Takanami, Takashi Takahashi, and Ayaka Sakata. The effect of optimal self-distillation in noisy gaussian mixture model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  53. [53]

    Preventing model collapse under over- parametrization: Optimal mixing ratios for interpolation learning and ridge regression

    Anvit Garg, Sohom Bhattacharya, and Pragya Sur. Preventing model collapse under over- parametrization: Optimal mixing ratios for interpolation learning and ridge regression. In The Fourteenth International Conference on Learning Representations, 2026

  54. [54]

    Self-boost via op- timal retraining: An analysis via approximate message passing

    Adel Javanmard, Rudrajit Das, Alessandro Epasto, and Vahab Mirrokni. Self-boost via op- timal retraining: An analysis via approximate message passing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  55. [55]

    Distillation≈early stopping? harvest- ing dark knowledge utilizing anisotropic information retrieval for overparameterized neural network.arXiv preprint arXiv:1910.01255, 2019

    Bin Dong, Jikai Hou, Yiping Lu, and Zhihua Zhang. Distillation≈early stopping? harvest- ing dark knowledge utilizing anisotropic information retrieval for overparameterized neural network.arXiv preprint arXiv:1910.01255, 2019

  56. [56]

    Solvable model for inheriting the regularization through knowledge distillation

    Luca Saglietti and Lenka Zdeborova. Solvable model for inheriting the regularization through knowledge distillation. InProceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 ofProceedings of Machine Learning Research, pages 809–846, 2022

  57. [57]

    Towards a theory of model distillation.arXiv preprint arXiv:2403.09053, 2024

    Enric Boix-Adsera. Towards a theory of model distillation.arXiv preprint arXiv:2403.09053, 2024

  58. [58]

    Semi-supervised sparse gaussian classification: Provable bene- fits of unlabeled data

    Eyar Azar and Boaz Nadler. Semi-supervised sparse gaussian classification: Provable bene- fits of unlabeled data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  59. [59]

    Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks.ICML 2013 Workshop : Challenges in Representation Learning (WREPL), 07 2013

    Dong-Hyun Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks.ICML 2013 Workshop : Challenges in Representation Learning (WREPL), 07 2013

  60. [60]

    Self-training: A survey.Neurocomputing, 616:128904, February 2025

    Amini Massih-Reza, Vasilii Feofanov, Loïc Pauletto, Liès Hadjadj, Émilie Devijver, and Yury Maximov. Self-training: A survey.Neurocomputing, 616:128904, February 2025

  61. [61]

    Duchi, and Percy S

    Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, and Percy S. Liang. Un- labeled data improves adversarial robustness. InAdvances in Neural Information Processing Systems, volume 32, 2019

  62. [62]

    Understanding self-training for gradual do- main adaptation

    Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual do- main adaptation. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5468–5479, 13–18 Jul 2020

  63. [63]

    Self-training avoids using spurious features under domain shift

    Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. InAdvances in Neural Information Processing Systems, volume 33, pages 21061–21071, 2020

  64. [64]

    Samet Oymak and T. C. Gulcu. Statistical and algorithmic insights for semi-supervised learn- ing with self-training.arXiv preprint arXiv:2006.11006, 2020

  65. [65]

    Colin Wei, Kai Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622, 2020

  66. [66]

    Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen, and Jinjun Xiong. How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis. arXiv preprint arXiv:2201.08514, 2022

  67. [67]

    Vladimir A. Marchenko and Leonid A. Pastur. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507–536, 1967

  68. [68]

    Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006

  69. [69]

    Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005

  70. [70]

    Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. arXiv preprint arXiv:1805.10939, 2020

  71. [71]

    Denny Wu and Ji Xu. On the optimal weighted ℓ₂ regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, volume 33, pages 10112–10123. Curran Associates, Inc., 2020

  72. [72]

    Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3889–3897, 2021

  73. [73]

    Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani. Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3178–3186, 2021

  74. [74]

    Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. J. Mach. Learn. Res., 24(1), January 2023. ISSN 1532-4435

  75. [75]

    Pratik Patil, Jin-Hong Du, and Ryan J. Tibshirani. Optimal ridge regularization for out-of-distribution prediction. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024

  76. [76]

    Lee H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016

  77. [77]

    Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018

  78. [78]

    J. W. Silverstein and S.-I. Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995

  79. [79]

    Jack Silverstein. Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2):331–339, 1995

  80. [80]

    Romain Couillet and Zhenyu Liao. Random matrix methods for machine learning. Cambridge University Press, 2022
