Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

Debarghya Mukherjee; Pragya Sur; Radu Lecoiu

arXiv: 2605.17778 · v1 · pith:ACVPXVLTnew · submitted 2026-05-18 · 🧮 math.ST · cs.LG · stat.ME · stat.ML · stat.TH


Pith reviewed 2026-05-20 00:58 UTC · model grok-4.3


keywords: self-distillation · spiked covariance models · spectral shrinkage estimators · covariance estimation · high-dimensional statistics · optimal estimation · iterative shrinkage


The pith

s-step self-distillation achieves optimal performance among spectral shrinkage estimators for spiked covariance matrices with s spikes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes statistical foundations for self-distillation by examining it in spiked covariance models. It considers a broad class of spectral shrinkage estimators that modify the eigenvalues of the sample covariance matrix. For a model with exactly s spikes, the authors prove that applying self-distillation precisely s times yields the estimator with the lowest risk within this class. The result matters because it supplies a theoretical reason for the empirical success of self-distillation in machine learning and links it directly to classical shrinkage estimation in statistics. The analysis extends to federated settings with distributed data, where the best local rule is again a form of self-distillation, though it differs from the rule that is optimal when all data are hosted on a single server.

Core claim

For spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, s steps are necessary for optimality since any (s-k)-step distilled estimator is strictly suboptimal for 1 ≤ k ≤ s. For the special subclass of isotropic covariances, optimally tuned Ridge regression performs best among spectral shrinkage estimators.

What carries the argument

Spectral shrinkage estimators, which apply a shrinkage function to the eigenvalues of the empirical covariance, with self-distillation serving as the iteration mechanism that reaches the optimal shrinkage rule after exactly s steps.
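
A minimal sketch of what "applying a shrinkage function to the eigenvalues" can look like in a linear regression setting. It assumes an estimator of the form β̂ = V f(D) Vᵀ Xᵀy / n, where XᵀX / n = V D Vᵀ; this parametrization and the helper names `spectral_shrinkage` and `ridge_closed_form` are illustrative assumptions, not notation taken from the paper. With f(x) = 1/(x + λ) the construction should coincide with ordinary ridge regression, which the last lines check numerically.

```python
import numpy as np

def spectral_shrinkage(X, y, shrink):
    """Hypothetical helper: beta_hat = V diag(shrink(d)) V^T X^T y / n."""
    n = X.shape[0]
    d, V = np.linalg.eigh(X.T @ X / n)   # eigenvalues d, eigenvectors V
    d = np.clip(d, 0.0, None)            # remove tiny negative round-off
    return V @ (shrink(d) * (V.T @ (X.T @ y / n)))

def ridge_closed_form(X, y, lam):
    """Minimizer of ||y - X b||^2 / n + lam * ||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 80))        # overparametrized toy design
y = rng.standard_normal(50)
lam = 0.3

beta_spec = spectral_shrinkage(X, y, lambda d: 1.0 / (d + lam))
beta_ridge = ridge_closed_form(X, y, lam)
max_gap = float(np.max(np.abs(beta_spec - beta_ridge)))
print("largest coordinate gap between the two forms:", max_gap)
```

Passing a different `shrink` callable changes the estimator without touching the eigenvectors, which is the sense in which the class is "spectral".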

If this is right

s-step self-distillation is necessary and sufficient for optimality in the class of spectral shrinkage estimators when there are s spikes.
Fewer distillation steps produce strictly worse estimators.
Optimally tuned ridge regression is the best spectral shrinkage estimator when the covariance is isotropic.
In a federated setting with multiple data centers, the best local shrinkage rule is a form of self-distillation different from the centralized optimal rule.
The framework connects self-distillation to classical shrinkage methods and explains its predictive improvements.
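
To make the link between self-distillation and shrinkage concrete, the sketch below uses one common formulation of a single distillation step for ridge regression: the student is a ridge fit on a blend of the teacher's predictions and the original labels. The blending weight `xi`, the shared penalty `lam`, and all function names are assumptions made for illustration; the paper's own parametrization may differ. Under these assumptions the student can be rewritten as a shrinkage rule f(x) = (x + (1 − ξ)λ) / (x + λ)², a rational function with numerator degree 1 and denominator degree 2.

```python
import numpy as np

def ridge(X, targets, lam):
    """Minimizer of ||targets - X b||^2 / n + lam * ||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ targets / n)

def one_step_sd(X, y, lam, xi):
    """Refit ridge on a blend of teacher predictions and the original labels."""
    teacher = ridge(X, y, lam)
    blended = xi * (X @ teacher) + (1.0 - xi) * y
    return ridge(X, blended, lam)

def one_step_sd_spectral(X, y, lam, xi):
    """Same estimator written as a shrinkage rule on the eigenvalues of X^T X / n."""
    n = X.shape[0]
    d, V = np.linalg.eigh(X.T @ X / n)
    d = np.clip(d, 0.0, None)
    f = (d + (1.0 - xi) * lam) / (d + lam) ** 2   # degree 1 over degree 2
    return V @ (f * (V.T @ (X.T @ y / n)))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 40))
y = rng.standard_normal(60)
lam, xi = 0.5, 0.7

beta_refit = one_step_sd(X, y, lam, xi)
beta_spec = one_step_sd_spectral(X, y, lam, xi)
max_gap = float(np.max(np.abs(beta_refit - beta_spec)))
print("largest coordinate gap between refit and spectral form:", max_gap)
```

Setting `xi = 0` collapses the rule to 1/(x + λ), so plain ridge is the zero-step member of this family.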

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This optimality suggests testing whether estimating the spike count s in real data and using that many distillation steps improves performance over fixed-step methods.
The results could extend to other high-dimensional estimation problems where iterated procedures might achieve optimality within restricted estimator classes.
Practitioners might consider whether the difference in federated versus centralized rules affects how self-distillation is implemented in distributed machine learning systems.
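
As an illustration of the first suggestion, one simple way to guess the spike count is to count sample-covariance eigenvalues that exceed the Marchenko-Pastur bulk edge (1 + √(p/n))². This is a standard random-matrix heuristic, not a procedure from the paper; the sketch assumes unit noise variance, and the function name `estimate_spike_count` and the safety factor `margin` are hypothetical choices.

```python
import numpy as np

def estimate_spike_count(X, margin=1.1):
    """Count eigenvalues of X^T X / n above margin * (1 + sqrt(p/n))^2.

    Assumes unit noise variance; margin is an ad hoc safety factor.
    """
    n, p = X.shape
    evals = np.linalg.eigvalsh(X.T @ X / n)
    edge = (1.0 + np.sqrt(p / n)) ** 2
    return int(np.sum(evals > margin * edge))

rng = np.random.default_rng(1)
n, p = 400, 200
X = rng.standard_normal((n, p))
X[:, 0] *= np.sqrt(1.0 + 10.0)   # spike of strength 10 on coordinate 0
X[:, 1] *= np.sqrt(1.0 + 20.0)   # spike of strength 20 on coordinate 1

s_hat = estimate_spike_count(X)
print("estimated number of spikes:", s_hat)
```

The estimate could then set the number of distillation steps. Weak spikes near the detection threshold would not be counted by this rule, so it is only a starting point.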

Load-bearing premise

The data follows exactly a spiked covariance model with a known number s of spikes, and the analysis applies only within the restricted class of spectral shrinkage estimators.

What would settle it

Simulate data from a spiked covariance model with a known s, compute the estimation risk or prediction error for the s-step self-distilled estimator and for other spectral shrinkage estimators such as standard ridge or Ledoit-Wolf, and check whether the self-distilled version has the smallest risk; a smaller risk for any other estimator would falsify the optimality.
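
A scaled-down sketch of such an experiment, for a single spike and a single distillation step only. Everything here is an assumption made for illustration: the diagonal one-spike covariance, the signal, the sample sizes, the tuning grids, and the shrinkage form (x + (1 − ξ)λ) / (x + λ)² used for one-step self-distillation. Tuning is "oracle" tuning that uses the true coefficient vector, and the risk is a finite-sample quantity, not the paper's asymptotic limit.

```python
import numpy as np

def shrinkage_estimate(evals, evecs, xty, f_vals):
    """beta_hat = V diag(f(d)) V^T (X^T y / n), with f already evaluated at d."""
    return evecs @ (f_vals * (evecs.T @ xty))

def prediction_risk(beta_hat, beta, sigma_diag):
    """(beta_hat - beta)^T Sigma (beta_hat - beta) for a diagonal Sigma."""
    diff = beta_hat - beta
    return float(np.sum(sigma_diag * diff ** 2))

def run_comparison(n=150, p=300, delta=5.0, noise_sd=1.0, seed=0):
    rng = np.random.default_rng(seed)
    sigma_diag = np.ones(p)
    sigma_diag[0] += delta                      # one spike along the first axis
    beta = rng.standard_normal(p) / np.sqrt(p)  # isotropic part of the signal
    beta[0] += 2.0                              # extra mass on the spike direction
    X = rng.standard_normal((n, p)) * np.sqrt(sigma_diag)
    y = X @ beta + noise_sd * rng.standard_normal(n)

    evals, evecs = np.linalg.eigh(X.T @ X / n)
    evals = np.clip(evals, 0.0, None)
    xty = X.T @ y / n

    lams = np.logspace(-2, 2, 30)
    xis = np.concatenate(([0.0], np.linspace(-2.0, 2.0, 41)))  # xi = 0 is plain ridge

    def risk(lam, xi):
        f_vals = (evals + (1.0 - xi) * lam) / (evals + lam) ** 2
        est = shrinkage_estimate(evals, evecs, xty, f_vals)
        return prediction_risk(est, beta, sigma_diag)

    ridge_best = min(risk(lam, 0.0) for lam in lams)
    sd_best = min(risk(lam, xi) for lam in lams for xi in xis)
    return ridge_best, sd_best

ridge_best, sd_best = run_comparison()
print(f"oracle-tuned ridge risk:       {ridge_best:.4f}")
print(f"oracle-tuned one-step SD risk: {sd_best:.4f}")
```

Because ξ = 0 is in the grid, the self-distilled risk cannot exceed the ridge risk here by construction; the informative quantity is the size of the gap. A fuller test of the claim would add the other competitors named above and average over many replications.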

Figures

Figures reproduced from arXiv: 2605.17778 by Debarghya Mukherjee, Pragya Sur, Radu Lecoiu.

**Figure 1.** Entries of the vector $(b_0^{(K)}, b_1^{(K)}, b_2^{(K)})$ as a function of the number of local servers $K \in \{1, \dots, 40\}$, for a two-spike covariance with $\delta_1 = 2$, $\delta_2 = 3$, $\alpha_1 = 3$, $\alpha_2 = 2.5$, and parameters $c = 3$, $r = 5$, $\sigma_\varepsilon = 2$, $\sigma_0 = 1$. Note the variation in $K$, demonstrating the difference in the optimal local rule (3.9) and aggregation weights (3.8) as a function of the number of clients.

**Figure 2.** Sub-optimality of self-distillation and principal components regression in the isotropic …

**Figure 3.** Limiting prediction risk of optimally tuned self-distillation versus optimally tuned ridge …

**Figure 4.** Optimal self-distillation parameters as functions of …

**Figure 5.** Prediction risk of Ridge, optimal self-distillation (SD), and PCR as a function of the spike …

**Figure 6.** Prediction risk of Ridge, one-step Ridge self-distillation, Lasso, and one-step Lasso self …

**Figure 7.** Proof dependency. Arrows point from results to the results they prove.

**Figure 8.** Numerical check that $b_0^{(K)} \neq 0$ across a wide range of parameter configurations. (a) $b_0^{(K)}$ vs. spike scale $t$ and number of clients $K \in \{2, \dots, 9\}$. (b) $b_0^{(K)}$ vs. $t$ and noise level $\sigma \in (0.5, 20)$. (c) $b_0^{(K)}$ vs. $t$ and aspect ratio $c \in (0.5, 20)$, covering both the underparametrized ($c < 1$) and overparametrized ($c > 1$) regimes. (d) $b_0^{(K)}$ vs. $t$ and signal norm $r \in (\sqrt{\sum_j \alpha_j^2}, 20)$. (e) $b_0^{(K)}$ vs. …

Original abstract

Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops the statistical foundations of self-distillation in spiked covariance models by introducing and analyzing spectral shrinkage estimators. It claims that for spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators and outperforms standard estimators from statistics and machine learning. It further establishes that exactly s steps are necessary, as any (s-k)-step estimator is strictly suboptimal, that optimally tuned ridge regression is best among spectral shrinkage estimators for isotropic covariances, and that in a federated setting the optimal local rule again takes the form of self-distillation (though different from the centralized optimum).

Significance. If the asymptotic risk characterizations and optimality proofs hold, the work supplies a precise theoretical account of self-distillation within the classical spiked model, directly connecting modern machine-learning heuristics to random-matrix shrinkage theory. The explicit recursion for the distilled shrinkage function, the strict necessity result for s steps, and the federated extension are notable strengths that could guide both theory and practice in high-dimensional estimation.

minor comments (3)

[Section 2] The definition of the class of spectral shrinkage estimators would be clearer if the shrinkage function were introduced with an explicit functional form or integral representation before the recursion is stated.
[Figure 1] The legend does not distinguish the curves for different numbers of distillation steps; adding labels or a table of risk values would improve readability.
[Section 5] The federated aggregation rule is presented without an explicit comparison table to the centralized optimum; a side-by-side display of the two shrinkage functions would help readers see the difference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful and positive assessment of our manuscript. We are pleased that the referee finds the asymptotic risk characterizations, the optimality proofs, the necessity of exactly s steps, and the federated extension to be notable strengths, and we appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central derivation establishes optimality of s-step self-distillation within the class of spectral shrinkage estimators under an exact spiked covariance model by invoking standard random-matrix characterizations of spiked eigenvalues together with an explicit recursion for the shrinkage function. These steps are mathematically independent of the target optimality claim and do not reduce to self-definition, fitted-input renaming, or self-citation chains. The assumption of a known spike count s is stated explicitly as part of the model rather than derived from the estimators themselves, and the necessity of exactly s steps follows from direct risk comparisons that remain falsifiable outside the fitted values. No load-bearing ansatz is smuggled via citation, and the results are self-contained against external benchmarks in random matrix theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the domain assumption of a spiked covariance model; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Data are generated from a spiked covariance model with exactly s spikes.
This model class is the setting in which all optimality claims are derived.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 3.1: f_pred^* is a rational function ... numerator degree s, denominator degree s+1 ... realized by s-step self-distillation

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Lemma 3.2: when s=0, f_pred^* = 1/(x + lambda^*) recovers optimal ridge

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 4 internal anchors

[1]

Model compression

Cristian Bucilu ˇa, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, 2006

work page 2006
[2]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[3]

Do deep nets really need to be deep?Advances in Neural Informa- tion Processing Systems, 27, 2014

Lei J Ba and Rich Caruana. Do deep nets really need to be deep?Advances in Neural Informa- tion Processing Systems, 27, 2014

work page 2014
[4]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, et al. Goedel-prover-v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction.arXiv preprint arXiv:2508.03613, 2025

work page internal anchor Pith review arXiv 2025
[5]

Edgev-se: Self-reflective fine-tuning framework for edge-deployable vision-language models.Applied Sciences, 16(2):818, 2026

Yoonmo Jeon, Seunghun Lee, and Woongsup Kim. Edgev-se: Self-reflective fine-tuning framework for edge-deployable vision-language models.Applied Sciences, 16(2):818, 2026

work page 2026
[6]

Born again neural networks

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anand- kumar. Born again neural networks. InInternational conference on machine learning, pages 1607–1616, 2018. 26

work page 2018
[7]

Self-distillation: Towards efficient and compact neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (8):4388–4403, 2021

Linfeng Zhang, Chenglong Bao, and Kaisheng Ma. Self-distillation: Towards efficient and compact neural networks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (8):4388–4403, 2021

work page 2021
[8]

Learning from Noisy Labels with Distillation

Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation.arXiv preprint arXiv:1703.02391, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks.Advances in neural information processing systems, 25, 2012

work page 2012
[10]

Communication-efficient algorithms for statistical optimization.Advances in neural information processing systems, 25, 2012

Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization.Advances in neural information processing systems, 25, 2012

work page 2012
[11]

Parallelized stochastic gradi- ent descent.Advances in neural information processing systems, 23, 2010

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradi- ent descent.Advances in neural information processing systems, 23, 2010

work page 2010
[12]

Distributed linear regression by averaging.The Annals of Statistics, 49(2):918 – 943, 2021

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging.The Annals of Statistics, 49(2):918 – 943, 2021

work page 2021
[13]

Federated Optimization: Distributed Machine Learning for On-Device Intelligence

Jakub Koneˇ cn`y, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Feder- ated optimization: Distributed machine learning for on-device intelligence.arXiv preprint arXiv:1610.02527, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[14]

Communication-efficient learning of deep networks from decentralized data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 ofProceedings of Machine Learning Research, pages 1273–1282, 2017

work page 2017
[15]

Federated multi- task learning.Advances in neural information processing systems, 30, 2017

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi- task learning.Advances in neural information processing systems, 30, 2017

work page 2017
[16]

Adaptive self-distillation for minimizing client drift in heterogeneous feder- ated learning.Transactions on Machine Learning Research, 2024

M Yashwanth, Gaurav Kumar Nayak, Arya Singh, Yogesh Simmhan, and Anirban Chakraborty. Adaptive self-distillation for minimizing client drift in heterogeneous feder- ated learning.Transactions on Machine Learning Research, 2024

work page 2024
[17]

Learn- ing critically: Selective self-distillation in federated learning on non-iid data.IEEE Transac- tions on Big Data, 10(6):789–800, 2022

Yuting He, Yiqiang Chen, XiaoDong Yang, Hanchao Yu, Yi-Hua Huang, and Yang Gu. Learn- ing critically: Selective self-distillation in federated learning on non-iid data.IEEE Transac- tions on Big Data, 10(6):789–800, 2022

work page 2022
[18]

Personalized feder- ated learning via backbone self-distillation

Pengju Wang, Bochao Liu, Dan Zeng, Chenggang Yan, and Shiming Ge. Personalized feder- ated learning via backbone self-distillation. InProceedings of the 5th ACM International Confer- ence on Multimedia in Asia, pages 1–7, 2023

work page 2023
[19]

Federated distillation: A survey

Lin Li, Jianping Gou, Baosheng Yu, Lan Du, and Zhang Yiand Dacheng Tao. Federated dis- tillation: A survey.arXiv preprint arXiv:2404.08564, 2024

work page arXiv 2024
[20]

On the distribution of the largest eigenvalue in principal components analysis.The Annals of statistics, 29(2):295–327, 2001

Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis.The Annals of statistics, 29(2):295–327, 2001

work page 2001
[21]

Donoho, Arian Maleki, and Andrea Montanari

David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing.Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009. 27

work page 2009
[22]

The lasso risk for gaussian matrices.IEEE Transactions on Information Theory, 58(4):1997–2017, 2012

Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices.IEEE Transactions on Information Theory, 58(4):1997–2017, 2012

work page 1997
[23]

State evolution for general approximate message passing algorithms, with applications to spatial coupling.Information and Inference: A Journal of the IMA, 2(2):115–144, 2013

Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling.Information and Inference: A Journal of the IMA, 2(2):115–144, 2013

work page 2013
[24]

Bickel, Chinghway Lim, and Bin Yu

Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors.Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013

work page 2013
[25]

Statistical physics of inference: thresholds and algo- rithms.Advances in Physics, 65(5):453–552, 2016

Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algo- rithms.Advances in Physics, 65(5):453–552, 2016

work page 2016
[26]

Noureddine El Karoui. On the impact of predictor geometry on the performance on high- dimensional ridge-regularized generalized robust regression estimators.Probability Theory and Related Fields, 170(1):95–175, 2018

work page 2018
[27]

Pragya Sur, Yuxin Chen, and Emmanuel J. Candès. The likelihood ratio test in high- dimensional logistic regression is asymptotically a rescaled chi-square.Probability Theory and Related Fields, 175(1):487–558, 2019

work page 2019
[28]

Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high- dimensional logistic regression.Proceedings of the National Academy of Sciences, 116(29):14516– 14525, 2019

work page 2019
[29]

Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high- dimensional ridgeless least squares interpolation.Annals of statistics, 50(2):949, 2022

work page 2022
[30]

Approximate Message Passing algorithms for rotationally invariant matrices.The Annals of Statistics, 50(1):197 – 224, 2022

Zhou Fan. Approximate Message Passing algorithms for rotationally invariant matrices.The Annals of Statistics, 50(1):197 – 224, 2022

work page 2022
[31]

Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J

Oliver Y. Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J. Samworth. A unifying tutorial on approximate message passing.Foundations and Trends in Machine Learning, 15(4): 335–536, 05 2022

work page 2022
[32]

Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features.IEEE Transactions on Information Theory, 69(3):1932–1964, 2023

work page 1932
[33]

The lasso with general gaussian designs with applications to hypothesis testing.The Annals of Statistics, 51(5):2194–2220, 2023

Michael Celentano, Andrea Montanari, and Yuting Wei. The lasso with general gaussian designs with applications to hypothesis testing.The Annals of Statistics, 51(5):2194–2220, 2023

work page 2023
[34]

A friendly tutorial on mean-field spin glass tech- niques for non-physicists.Foundations and Trends in Machine Learning, 17(1):1–173, 01 2024

Andrea Montanari and Subhabrata Sen. A friendly tutorial on mean-field spin glass tech- niques for non-physicists.Foundations and Trends in Machine Learning, 17(1):1–173, 01 2024

work page 2024
[35]

Method-of-moments inference for glms and doubly robust functionals under proportional asymptotics.arXiv preprint arXiv:2408.06103, 2025

Xingyu Chen, Lin Liu, and Rajarshi Mukherjee. Method-of-moments inference for glms and doubly robust functionals under proportional asymptotics.arXiv preprint arXiv:2408.06103, 2025

work page arXiv 2025
[36]

A new central limit the- orem for the augmented IPW estimator: Variance inflation, cross-fit covariance and beyond

Kuanhao Jiang, Rajarshi Mukherjee, Subhabrata Sen, and Pragya Sur. A new central limit the- orem for the augmented IPW estimator: Variance inflation, cross-fit covariance and beyond. The Annals of Statistics, 53(2):647 – 675, 2025. 28

work page 2025
[37]

Characterizing finite-dimensional posterior marginals in high- dimensional GLMs via leave-one-out.arXiv preprint arXiv:2601.00091, 2025

Manuel Sáenz and Pragya Sur. Characterizing finite-dimensional posterior marginals in high- dimensional GLMs via leave-one-out.arXiv preprint arXiv:2601.00091, 2025

work page arXiv 2025
[38]

Robinson

Al Depope, Jakub Bajzik, Marco Mondelli, and Matthew R. Robinson. Joint modeling of whole-genome sequencing data for human height via approximate message passing.Cell Genomics, page 101162, 2026

work page 2026
[39]

Optimal and provable calibration in high-dimensional binary classi- fication: Angular calibration and platt scaling

Yufan Li and Pragya Sur. Optimal and provable calibration in high-dimensional binary classi- fication: Angular calibration and platt scaling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[40]

Spectrum-aware debiasing: A modern inference framework with applications to principal components regression.The Annals of Statistics, 54(2):745 – 770, 2026

Yufan Li and Pragya Sur. Spectrum-aware debiasing: A modern inference framework with applications to principal components regression.The Annals of Statistics, 54(2):745 – 770, 2026

work page 2026
[41]

Optimal unconstrained self-distillation in ridge regression: Strict improvements, precise asymptotics, and one-shot tuning.arXiv preprint arXiv:2602.17565, 2026

Hien Dang, Pratik Patil, and Alessandro Rinaldo. Optimal unconstrained self-distillation in ridge regression: Strict improvements, precise asymptotics, and one-shot tuning.arXiv preprint arXiv:2602.17565, 2026

work page arXiv 2026
[42]

Asymptotic theory of iterated empirical risk minimization, with applications to active learning.arXiv preprint arXiv:2601.23031, 2026

Hugo Cui and Yue M Lu. Asymptotic theory of iterated empirical risk minimization, with applications to active learning.arXiv preprint arXiv:2601.23031, 2026

work page arXiv 2026
[43]

Self-distillation amplifies regular- ization in hilbert space

Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regular- ization in hilbert space. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[44]

Yang, and Qiang Sun

Mingqi Wu, Archer Y. Yang, and Qiang Sun. Why self-training helps and hurts: Denoising vs. signal forgetting.arXiv preprint arXiv:2602.14029, 2026

work page arXiv 2026
[45]

On the mechanisms of weak-to-strong generalization: A theoretical perspective

Behrad Moniri and Hamed Hassani. On the mechanisms of weak-to-strong generalization: A theoretical perspective. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[46]

High-dimensional analysis of knowledge distillation: Weak-to-strong gener- alization and scaling laws

Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-strong gener- alization and scaling laws. InThe Thirteenth International Conference on Learning Representa- tions, 2025

work page 2025
[47]

Scaling laws for learning with real and surrogate data

Ayush Jain, Andrea Montanari, and Eren Sasoglu. Scaling laws for learning with real and surrogate data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[48]

Towards understanding ensemble, knowledge distilla- tion and self-distillation in deep learning

Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distilla- tion and self-distillation in deep learning. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[49]

Understanding self-distillation in the presence of label noise

Rudrajit Das and Sujay Sanghavi. Understanding self-distillation in the presence of label noise. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7102–7140, 23–29 Jul 2023

work page 2023
[50]

Understanding the gains from repeated self-distillation.Advances in Neural Information Processing Systems, 37:7759–7796, 2024

Divyansh Pareek, Simon S Du, and Sewoong Oh. Understanding the gains from repeated self-distillation.Advances in Neural Information Processing Systems, 37:7759–7796, 2024

work page 2024
[51]

Towards understanding knowledge distillation

Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5142–5151, 09–15 Jun 2019. 29

work page 2019
[52]

The effect of optimal self-distillation in noisy gaussian mixture model

Kaito Takanami, Takashi Takahashi, and Ayaka Sakata. The effect of optimal self-distillation in noisy gaussian mixture model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[53]

Preventing model collapse under over- parametrization: Optimal mixing ratios for interpolation learning and ridge regression

Anvit Garg, Sohom Bhattacharya, and Pragya Sur. Preventing model collapse under over- parametrization: Optimal mixing ratios for interpolation learning and ridge regression. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026
[54]

Self-boost via op- timal retraining: An analysis via approximate message passing

Adel Javanmard, Rudrajit Das, Alessandro Epasto, and Vahab Mirrokni. Self-boost via op- timal retraining: An analysis via approximate message passing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[55]

Distillation≈early stopping? harvest- ing dark knowledge utilizing anisotropic information retrieval for overparameterized neural network.arXiv preprint arXiv:1910.01255, 2019

Bin Dong, Jikai Hou, Yiping Lu, and Zhihua Zhang. Distillation≈early stopping? harvest- ing dark knowledge utilizing anisotropic information retrieval for overparameterized neural network.arXiv preprint arXiv:1910.01255, 2019

work page arXiv 1910
[56]

Solvable model for inheriting the regularization through knowledge distillation

Luca Saglietti and Lenka Zdeborova. Solvable model for inheriting the regularization through knowledge distillation. InProceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 ofProceedings of Machine Learning Research, pages 809–846, 2022

work page 2022
[57]

Towards a theory of model distillation.arXiv preprint arXiv:2403.09053, 2024

Enric Boix-Adsera. Towards a theory of model distillation.arXiv preprint arXiv:2403.09053, 2024

work page arXiv 2024
[58]

Semi-supervised sparse gaussian classification: Provable bene- fits of unlabeled data

Eyar Azar and Boaz Nadler. Semi-supervised sparse gaussian classification: Provable bene- fits of unlabeled data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[59]

Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks.ICML 2013 Workshop : Challenges in Representation Learning (WREPL), 07 2013

Dong-Hyun Lee. Pseudo-label : The simple and efficient semi-supervised learning method for deep neural networks.ICML 2013 Workshop : Challenges in Representation Learning (WREPL), 07 2013

work page 2013
[60]

Self-training: A survey.Neurocomputing, 616:128904, February 2025

Amini Massih-Reza, Vasilii Feofanov, Loïc Pauletto, Liès Hadjadj, Émilie Devijver, and Yury Maximov. Self-training: A survey.Neurocomputing, 616:128904, February 2025

work page 2025
[61]

Duchi, and Percy S

Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, and Percy S. Liang. Un- labeled data improves adversarial robustness. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[62] Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5468–5479, 13–18 Jul 2020.

[63] Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. In Advances in Neural Information Processing Systems, volume 33, pages 21061–21071, 2020.

[64] Samet Oymak and T. C. Gulcu. Statistical and algorithmic insights for semi-supervised learning with self-training. arXiv preprint arXiv:2006.11006, 2020.

[65] Colin Wei, Kai Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622, 2020.

[66] Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen, and Jinjun Xiong. How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis. arXiv preprint arXiv:2201.08514, 2022.

[67] Vladimir A. Marchenko and Leonid A. Pastur. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507–536, 1967.

[68] Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.

[69] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.

[70] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. arXiv preprint arXiv:1805.10939, 2020.

[71] Denny Wu and Ji Xu. On the optimal weighted ℓ2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems, volume 33, pages 10112–10123. Curran Associates, Inc., 2020.

[72] Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco. Asymptotics of ridge(less) regression under general source condition. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3889–3897, 2021.

[73] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani. Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3178–3186, 2021.

[74] Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. J. Mach. Learn. Res., 24(1), January 2023. ISSN 1532-4435.

[75] Pratik Patil, Jin-Hong Du, and Ryan J. Tibshirani. Optimal ridge regularization for out-of-distribution prediction. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024.

[76] Lee H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.

[77] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.

[78] J. W. Silverstein and S.-I. Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995.

[79] Jack Silverstein. Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2):331–339, 1995.

[80] Romain Couillet and Zhenyu Liao. Random matrix methods for machine learning. Cambridge University Press, 2022.


