Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models
Pith reviewed 2026-05-20 00:58 UTC · model grok-4.3
The pith
s-step self-distillation achieves optimal performance among spectral shrinkage estimators for spiked covariance matrices with s spikes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, s steps are necessary for optimality since any (s-k)-step distilled estimator is strictly suboptimal for 1 ≤ k ≤ s. For the special subclass of isotropic covariances, optimally tuned Ridge regression performs best among spectral shrinkage estimators.
What carries the argument
Spectral shrinkage estimators, which apply a shrinkage function to the eigenvalues of the empirical covariance, with self-distillation serving as the iteration mechanism that reaches the optimal shrinkage rule after exactly s steps.
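As a rough illustration of that estimator class, the sketch below applies a scalar shrinkage function to the eigenvalues of the empirical covariance and checks that the rule x -> 1/(x + lambda) reproduces ridge regression. The linear-regression setup, the uncentered covariance X^T X / n, and the names `spectral_shrinkage` and `ridge_rule` are assumptions made for this example; the paper's exact definition may differ.

```python
import numpy as np

def spectral_shrinkage(X, y, f):
    """Return f(S) X^T y / n, where S = X^T X / n and f acts on the eigenvalues of S."""
    n = X.shape[0]
    S = X.T @ X / n                       # empirical covariance (uncentered, by assumption)
    evals, evecs = np.linalg.eigh(S)      # S = V diag(evals) V^T
    shrunk = f(evals)                     # apply the scalar rule to each eigenvalue
    return evecs @ (shrunk * (evecs.T @ (X.T @ y / n)))

def ridge_rule(lam):
    """Shrinkage function x -> 1 / (x + lam); hypothetical helper name."""
    return lambda x: 1.0 / (x + lam)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, p, lam = 200, 50, 0.5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

beta_spectral = spectral_shrinkage(X, y, ridge_rule(lam))
# Closed-form ridge solution with the same normalization, for comparison.
beta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```

Any other estimator in the class is obtained by passing a different function `f`; only the eigenvalues are reshaped, while the eigenvectors of the empirical covariance are left untouched.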
If this is right
- s-step self-distillation is necessary and sufficient for optimality in the class of spectral shrinkage estimators when there are s spikes.
- Fewer distillation steps produce strictly worse estimators.
- Optimally tuned ridge regression is the best spectral shrinkage estimator when the covariance is isotropic.
- In a federated setting with multiple data centers, the best local shrinkage rule is a form of self-distillation different from the centralized optimal rule.
- The framework connects self-distillation to classical shrinkage methods and explains its predictive improvements.
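One way to see why each distillation step can enlarge the reachable set of shrinkage rules is to track the shrinkage function itself. The sketch below assumes a common formulation of self-distillation for ridge regression, which may not match the paper's: the student is a ridge fit to the blended target xi * X beta_teacher + (1 - xi) * y. Under that assumption a step maps the teacher's rule f to ((1 - xi) + xi * x * f(x)) / (x + lam). The names `distill_step` and `self_distill` are hypothetical.

```python
def ridge_rule(lam):
    """Zero-step rule: x -> 1 / (x + lam)."""
    return lambda x: 1.0 / (x + lam)

def distill_step(f_teacher, lam, xi):
    """One assumed distillation step, expressed on shrinkage functions.

    The student solves ridge (penalty lam) against xi * X beta_teacher + (1 - xi) * y,
    which gives f_student(x) = ((1 - xi) + xi * x * f_teacher(x)) / (x + lam).
    """
    return lambda x: ((1.0 - xi) + xi * x * f_teacher(x)) / (x + lam)

def self_distill(lams, xis):
    """Compose len(xis) distillation steps; lams needs one more entry than xis."""
    f = ridge_rule(lams[0])
    for lam, xi in zip(lams[1:], xis):
        f = distill_step(f, lam, xi)
    return f

# One step with a shared penalty collapses to (x + (1 - xi) * lam) / (x + lam)**2.
lam, xi, x0 = 0.5, 0.3, 2.0
f1 = self_distill([lam, lam], [xi])
closed_form = (x0 + (1.0 - xi) * lam) / (x0 + lam) ** 2
```

Under this recursion each step raises the numerator and denominator degrees by one, so s steps give a rational function with numerator degree s and denominator degree s+1, and each step adds free parameters (lam, xi) that fewer steps do not have.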
Where Pith is reading between the lines
- This optimality suggests testing whether estimating the spike count s in real data and using that many distillation steps improves performance over fixed-step methods.
- The results could extend to other high-dimensional estimation problems where iterated procedures might achieve optimality within restricted estimator classes.
- Practitioners might consider whether the difference in federated versus centralized rules affects how self-distillation is implemented in distributed machine learning systems.
Load-bearing premise
The data follows exactly a spiked covariance model with a known number s of spikes, and the analysis applies only within the restricted class of spectral shrinkage estimators.
What would settle it
Simulate data from a spiked covariance model with a known s, compute the estimation risk or prediction error for the s-step self-distilled estimator and for other spectral shrinkage estimators such as standard ridge or Ledoit-Wolf, and check whether the self-distilled version has the smallest risk; a smaller risk for any other estimator would falsify the optimality.
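A minimal sketch of such a simulation follows, for a single spike (s = 1). Everything specific here is an assumption chosen for illustration rather than taken from the paper: the linear model y = X beta + noise, Gaussian rows with covariance I + theta * v v^T, an isotropic random signal, the prediction risk (beta_hat - beta)^T Sigma (beta_hat - beta), and the one-step distilled rule built from a ridge teacher and student sharing one penalty.

```python
import numpy as np

def spectral_estimator(X, y, f):
    """f applied to the eigenvalues of X^T X / n, then multiplied by X^T y / n."""
    n = X.shape[0]
    evals, evecs = np.linalg.eigh(X.T @ X / n)
    return evecs @ (f(evals) * (evecs.T @ (X.T @ y / n)))

def prediction_risk(beta_hat, beta, Sigma):
    """Excess prediction risk under population covariance Sigma."""
    d = beta_hat - beta
    return float(d @ Sigma @ d)

def ridge(lam):
    return lambda x: 1.0 / (x + lam)

def one_step_distilled(lam, xi):
    """Assumed one-step rule: ridge student fit to xi * teacher + (1 - xi) * labels."""
    return lambda x: ((1.0 - xi) + xi * x / (x + lam)) / (x + lam)

def average_risk(f, n=300, p=150, theta=5.0, sigma=1.0, reps=20, seed=0):
    """Monte Carlo average of the prediction risk for one shrinkage rule f."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0                                   # spike direction (assumed known here)
    Sigma = np.eye(p) + theta * np.outer(v, v)   # spiked covariance with s = 1
    root = np.eye(p) + (np.sqrt(1.0 + theta) - 1.0) * np.outer(v, v)  # Sigma^{1/2}
    risks = []
    for _ in range(reps):
        beta = rng.standard_normal(p) / np.sqrt(p)   # isotropic signal (assumption)
        X = rng.standard_normal((n, p)) @ root       # rows have covariance Sigma
        y = X @ beta + sigma * rng.standard_normal(n)
        risks.append(prediction_risk(spectral_estimator(X, y, f), beta, Sigma))
    return float(np.mean(risks))

risk_ridge = average_risk(ridge(0.5))
risk_distilled = average_risk(one_step_distilled(0.5, 0.3))
```

Because the claim concerns optimally tuned rules in an asymptotic regime, a fixed pair (lam, xi) settles nothing on its own. A fair check would minimize `average_risk` over a grid of tuning parameters for each estimator family and compare the minima as n and p grow proportionally.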
Original abstract
Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.
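The federated setting described in the abstract can be pictured with the sketch below, in which each data center computes a local spectral shrinkage estimator and a server combines them. The abstract does not state the aggregation rule, so the weighted average used here is an assumption, as are the ridge-type local rule and the names `local_estimator` and `aggregate`.

```python
import numpy as np

def local_estimator(X, y, f):
    """Spectral shrinkage estimator computed on one center's data."""
    n = X.shape[0]
    evals, evecs = np.linalg.eigh(X.T @ X / n)
    return evecs @ (f(evals) * (evecs.T @ (X.T @ y / n)))

def aggregate(local_betas, weights=None):
    """Server-side weighted average of local estimators (assumed aggregation rule)."""
    B = np.stack(local_betas)
    if weights is None:
        weights = np.full(len(local_betas), 1.0 / len(local_betas))
    return np.asarray(weights) @ B

# Three synthetic centers sharing one signal, for illustration only.
rng = np.random.default_rng(1)
p, lam = 20, 0.5
beta = rng.standard_normal(p) / np.sqrt(p)
centers = []
for n_k in (80, 120, 100):
    X_k = rng.standard_normal((n_k, p))
    centers.append((X_k, X_k @ beta + rng.standard_normal(n_k)))

f_local = lambda x: 1.0 / (x + lam)   # placeholder local rule; the paper's optimum differs
local_betas = [local_estimator(X_k, y_k, f_local) for X_k, y_k in centers]
beta_server = aggregate(local_betas)
```

Only the local rule `f_local` would change to study the paper's claim that the best local rule is a self-distillation-type function distinct from the centralized optimum.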
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops the statistical foundations of self-distillation in spiked covariance models by introducing and analyzing spectral shrinkage estimators. It claims that for spiked covariance matrices with s spikes, s-step self-distillation achieves optimal performance among spectral shrinkage estimators and outperforms standard estimators from statistics and machine learning. It further establishes that exactly s steps are necessary, as any (s-k)-step estimator is strictly suboptimal, that optimally tuned ridge regression is best among spectral shrinkage estimators for isotropic covariances, and that in a federated setting the optimal local rule again takes the form of self-distillation (though different from the centralized optimum).
Significance. If the asymptotic risk characterizations and optimality proofs hold, the work supplies a precise theoretical account of self-distillation within the classical spiked model, directly connecting modern machine-learning heuristics to random-matrix shrinkage theory. The explicit recursion for the distilled shrinkage function, the strict necessity result for s steps, and the federated extension are notable strengths that could guide both theory and practice in high-dimensional estimation.
Minor comments (3)
- [Section 2] The definition of the class of spectral shrinkage estimators would be clearer if the shrinkage function were introduced with an explicit functional form or integral representation before the recursion is stated.
- [Figure 1] The legend does not distinguish the curves for different numbers of distillation steps; adding labels or a table of risk values would improve readability.
- [Section 5] The federated aggregation rule is presented without an explicit comparison table to the centralized optimum; a side-by-side display of the two shrinkage functions would help readers see the difference.
Simulated Author's Rebuttal
We thank the referee for the careful and positive assessment of our manuscript. We are pleased that the referee finds the asymptotic risk characterizations, the optimality proofs, the necessity of exactly s steps, and the federated extension to be notable strengths, and we appreciate the recommendation for minor revision.
Circularity Check
No significant circularity identified
The paper's central derivation establishes optimality of s-step self-distillation within the class of spectral shrinkage estimators under an exact spiked covariance model by invoking standard random-matrix characterizations of spiked eigenvalues together with an explicit recursion for the shrinkage function. These steps are mathematically independent of the target optimality claim and do not reduce to self-definition, fitted-input renaming, or self-citation chains. The assumption of a known spike count s is stated explicitly as part of the model rather than derived from the estimators themselves, and the necessity of exactly s steps follows from direct risk comparisons that remain falsifiable outside the fitted values. No load-bearing ansatz is smuggled via citation, and the results are self-contained against external benchmarks in random matrix theory.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Data are generated from a spiked covariance model with exactly s spikes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Theorem 3.1: f_pred^* is a rational function ... numerator degree s, denominator degree s+1 ... realized by s-step self-distillation
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Lemma 3.2: when s=0, f_pred^* = 1/(x + lambda^*) recovers optimal ridge
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.