Scaling and renormalization in high-dimensional regression
Pith reviewed 2026-05-24 01:52 UTC · model grok-4.3
The pith
Fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This deterministic equivalence allows analytic formulas for the training and generalization errors to be derived in a few lines via the S-transform of free probability. In all models the S-transform corresponds to the train-test generalization gap and supplies a natural estimator analogous to generalized cross-validation. The same machinery yields fine-grained bias-variance decompositions for random-feature models and identifies regimes in which feature variance or anisotropic weights set the leading performance limits.
What carries the argument
The S-transform of free probability, which encodes the deterministic equivalence obtained by renormalizing the ridge parameter.
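For readers who want the object behind this sentence, the following is a sketch of the standard textbook definition of the S-transform and of one way it can produce a renormalized ridge. The normalization is an assumption: this page does not state the paper's conventions, which may differ by a change of variable. The symbols $q = N/P$ (features over samples), $\mathrm{df}_1$, and the white Wishart example are illustrative choices, not quotations from the paper.

```latex
% Moment generating function of a spectral measure \mu with nonzero mean,
% its functional inverse, and the S-transform
\psi_\mu(z) = \int \frac{t z}{1 - t z}\, d\mu(t), \qquad
\chi_\mu = \psi_\mu^{-1}, \qquad
S_\mu(z) = \frac{1+z}{z}\, \chi_\mu(z).

% Multiplicativity for asymptotically free A and B
S_{AB}(z) = S_A(z)\, S_B(z).

% White Wishart matrix W = X^\top X / P with N features and P samples
S_W(z) = \frac{1}{1 + q z}, \qquad q = \frac{N}{P}.

% Renormalized ridge \kappa for population covariance \Sigma (assumed convention)
\kappa = \lambda\, S_W\!\bigl(-\mathrm{df}_1(\kappa)\bigr)
       = \frac{\lambda}{1 - q\, \mathrm{df}_1(\kappa)}, \qquad
\mathrm{df}_1(\kappa) = \frac{1}{N} \operatorname{tr}\!\bigl[\Sigma (\Sigma + \kappa)^{-1}\bigr].
```

Under this convention the last line is equivalent to the self-consistent equation $\kappa = \lambda + \frac{\kappa}{P} \operatorname{tr}[\Sigma(\Sigma+\kappa)^{-1}]$, so $\kappa \ge \lambda$ and the ratio $\kappa/\lambda$ is the S-transform evaluated at $-\mathrm{df}_1$.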
If this is right
- Exact asymptotic expressions for training and generalization error follow directly from the S-transform once the renormalized ridge parameter is identified.
- The train-test gap is governed by the S-transform, which supplies an analogue of the generalized cross-validation estimator.
- Power-law scalings in performance arise from the analytic structure of the S-transform for different covariance ensembles.
- In random-feature models the variance contributed by the random features themselves can become the performance bottleneck once the model is overparameterized.
- Anisotropic structure in the random weights produces nontrivial exponents for finite-width corrections to the overparameterized limit.
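The generalized cross-validation connection in the list above can be illustrated with a minimal sketch. It assumes isotropic Gaussian covariates, a linear target with label noise, and the ridge objective (1/P)||y - Xw||^2 + lam ||w||^2; none of these choices, nor the names `ridge_fit`, `gcv_estimate`, and `s_hat`, come from the paper. The estimate multiplies the training error by the square of a data-derived factor `s_hat`, which plays the role that the S-transform plays in the asymptotic theory.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge weights for the objective (1/P)||y - Xw||^2 + lam * ||w||^2."""
    P, N = X.shape
    cov_hat = X.T @ X / P  # empirical covariance
    return np.linalg.solve(cov_hat + lam * np.eye(N), X.T @ y / P)

def gcv_estimate(X, y, lam):
    """Return (train_error, s_hat, GCV estimate of the out-of-sample error)."""
    P, N = X.shape
    w_hat = ridge_fit(X, y, lam)
    train_err = np.mean((y - X @ w_hat) ** 2)
    eigs = np.linalg.eigvalsh(X.T @ X / P)
    # (1/P) tr[cov_hat (cov_hat + lam)^-1], the normalized trace of the hat matrix
    dof = np.sum(eigs / (eigs + lam)) / P
    s_hat = 1.0 / (1.0 - dof)  # empirical stand-in for the S-transform factor
    return train_err, s_hat, s_hat**2 * train_err

# Synthetic data (all values are illustrative assumptions)
rng = np.random.default_rng(0)
N, P, lam, sigma = 500, 2000, 0.1, 0.5
w_star = rng.standard_normal(N) / np.sqrt(N)
X = rng.standard_normal((P, N))
y = X @ w_star + sigma * rng.standard_normal(P)

train_err, s_hat, gcv_err = gcv_estimate(X, y, lam)
w_hat = ridge_fit(X, y, lam)
# With identity population covariance the out-of-sample error is exact:
true_err = np.sum((w_hat - w_star) ** 2) + sigma**2

print(f"train error        : {train_err:.4f}")
print(f"s_hat              : {s_hat:.4f}")
print(f"GCV estimate       : {gcv_err:.4f}")
print(f"out-of-sample error: {true_err:.4f}")
```

The estimator uses only training data, which is why it is useful in practice; the comparison with `true_err` is possible here only because the data are synthetic.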
Where Pith is reading between the lines
- The same renormalization perspective may apply to other quadratic estimators or losses whose population versions admit an S-transform description.
- Practical regularization schedules could be improved by estimating the effective ridge shift from finite-sample covariance spectra.
- If the deterministic equivalence survives in certain non-Gaussian or dependent-data settings, it would supply a route to scaling laws for more realistic feature distributions.
Load-bearing premise
The high-dimensional asymptotic regime holds and the data distribution permits the use of free probability, in particular the existence and algebraic properties of the S-transform for the relevant covariance ensembles.
What would settle it
Numerical experiments in large but finite dimension that compare the paper's predicted training and test errors, obtained from the renormalized ridge formula, against direct computation of ridge regression on the same synthetic data; systematic disagreement would falsify the deterministic equivalence.
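A check of this kind can be sketched as follows. The sketch assumes Gaussian covariates with a diagonal population covariance, the ridge objective (1/P)||y - Xw||^2 + lam ||w||^2, and the commonly used asymptotic formulas for the renormalized ridge `kappa`, the excess test error, and the training error; the paper's own normalization may differ, and the function names and parameter values here are hypothetical.

```python
import numpy as np

def solve_kappa(spectrum, lam, P, iters=10_000, tol=1e-12):
    """Fixed-point iteration for kappa = lam + (kappa/P) * sum_i s_i / (s_i + kappa)."""
    kappa = lam
    for _ in range(iters):
        new = lam + kappa * np.sum(spectrum / (spectrum + kappa)) / P
        if abs(new - kappa) < tol:
            return new
        kappa = new
    return kappa

def predicted_errors(spectrum, w_sq, lam, P, sigma):
    """Asymptotic predictions; w_sq[i] is the squared target weight on eigen-direction i."""
    kappa = solve_kappa(spectrum, lam, P)
    gamma = np.sum(spectrum**2 / (spectrum + kappa) ** 2) / P
    bias = kappa**2 * np.sum(w_sq * spectrum / (spectrum + kappa) ** 2)
    excess = (bias + sigma**2 * gamma) / (1.0 - gamma)    # test error minus noise floor
    train = (lam / kappa) ** 2 * (excess + sigma**2)      # train = test / S^2, S = kappa/lam
    return kappa, excess, train

def simulate(spectrum, w_star, lam, P, sigma, trials, rng):
    """Direct ridge regression on synthetic data, averaged over independent draws."""
    N = len(spectrum)
    excess, train = [], []
    for _ in range(trials):
        X = rng.standard_normal((P, N)) * np.sqrt(spectrum)  # covariance = diag(spectrum)
        y = X @ w_star + sigma * rng.standard_normal(P)
        w_hat = np.linalg.solve(X.T @ X / P + lam * np.eye(N), X.T @ y / P)
        excess.append(np.sum(spectrum * (w_hat - w_star) ** 2))
        train.append(np.mean((y - X @ w_hat) ** 2))
    return np.mean(excess), np.mean(train)

rng = np.random.default_rng(0)
N, P, lam, sigma = 200, 400, 0.1, 0.5
spectrum = np.linspace(0.5, 1.5, N)            # mildly anisotropic, illustrative
w_star = rng.standard_normal(N) / np.sqrt(N)

kappa, th_excess, th_train = predicted_errors(spectrum, w_star**2, lam, P, sigma)
sim_excess, sim_train = simulate(spectrum, w_star, lam, P, sigma, trials=20, rng=rng)

print(f"kappa = {kappa:.4f} (bare ridge lam = {lam})")
print(f"excess test error: theory {th_excess:.4f}, simulation {sim_excess:.4f}")
print(f"training error   : theory {th_train:.4f}, simulation {sim_train:.4f}")
```

Agreement that tightens as N and P grow at fixed ratio would support the deterministic equivalence in this setting; repeating the experiment with non-Gaussian or dependent covariates would probe the premise stated above.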
Original abstract
From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks. This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This `deterministic equivalence' allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the $S$-transform of free probability. From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in high-dimensional ridge regression, statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter via deterministic equivalence. This yields closed-form asymptotic expressions for training and generalization errors through the S-transform of free probability, which also corresponds to the train-test gap and provides a GCV analogue. The framework is applied to derive bias-variance decompositions for a broad class of random feature models with structured (non-i.i.d.) covariates and anisotropic weights, identifying scaling regimes where feature variance or anisotropy limits overparameterized performance and producing power-law exponents.
Significance. If the deterministic equivalence and S-transform application hold for the claimed generality, the work supplies a compact analytic toolkit that unifies disparate scaling results in overparameterized ridge and random-feature models, directly linking covariance fluctuations to renormalized regularization and exposing sources of power-law behavior. The explicit identification of the S-transform with the generalization gap is a clean observation with potential for broader use. The paper credits its derivations to standard free-probability identities rather than ad-hoc fitting.
major comments (3)
- [§3.3, §4.1] The deterministic equivalence that absorbs covariance fluctuations into a renormalized ridge is stated for 'very general' structured covariates, yet the justification that the S-transform continues to encode the fluctuation renormalization without additional correction terms (required for asymptotic freeness) is not supplied for low-rank perturbations or feature correlations that violate standard Wishart/Marchenko-Pastur assumptions. This step is load-bearing for the unification claim and the subsequent bias-variance formulas.
- [§5.2, Eq. (27)–(29)] The finite-width correction exponents for anisotropic weights are derived under the renormalized ridge. However, the derivation assumes the anisotropy matrix commutes with the limiting covariance in a manner that preserves the S-transform factorization, which is not verified for the structured random-feature ensembles considered, undermining the claimed nontrivial exponents.
- [§4.3] The claim that the same S-transform directly supplies both the train-test gap and the GCV analogue for all models in the class rests on the deterministic equivalence holding uniformly. Without an explicit statement of the required independence or freeness conditions for the structured covariates, the extension from the i.i.d. case remains unverified.
minor comments (2)
- Notation for the S-transform is introduced without a self-contained definition or reference to the precise normalization used; a short appendix recalling the functional equation would improve readability for the target physics/DL audience.
- Figure 3 caption states 'variance due to features limits performance' but the plotted curves mix feature variance with label noise; separating the two contributions would clarify the scaling regime identified in the text.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. The points raised help clarify the scope of our assumptions. We respond to each major comment below and indicate where revisions will be made.
Point-by-point responses
- Referee: [§3.3, §4.1] First major comment of the Referee Report above: the justification that the S-transform needs no additional correction terms is not supplied for low-rank perturbations or feature correlations that violate standard Wishart/Marchenko-Pastur assumptions.
  Authors: The deterministic equivalence in §3.3 is derived from the resolvent analysis and holds for the broad class of structured covariates under the paper's stated moment and scaling conditions, which ensure asymptotic freeness. Low-rank perturbations contribute vanishing terms to the limiting spectrum and do not introduce corrections to the S-transform. We will revise the text in §3.3 and §4.1 to explicitly reference the relevant free-probability theorems justifying the absence of additional terms.
  Revision: yes
- Referee: [§5.2, Eq. (27)–(29)] Second major comment of the Referee Report above: the derivation assumes the anisotropy matrix commutes with the limiting covariance, which is not verified for the structured random-feature ensembles considered.
  Authors: The derivation assumes alignment between the anisotropy and covariance eigenbases to enable S-transform factorization, which is satisfied for the ensembles analyzed. We acknowledge that this does not cover completely arbitrary non-commuting structures. We will revise §5.2 to state this assumption explicitly and qualify the scope of the nontrivial exponents.
  Revision: partial
- Referee: [§4.3] Third major comment of the Referee Report above: the independence or freeness conditions needed to extend the train-test gap and GCV identification beyond the i.i.d. case are not stated explicitly.
  Authors: The uniform application follows from the freeness conditions already used to establish the deterministic equivalence in §3.3. We will revise §4.3 to include an explicit paragraph listing the independence and asymptotic freeness requirements for the structured covariates.
  Revision: yes
Circularity Check
No circularity: derivation applies standard free-probability S-transform to renormalized ridge
full rationale
The paper states that fluctuations are absorbed into a renormalized ridge parameter, after which analytic formulas follow from the properties of the S-transform of free probability. This S-transform is invoked as an external mathematical object whose properties are leveraged, not defined or fitted within the paper. No self-citations, ansatzes smuggled via prior work, or predictions that reduce to fitted inputs appear in the provided derivation outline. The approach is therefore self-contained against external benchmarks from random matrix theory.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Existence and algebraic properties of the S-transform for the relevant random matrix ensembles
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / Jcost functional equation · tag: echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This 'deterministic equivalence' allows us to obtain analytic formulas... by leveraging the properties of the S-transform"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection / coupling combiner renormalization · tag: echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "the change from λ to κ is exactly due to κ absorbing the contributions of the statistical fluctuations... analogous to how a renormalized mass term absorbs the quantum or thermal fluctuations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 11 Pith papers
- Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
  A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.
- Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
  A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
- Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario
  In an anisotropic random-matrix model of gradient flow, the teacher signal produces a transient BBP transition where the outlier eigenvalue emerges only in an intermediate time window before overfitting.
- Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
  A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructurin...
- A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
  Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
- Double Descent in Quantum Kernel Ridge Regression
  Quantum kernel ridge regression shows double descent in test risk, with the interpolation peak suppressible by regularization, via random matrix theory asymptotics in the high-dimensional limit.
- Renormalization group for spectral collapse in random matrices with power-law variance profiles
  A renormalization group scheme with running normalization collapses eigenvalue spectra of Wigner and Wishart matrices modified by power-law variance profiles, confirmed via fixed-point equations and simulations.
- Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
  Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
- Asymmetric Scaling Laws from Sparse Features
  A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.
- A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
  Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...