
arxiv: 2405.00592 · v4 · submitted 2024-05-01 · stat.ML · cond-mat.dis-nn · cs.LG

Scaling and renormalization in high-dimensional regression

Pith reviewed 2026-05-24 01:52 UTC · model grok-4.3

classification: stat.ML · cond-mat.dis-nn · cs.LG
keywords: ridge regression · high-dimensional statistics · random matrix theory · free probability · S-transform · generalization error · scaling laws · random features

The pith

Fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in high-dimensional ridge regression, random fluctuations around the population covariance are exactly equivalent to a shifted value of the regularization parameter. This deterministic mapping, obtained via the S-transform of free probability, produces closed-form expressions for both training and generalization error. The same object directly gives the train-test gap and an analogue of generalized cross-validation. The resulting asymptotics also isolate the sources of power-law scaling and extend to structured random-feature models, where feature variance or weight anisotropy can dominate in the overparameterized regime.

Core claim

Statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This deterministic equivalence allows analytic formulas for the training and generalization errors to be derived in a few lines via the S-transform of free probability. In all models the S-transform corresponds to the train-test generalization gap and supplies a natural estimator analogous to generalized cross-validation. The same machinery yields fine-grained bias-variance decompositions for random-feature models and identifies regimes in which feature variance or anisotropic weights set the leading performance limits.

What carries the argument

The S-transform of free probability, which encodes the deterministic equivalence obtained by renormalizing the ridge parameter.
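
For readers who want the object written out, the block below recalls one common textbook normalization of the S-transform and sketches how the ridge renormalization follows from it when the fluctuations come from a white Wishart matrix. Whether the paper uses exactly this normalization is an assumption; the symbols \tau, t_A, \zeta_A, W, q and \mathrm{df}_1 are notation chosen for this sketch.

```latex
% Normalized trace: \tau[\,\cdot\,] = \tfrac{1}{N}\operatorname{Tr}[\,\cdot\,].
\begin{align}
t_A(\zeta) &= \tau\!\left[A(\zeta - A)^{-1}\right], &
S_A(t) &= \frac{t+1}{t\,\zeta_A(t)}, \qquad \zeta_A = t_A^{-1}, \\
S_{AB}(t) &= S_A(t)\,S_B(t) \quad \text{for free } A,\ B, &
S_W(t) &= \frac{1}{1+qt} \quad \text{(white Wishart, } q = N/P\text{)}.
\end{align}
% Empirical covariance \hat\Sigma = \Sigma^{1/2} W \Sigma^{1/2}, with
% \mathrm{df}_1(\kappa) = \tau\!\left[\Sigma(\Sigma+\kappa)^{-1}\right] = -t_\Sigma(-\kappa).
% Multiplicativity gives \zeta_{\hat\Sigma}(t) = \zeta_\Sigma(t)/S_W(t); evaluating at \zeta = -\lambda:
\begin{equation}
\tau\!\left[\hat\Sigma(\hat\Sigma+\lambda)^{-1}\right]
\simeq \tau\!\left[\Sigma(\Sigma+\kappa)^{-1}\right],
\qquad
\kappa = \lambda\, S_W\!\left(-\mathrm{df}_1(\kappa)\right)
       = \frac{\lambda}{1 - q\,\mathrm{df}_1(\kappa)}.
\end{equation}
```

Read this way, the bare ridge \lambda is replaced by a larger effective ridge \kappa, and the ratio \kappa/\lambda is the value of the S-transform at the self-consistent point.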

If this is right

  • Exact asymptotic expressions for training and generalization error follow directly from the S-transform once the renormalized ridge parameter is identified.
  • The train-test gap is given exactly by the S-transform itself, supplying a theoretical analogue of generalized cross-validation.
  • Power-law scalings in performance arise from the analytic structure of the S-transform for different covariance ensembles.
  • In random-feature models the variance contributed by the random features themselves can become the performance bottleneck once the model is overparameterized.
  • Anisotropic structure in the random weights produces nontrivial exponents for finite-width corrections to the overparameterized limit.
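
The second bullet can be made concrete with a small simulation. The sketch below estimates the factor S = κ/λ from the training data alone, as 1 / (1 - Tr[Σ̂(Σ̂ + λ)^-1] / P), and uses S² times the training error as a GCV-style estimate of the out-of-sample risk. This is the standard generalized cross-validation form for ridge regression; that it matches the paper's estimator exactly is an assumption. The power-law spectrum, the all-ones teacher vector, and every function and variable name are illustrative choices, not taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator minimizing (1/P) * ||y - X w||^2 + lam * ||w||^2."""
    P, N = X.shape
    cov_hat = X.T @ X / P
    return np.linalg.solve(cov_hat + lam * np.eye(N), X.T @ y / P)

def empirical_s_factor(X, lam):
    """Data-only estimate of S = kappa / lam via the trace of the ridge hat matrix."""
    P, _ = X.shape
    eigs = np.linalg.eigvalsh(X.T @ X / P)
    df1_hat = np.sum(eigs / (eigs + lam)) / P
    return 1.0 / (1.0 - df1_hat)

rng = np.random.default_rng(0)
N, P, lam, sigma = 1000, 1500, 0.01, 0.5
spectrum = np.arange(1, N + 1, dtype=float) ** -1.0  # assumed covariance eigenvalues
w_star = np.ones(N)                                   # assumed teacher weights

# Gaussian covariates with diagonal covariance, plus label noise.
X = rng.standard_normal((P, N)) * np.sqrt(spectrum)
y = X @ w_star + sigma * rng.standard_normal(P)

w_hat = ridge_fit(X, y, lam)
train_err = np.mean((y - X @ w_hat) ** 2)
s_hat = empirical_s_factor(X, lam)
gcv_estimate = s_hat ** 2 * train_err

# Exact out-of-sample risk; computable here only because the simulation
# knows the population spectrum and the teacher.
true_out = np.sum(spectrum * (w_hat - w_star) ** 2) + sigma ** 2

print(f"S_hat = {s_hat:.3f}")
print(f"train = {train_err:.3f}, GCV estimate = {gcv_estimate:.3f}, true = {true_out:.3f}")
```

The estimate uses nothing beyond the training set and λ, which is what makes it a practical stand-in for held-out evaluation in this setting.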

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same renormalization perspective may apply to other quadratic estimators or losses whose population versions admit an S-transform description.
  • Practical regularization schedules could be improved by estimating the effective ridge shift from finite-sample covariance spectra.
  • If the deterministic equivalence survives in certain non-Gaussian or dependent-data settings, it would supply a route to scaling laws for more realistic feature distributions.

Load-bearing premise

The high-dimensional asymptotic regime holds and the data distribution permits the use of free probability, in particular the existence and algebraic properties of the S-transform for the relevant covariance ensembles.

What would settle it

Numerical experiments in large but finite dimension that compare the paper's predicted training and test errors, obtained from the renormalized ridge formula, against direct computation of ridge regression on the same synthetic data; systematic disagreement would falsify the deterministic equivalence.
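
A minimal version of such an experiment might look like the sketch below. It solves the self-consistent equation κ = λ + (κ/P) Σ_i s_i / (s_i + κ) for the renormalized ridge, evaluates the standard asymptotic ridge formulas for training and out-of-sample error, and compares them with direct ridge regression on Gaussian data. The formulas are written from the commonly stated ridge asymptotics rather than copied from the paper, so their agreement with the paper's notation is an assumption, as are the spectrum, teacher vector, sizes, and all names.

```python
import numpy as np

def solve_kappa(spectrum, lam, P, iters=200):
    """Bisection for kappa = lam + (kappa / P) * sum_i s_i / (s_i + kappa)."""
    lo, hi = lam, lam + spectrum.sum() / P  # these two values bracket the root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        resid = mid - lam - mid * np.sum(spectrum / (spectrum + mid)) / P
        if resid < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def predicted_errors(spectrum, w_star, lam, P, sigma):
    """Asymptotic training and out-of-sample error from the renormalized ridge."""
    kappa = solve_kappa(spectrum, lam, P)
    gamma = np.sum(spectrum ** 2 / (spectrum + kappa) ** 2) / P
    bias = kappa ** 2 * np.sum(spectrum * w_star ** 2 / (spectrum + kappa) ** 2)
    e_out = (bias + sigma ** 2) / (1.0 - gamma)   # includes the noise floor sigma^2
    e_tr = (lam / kappa) ** 2 * e_out
    return e_tr, e_out

def simulate(spectrum, w_star, lam, P, sigma, rng):
    """One draw of Gaussian data followed by direct ridge regression."""
    N = spectrum.size
    X = rng.standard_normal((P, N)) * np.sqrt(spectrum)
    y = X @ w_star + sigma * rng.standard_normal(P)
    w_hat = np.linalg.solve(X.T @ X / P + lam * np.eye(N), X.T @ y / P)
    e_tr = np.mean((y - X @ w_hat) ** 2)
    e_out = np.sum(spectrum * (w_hat - w_star) ** 2) + sigma ** 2
    return e_tr, e_out

rng = np.random.default_rng(0)
N, P, lam, sigma = 300, 600, 0.05, 0.5
spectrum = np.arange(1, N + 1, dtype=float) ** -1.0  # assumed power-law spectrum
w_star = np.ones(N)                                   # assumed teacher weights

theory_tr, theory_out = predicted_errors(spectrum, w_star, lam, P, sigma)
runs = np.array([simulate(spectrum, w_star, lam, P, sigma, rng) for _ in range(20)])
sim_tr, sim_out = runs.mean(axis=0)

print(f"train: theory {theory_tr:.3f}  simulation {sim_tr:.3f}")
print(f"test:  theory {theory_out:.3f}  simulation {sim_out:.3f}")
```

Repeating the comparison while growing N and P at fixed ratio, or swapping the Gaussian covariates for structured ones, is the kind of sweep in which a systematic gap would show up.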

Original abstract

From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks. This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This `deterministic equivalence' allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the $S$-transform of free probability. From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that in high-dimensional ridge regression, statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter via deterministic equivalence. This yields closed-form asymptotic expressions for training and generalization errors through the S-transform of free probability, which also equals the train-test gap and provides a GCV analogue. The framework is applied to derive bias-variance decompositions for a broad class of random feature models with structured (non-i.i.d.) covariates and anisotropic weights, identifying scaling regimes where feature variance or anisotropy limits overparameterized performance and producing power-law exponents.

Significance. If the deterministic equivalence and S-transform application hold for the claimed generality, the work supplies a compact analytic toolkit that unifies disparate scaling results in overparameterized ridge and random-feature models, directly linking covariance fluctuations to renormalized regularization and exposing sources of power-law behavior. The explicit identification of the S-transform with the generalization gap is a clean observation with potential for broader use. The paper credits its derivations to standard free-probability identities rather than ad-hoc fitting.

major comments (3)
  1. [§3.3, §4.1] The deterministic equivalence that absorbs covariance fluctuations into a renormalized ridge is stated for 'very general' structured covariates, yet the justification that the S-transform continues to encode the fluctuation renormalization without additional correction terms (required for asymptotic freeness) is not supplied for low-rank perturbations or feature correlations that violate standard Wishart/Marchenko-Pastur assumptions; this step is load-bearing for the unification claim and the subsequent bias-variance formulas.
  2. [§5.2, Eq. (27)–(29)] The finite-width correction exponents for anisotropic weights are derived under the renormalized ridge; however, the derivation assumes the anisotropy matrix commutes with the limiting covariance in a manner that preserves the S-transform factorization, which is not verified for the structured random-feature ensembles considered, undermining the claimed nontrivial exponents.
  3. [§4.3] The claim that the same S-transform directly supplies both the train-test gap and the GCV analogue for all models in the class rests on the deterministic equivalence holding uniformly; without an explicit statement of the required independence or freeness conditions for the structured covariates, the extension from the i.i.d. case remains unverified.
minor comments (2)
  1. Notation for the S-transform is introduced without a self-contained definition or reference to the precise normalization used; a short appendix recalling the functional equation would improve readability for the target physics/DL audience.
  2. Figure 3 caption states 'variance due to features limits performance' but the plotted curves mix feature variance with label noise; separating the two contributions would clarify the scaling regime identified in the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The points raised help clarify the scope of our assumptions. We respond to each major comment below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [§3.3, §4.1] The deterministic equivalence that absorbs covariance fluctuations into a renormalized ridge is stated for 'very general' structured covariates, yet the justification that the S-transform continues to encode the fluctuation renormalization without additional correction terms (required for asymptotic freeness) is not supplied for low-rank perturbations or feature correlations that violate standard Wishart/Marchenko-Pastur assumptions; this step is load-bearing for the unification claim and the subsequent bias-variance formulas.

    Authors: The deterministic equivalence in §3.3 is derived from the resolvent analysis and holds for the broad class of structured covariates under the paper's stated moment and scaling conditions, which ensure asymptotic freeness. Low-rank perturbations contribute vanishing terms to the limiting spectrum and do not introduce corrections to the S-transform. We will revise the text in §3.3 and §4.1 to explicitly reference the relevant free-probability theorems justifying the absence of additional terms. revision: yes

  2. Referee: [§5.2, Eq. (27)–(29)] The finite-width correction exponents for anisotropic weights are derived under the renormalized ridge; however, the derivation assumes the anisotropy matrix commutes with the limiting covariance in a manner that preserves the S-transform factorization, which is not verified for the structured random-feature ensembles considered, undermining the claimed nontrivial exponents.

    Authors: The derivation assumes alignment between the anisotropy and covariance eigenbases to enable S-transform factorization, which is satisfied for the ensembles analyzed. We acknowledge that this does not cover completely arbitrary non-commuting structures. We will revise §5.2 to state this assumption explicitly and qualify the scope of the nontrivial exponents. revision: partial

  3. Referee: [§4.3] The claim that the same S-transform directly supplies both the train-test gap and the GCV analogue for all models in the class rests on the deterministic equivalence holding uniformly; without an explicit statement of the required independence or freeness conditions for the structured covariates, the extension from the i.i.d. case remains unverified.

    Authors: The uniform application follows from the freeness conditions already used to establish the deterministic equivalence in §3.3. We will revise §4.3 to include an explicit paragraph listing the independence and asymptotic freeness requirements for the structured covariates. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies standard free-probability S-transform to renormalized ridge

Full rationale

The paper states that fluctuations are absorbed into a renormalized ridge parameter, after which analytic formulas follow from the properties of the S-transform of free probability. This S-transform is invoked as an external mathematical object whose properties are leveraged, not defined or fitted within the paper. No self-citations, ansatzes smuggled via prior work, or predictions that reduce to fitted inputs appear in the provided derivation outline. The approach is therefore self-contained against external benchmarks from random matrix theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the ledger is inferred from the described techniques. The paper relies on standard free-probability identities rather than introducing new fitted parameters or entities.

axioms (1)
  • standard math Existence and algebraic properties of the S-transform for the relevant random matrix ensembles
    Invoked to obtain closed-form expressions for errors and the generalization gap




Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

    stat.ML 2026-05 accept novelty 7.0

    A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.

  2. Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

    cond-mat.dis-nn 2026-05 unverdicted novelty 7.0

    A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.

  3. Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario

    stat.ML 2026-04 unverdicted novelty 7.0

    In an anisotropic random-matrix model of gradient flow, the teacher signal produces a transient BBP transition where the outlier eigenvalue emerges only in an intermediate time window before overfitting.

  4. Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

    cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

    A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructurin...

  5. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  6. Double Descent in Quantum Kernel Ridge Regression

    quant-ph 2026-04 unverdicted novelty 6.0

    Quantum kernel ridge regression shows double descent in test risk, with the interpolation peak suppressible by regularization, via random matrix theory asymptotics in the high-dimensional limit.

  7. Renormalization group for spectral collapse in random matrices with power-law variance profiles

    cond-mat.stat-mech 2025-12 unverdicted novelty 6.0

    A renormalization group scheme with running normalization collapses eigenvalue spectra of Wigner and Wishart matrices modified by power-law variance profiles, confirmed via fixed-point equations and simulations.

  8. Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

    cond-mat.dis-nn 2025-02 unverdicted novelty 6.0

    Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.

  9. Asymmetric Scaling Laws from Sparse Features

    stat.ML 2026-05 unverdicted novelty 5.0

    A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.

  10. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  11. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
