Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

Alexander Atanasov; Blake Bordelon; Cengiz Pehlevan; Courtney Paquette; Jacob A. Zavatone-Veth

arxiv: 2502.05074 · v3 · submitted 2025-02-07 · ❄️ cond-mat.dis-nn · cs.LG· stat.ML

Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

Alexander Atanasov , Blake Bordelon , Jacob A. Zavatone-Veth , Courtney Paquette , Cengiz Pehlevan This is my paper

Pith reviewed 2026-05-23 04:12 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn cs.LGstat.ML

keywords deterministic equivalencestochastic gradient descenthigh-dimensional asymptoticsrandom matrix theorylinear regressionkernel regressionrandom features

0 comments

The pith

A novel deterministic equivalence for the two-point resolvent function unifies the asymptotic analysis of stochastic gradient descent across high-dimensional linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a new identity that replaces the stochastic two-point function of a random matrix resolvent with a deterministic expression that can be computed from macroscopic parameters alone. This identity is applied to obtain exact high-dimensional asymptotics for the training and test performance of linear regression, kernel regression, and linear random feature models trained by stochastic gradient descent. The derivations recover previously known results for these models as special cases while also producing new predictions for their behavior under different noise levels and learning rates. A sympathetic reader would care because the same object now governs the dynamics of several distinct learning settings that previously required separate analyses.

Core claim

The central claim is that the two-point function of the resolvent of a random matrix admits a deterministic equivalent in the high-dimensional limit, and that this equivalent directly yields the performance of stochastic gradient descent for a broad class of linear models including regression, kernel methods, and random features, encompassing both known and previously unknown asymptotic expressions.

What carries the argument

The two-point deterministic equivalence for the random matrix resolvent function, which converts the stochastic two-point correlations arising in SGD dynamics into a closed deterministic form.

If this is right

A single derivation now covers the asymptotic performance of SGD for high-dimensional linear regression.
The same derivation covers kernel regression and linear random feature models.
Both previously known asymptotics and new ones for varying noise and step-size regimes follow from the equivalence.
The performance metrics are expressed in terms of the deterministic equivalent without needing to average over individual matrix realizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivalence may allow similar closed-form predictions for other first-order optimizers whose updates can be written in terms of the same resolvent structure.
If the two-point object can be generalized to non-Gaussian or structured data, the framework could extend beyond the linear and kernel settings treated here.
The deterministic reduction suggests that the leading-order behavior of these models is controlled by the spectrum of the data covariance rather than higher-order statistics of the noise.

Load-bearing premise

The system must be in the high-dimensional asymptotic regime where the two-point resolvent function admits a deterministic equivalent that does not depend on the specific realization of the random matrix.

What would settle it

A direct numerical computation, for large but finite dimension, of the two-point resolvent function under the paper's random matrix model that shows a persistent deviation from the derived deterministic expression as dimension grows.

read the original abstract

We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives a new two-point resolvent equivalence and uses it to unify SGD asymptotics across linear models in the high-dimensional limit.

read the letter

The main takeaway is that they establish a deterministic equivalence for the two-point function of a random matrix resolvent. This lets them handle the covariance structure needed for SGD dynamics in one framework instead of case-by-case arguments. They apply it to linear regression, kernel regression, and random feature models, recovering prior results and producing some new asymptotics along the way. The unification is the real advance here; it streamlines work that previously required separate derivations for each model class. The derivation follows the usual random matrix playbook but extends it to the two-point case, which appears to be the novel step. No obvious circularity or data-fitting issues show up in the setup. The high-dimensional regime is the standard assumption for these equivalences, and the stress-test note confirms nothing in the abstract violates the usual conditions. A minor limitation is that the work inherits the typical restrictions of asymptotic random matrix analysis: it may not translate directly when dimensions are moderate or when the data matrix deviates from the assumed randomness. That is not a flaw in the paper itself but a boundary on where the results apply. This is written for readers already comfortable with resolvent methods and high-dimensional statistics in machine learning. Someone tracking SGD theory in linear and kernel settings will get immediate use from the unified expressions and the new asymptotics. The technical engagement looks honest and the claims are grounded in the derivation rather than loose interpretation. It deserves a serious referee rather than a desk rejection.

Referee Report

0 major / 2 minor

Summary. The manuscript derives a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this equivalence, it provides a unified derivation of the asymptotic performance of high-dimensional linear models trained with stochastic gradient descent, covering linear regression, kernel regression, and linear random feature models. The results recover previously known asymptotics and introduce new ones in the high-dimensional regime.

Significance. If the central derivation holds, the work offers a significant technical contribution by supplying a parameter-free two-point resolvent equivalence that unifies the analysis of SGD dynamics across multiple linear models via random matrix theory. This framework could streamline future analyses in high-dimensional statistics and machine learning. The absence of free parameters and the derivation from first principles (as indicated by the axiom ledger) are explicit strengths.

minor comments (2)

[Introduction] The abstract is clear but the introduction would benefit from an explicit statement of the precise high-dimensional scaling assumptions (e.g., the relation between dimension, sample size, and step size) before the main theorem.
[Section 2] Notation for the two-point resolvent function should be introduced with a short display equation in §2 to aid readability for readers unfamiliar with the specific RMT objects.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of its contributions, and recommendation to accept. We appreciate the recognition of the parameter-free two-point resolvent equivalence and its unifying role across linear models trained by SGD.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's central claim is a novel deterministic equivalence for the two-point resolvent function derived via random matrix theory techniques, followed by its application to obtain asymptotics for SGD in linear models. No load-bearing step reduces by construction to a fit, a self-citation chain, or a renamed input; the derivation is presented as independent of the target performance quantities and relies on standard high-dimensional asymptotic assumptions rather than internal redefinitions. The abstract and structure indicate an external mathematical result used to unify known and new expressions, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The work appears to rest on standard high-dimensional random matrix assumptions that are not enumerated here.

pith-pipeline@v0.9.0 · 5603 in / 1122 out tokens · 66547 ms · 2026-05-23T04:12:35.910351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 6 internal anchors

[1]

explicitly vanishes in the high-dimensionalD→ ∞limit whenζ >0

work page
[2]

imaginary ridges

contributes negligibly after sufficient timetforζ= 0. In the first case, maintaining stable dynamics withη= Θ(1) requires choosing a batch size large enough such that χ= Θ(D −ζ) (see Section III D). With this choice, the two terms generated by SGD effects scale as ηχF F ⊤ ˆΣCt ˆΣF F ⊤ = Θ(D−ζ), ηχF F ⊤ ˆΣF F ⊤ Tr[Ct ˆΣ] = Θ(1), justifying neglecting the l...

work page
[3]

Thus, even in the case of linear regression at finitet, one requires two-point resolvents

When the dataset is finite,A= ˆΣand the population covarianceΣdoes not generally commute withA, regardless of whetherBis included. Thus, even in the case of linear regression at finitet, one requires two-point resolvents

work page
[4]

This is why the general finiteN, P, texpressions of Bordelon et

A random feature model at finite model sizeNand finite dataset sizePwill not haveA=F F ⊤ andB= ˆΣ commute. This is why the general finiteN, P, texpressions of Bordelon et. al [16] required computing two point resolvents that are functions of two frequenciesω, ω ′. At finiteNbut infiniteP,A=M=Σand one-point resolvents are in fact sufficient as in the work ...

work page
[5]

two-point

For covariate shift settings, the matrixM=Σ ′ does not generally commute withAandB. Even in the linear regression setting, whereB=I, one still requires a two-point deterministic equivalent in this case [26, 40]. In cases 1. and 2. thet→ ∞limits of the forcing term recover ridgeless regression, whose precise asymptotics does not require two-point equivalen...

work page
[6]

Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart

¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w. Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart. This is found in the standard literature, see e.g. [24]. Our notation convention in this section is thus: df1 ≡df 1 ˆΣ(ω)≃df 1 Σ(ω1),df ′ 1 ≡df 1 ˆΣ(ω′)≃df 1 Σ(ω′ 1) S≡ 1 1− D ...

work page
[7]

In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents

¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w + Z ω e2iωt χTr[Σ2(Σ+iω 1)−1] 1−χTr[Σ 2(Σ+iω 1)−1]−χiω 1(Ddf1)2/P ¯w⊤Σ(Σ+iω 1)−1 ¯w, respectively. In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents. C. Covariate Shift In this section, we show how the two-point deterministic equivalences allow for direct cal...

work page
[8]

Integrating Over Data Pushing throughΣ 1/2 on both sides, we apply the two-point equivalence (A.2) withM=IandA=Σ 1/2F F ⊤Σ1/2 to obtain: F(iω, iω ′)≃ SW S′ W 1−γ 1 ¯w⊤(iω1 +ΣF F ⊤)−1Σ(iω′ 1 +F F ⊤Σ)−1 ¯w for γ1 ≡ D P tr (ΣF F ⊤)2(iω1 +ΣF F ⊤)−1(iω′ 1 +ΣF F ⊤)−1

work page
[9]

We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D

Integrating over Features We will evaluateγ 1 separately since it concentrates. We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D. Here,M=I,A=Σ,B=F F ⊤. γ1 ≃ D P df2 + D P D N (iω2)(iω′ 2)tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]2 1− D N df2 . We now apply (IV.2) withM=I,A=Σ, andB=F F ⊤ to get: F(iω, iω ′)≃ SS ′ 1−γ 1 " ¯w⊤(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1 ¯w + ¯...

work page 2024
[10]

Deep Learning Scaling is Predictable, Empirically

J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[12]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark,et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Ghorbani, O

B. Ghorbani, O. Firat, M. Freitag, A. Bapna, M. Krikun, X. Garcia, C. Chelba, and C. Cherry, Scaling laws for neural machine translation, arXiv preprint arXiv:2109.07740 (2021)

work page arXiv 2021
[14]

Scaling Laws for Transfer

D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Scaling Laws and Interpretability of Learning from Repeated Data

D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume,et al., Scaling laws and interpretability of learning from repeated data, arXiv preprint arXiv:2205.10487 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

M. A. Gordon, K. Duh, and J. Kaplan, Data and parameter scaling laws for neural machine translation, inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing(2021) pp. 5915–5922

work page 2021
[17]

Muennighoff, A

N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, and C. A. Raffel, Scaling data-constrained language models, Advances in Neural Information Processing Systems36(2024)

work page 2024
[18]

X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, Scaling vision transformers, inProceedings of the IEEE/CVF conference on computer vision and pattern recognition(2022) pp. 12104–12113

work page 2022
[19]

I. M. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer, Getting ViT in shape: Scaling laws for compute-optimal model design, Advances in Neural Information Processing Systems36(2024)

work page 2024
[20]

Bachmann, S

G. Bachmann, S. Anagnostidis, and T. Hofmann, Scaling MLPs: A tale of inductive bias, Advances in Neural Information Processing Systems36(2024)

work page 2024
[21]

Everett, L

K. Everett, L. Xiao, M. Wortsman, A. A. Alemi, R. Novak, P. J. Liu, I. Gur, J. Sohl-Dickstein, L. P. Kaelbling, J. Lee, et al., Scaling exponents across parameterizations and optimizers, arXiv preprint arXiv:2407.05872 (2024)

work page arXiv 2024
[22]

Lin and L

J. Lin and L. Rosasco, Optimal rates for multi-pass stochastic gradient methods, Journal of Machine Learning Research 18, 1 (2017)

work page 2017
[23]

Pillaud-Vivien, A

L. Pillaud-Vivien, A. Rudi, and F. Bach, Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes, Advances in Neural Information Processing Systems31(2018)

work page 2018
[24]

Ying and M

Y. Ying and M. Pontil, Online gradient descent learning algorithms, Foundations of Computational Mathematics8, 561 (2008)

work page 2008
[25]

Bordelon, A

B. Bordelon, A. Atanasov, and C. Pehlevan, A dynamical model of neural scaling laws, inForty-first International Con- ference on Machine Learning(2024)

work page 2024
[26]

Paquette, C

E. Paquette, C. Paquette, L. Xiao, and J. Pennington, 4 + 3 phases of compute-optimal neural scaling laws, arXiv preprint arXiv:2405.15074 (2024)

work page arXiv 2024
[27]

L. Lin, J. Wu, S. M. Kakade, P. L. Bartlett, and J. D. Lee, Scaling laws in linear regression: Compute, parameters, and data, arXiv preprint arXiv:2406.08466 (2024)

work page arXiv 2024
[28]

Carratino, A

L. Carratino, A. Rudi, and L. Rosasco, Learning with sgd and random features, Advances in neural information processing systems31(2018). 17

work page 2018
[29]

Scaling and renormalization in high-dimensional regression

A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Scaling and renormalization in high-dimensional regression, arXiv preprint arXiv:2405.00592 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Bordelon and C

B. Bordelon and C. Pehlevan, Learning curves for SGD on structured features, inInternational Conference on Learning Representations(2022)

work page 2022
[31]

Zinn-Justin,Quantum field theory and critical phenomena, Vol

J. Zinn-Justin,Quantum field theory and critical phenomena, Vol. 171 (Oxford university press, 2021)

work page 2021
[32]

Rahimi and B

A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems20(2007)

work page 2007
[33]

Potters and J.-P

M. Potters and J.-P. Bouchaud,A first course in random matrix theory: for physicists, engineers and data scientists (Cambridge University Press, 2020)

work page 2020
[34]

Hastie, A

T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, The Annals of Statistics50, 949 (2022)

work page 2022
[35]

Atanasov, J

A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Risk and cross validation in ridge regression with correlated samples, arXiv preprint arXiv:2408.04607 (2024)

work page arXiv 2024
[36]

J. A. Zavatone-Veth and C. Pehlevan, Learning curves for deep structured Gaussian feature models, inAdvances in Neural Information Processing Systems(2023)

work page 2023
[37]

Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

F. Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

work page 2024
[38]

Knowles and J

A. Knowles and J. Yin, Anisotropic local laws for random matrices, Probability Theory and Related Fields169, 257 (2017)

work page 2017
[39]

Louart, Z

C. Louart, Z. Liao, and R. Couillet, A random matrix approach to neural networks, The Annals of Applied Probability 28, 1190 (2018)

work page 2018
[40]

Schr¨ oder, H

D. Schr¨ oder, H. Cui, D. Dmitriev, and B. Loureiro, Deterministic equivalent and error universality of deep random features learning, Journal of Statistical Mechanics: Theory and Experiment2024, 104017 (2024)

work page 2024
[41]

Defilippis, B

L. Defilippis, B. Loureiro, and T. Misiakiewicz, Dimension-free deterministic equivalents for random feature regression, arXiv preprint arXiv:2405.15699 (2024)

work page arXiv 2024
[42]

Misiakiewicz and B

T. Misiakiewicz and B. Saeed, A non-asymptotic theory of kernel ridge regression: deterministic equivalents, test error, and GCV estimator, arXiv preprint arXiv:2403.08938 (2024)

work page arXiv 2024
[43]

Canatar, B

A. Canatar, B. Bordelon, and C. Pehlevan, Spectral bias and task-model alignment explain generalization in kernel regres- sion and infinitely wide neural networks, Nature communications12, 2914 (2021)

work page 2021
[44]

Paquette, K

C. Paquette, K. Lee, F. Pedregosa, and E. Paquette, Sgd in the large: Average-case analysis, asymptotics, and stepsize criticality, inProceedings of Thirty Fourth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 134, edited by M. Belkin and S. Kpotufe (PMLR, 2021) pp. 3548–3626

work page 2021
[45]

Paquette, E

C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, inAdvances in Neural Information Processing Systems, Vol. 35, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 35984–35999

work page 2022
[46]

Paquette, E

C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, Advances in Neural Information Processing Systems35, 35984 (2022)

work page 2022
[47]

Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

A. Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

work page 2018
[48]

Loureiro, C

B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mezard, and L. Zdeborov´ a, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems34, 18137 (2021)

work page 2021
[49]

Patil, J.-H

P. Patil, J.-H. Du, and R. J. Tibshirani, Optimal ridge regularization for out-of-distribution prediction, arXiv (2024), arXiv:2404.01233 [math.ST]

work page arXiv 2024
[50]

Canatar, B

A. Canatar, B. Bordelon, and C. Pehlevan, Out-of-distribution generalization in kernel regression, inAdvances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 12600–12612

work page 2021
[51]

Adlam and J

B. Adlam and J. Pennington, The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization, inInternational Conference on Machine Learning(PMLR, 2020) pp. 74–84. 18 Appendix A: All Two-Point Deterministic Equivalences In this section, we report all variants of the two-point deterministic equivalences. These extend ...

work page 2020
[52]

In the case where M=I, the last two deterministic equivalences are equal

Sanity Check of Two Point Functions We now takeA=Σ,B=WforWa white Wishart and consider the matrix ˆΣ=Σ∗W. In the case where M=I, the last two deterministic equivalences are equal. Further, by takingλ=λ ′ so thatκ=κ ′ we get: ˆΣ( ˆΣ+λ) −2 ≃SΣ(Σ+κ) −2 −SκΣ(Σ+κ) −2 qtr[Σ(Σ+κ) −2] 1−γ =Σ(Σ+κ) −2S 1− −qκdf ′ 1 1−γ = 1 1−γ Σ(Σ+κ) −2S[1−γ+q(df 1 −df 2)] = 1 1−γ ...

work page
[53]

Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′)

Linear Regression Forcing Term We can write the empirical loss under gradient flow in terms of two time variables as: ˆF(t, t′)≡∆w ⊤ t ˆΣ∆wt′ = ¯w⊤e−t ˆΣ ˆΣe−t′ ˆΣ ¯w. Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′) . We now appl...

work page
[54]

Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]

SGD in Linear Regression Here, to match the gradient flow term, we apply two Fourier transforms to the kernel and apply the two-point deterministic equivalence. Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω...

work page
[55]

D P tr[Σ(Σ+iω 1)−1(Σ+iω ′ 1)−1]2 1−γ

work page
[56]

We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w

Linear Random Features Forcing Term The empirical loss under gradient flow in bi-frequency space is given by ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ(F F ⊤ ˆΣ+iω) −1(F F ⊤ ˆΣ+iω ′)−1 ¯w. We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w. Finally, we apply the deterministic equivalence (A.1) ...

work page
[57]

SGD in Linear Random Features We now compute deterministic equivalents for the kernel term, again defining the bi-temporal kernels by: K(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤Σ], ˆK(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω ′)K(ω, ω′), ˆK(t) =D Z ...

work page
[58]

Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

We recognize again that the numerator and denominator depend only onγ 1 ≡ D P df2 F F ⊤Σ, We have already computed equivalents for this in the prior section. Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

work page
[59]

Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated

D P tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1] 1−γ 1(iω1, iω′ 1) . Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated. It remains to apply a final deterministic equivalence (A.4) over the features to the last term to get: tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1]≃ tr[Σ(Σ+iω 2)−1(Σ+iω ′ 2)−1] 1− D N df2(iω2, iω′

work page
[60]

These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

Appendix C: Covariate Shift in Random Features In this section we write down the exact formula for the precise asymptotics of random feature regression when tested under covariate shift. These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

work page
[61]

Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w

Gradient Flow Term We are interested in evaluating the test error out-of-distribution. Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w. a. Data Average We apply (A.1) withM=A=ΣF F ⊤ to obtain. (iω+ ˆΣF F ⊤)−1ΣF F ⊤(iω′ + ˆΣF F ⊤)−1 ≃S W S′ W (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1 +S W SW γ...

work page
[62]

tr[(iω2 +Σ) −1Σ′(iω′ 2 +Σ) −1] tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]. 22 The remaining part of the forcing term that has not yet been evaluated can gain be rewritten via the push-through identity as: (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1(F F ⊤)−1 = (iω1 +ΣF F ⊤)−1Σ′(iω′ 1 +F F ⊤Σ)−1. We now apply (IV.2) withM=I,A=Σ,B=F F ⊤. After some rewriting, we obtain: FOOD...

work page
[63]

The train kernel is (by definition) unaffected by distribution shift

SGD Kernel Term By applying a one-point deterministic equivalence, the OOD test kernel is: KOOD(ω, ω′)≃Tr[Σ ′Σ(Σ+iω 2)−1] + iω2 N Tr[Σ′(Σ+iω 2)−1] Tr[Σ(Σ+iω 2)−1]. The train kernel is (by definition) unaffected by distribution shift

work page

[1] [1]

explicitly vanishes in the high-dimensionalD→ ∞limit whenζ >0

work page

[2] [2]

imaginary ridges

contributes negligibly after sufficient timetforζ= 0. In the first case, maintaining stable dynamics withη= Θ(1) requires choosing a batch size large enough such that χ= Θ(D −ζ) (see Section III D). With this choice, the two terms generated by SGD effects scale as ηχF F ⊤ ˆΣCt ˆΣF F ⊤ = Θ(D−ζ), ηχF F ⊤ ˆΣF F ⊤ Tr[Ct ˆΣ] = Θ(1), justifying neglecting the l...

work page

[3] [3]

Thus, even in the case of linear regression at finitet, one requires two-point resolvents

When the dataset is finite,A= ˆΣand the population covarianceΣdoes not generally commute withA, regardless of whetherBis included. Thus, even in the case of linear regression at finitet, one requires two-point resolvents

work page

[4] [4]

This is why the general finiteN, P, texpressions of Bordelon et

A random feature model at finite model sizeNand finite dataset sizePwill not haveA=F F ⊤ andB= ˆΣ commute. This is why the general finiteN, P, texpressions of Bordelon et. al [16] required computing two point resolvents that are functions of two frequenciesω, ω ′. At finiteNbut infiniteP,A=M=Σand one-point resolvents are in fact sufficient as in the work ...

work page

[5] [5]

two-point

For covariate shift settings, the matrixM=Σ ′ does not generally commute withAandB. Even in the linear regression setting, whereB=I, one still requires a two-point deterministic equivalent in this case [26, 40]. In cases 1. and 2. thet→ ∞limits of the forcing term recover ridgeless regression, whose precise asymptotics does not require two-point equivalen...

work page

[6] [6]

Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart

¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w. Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart. This is found in the standard literature, see e.g. [24]. Our notation convention in this section is thus: df1 ≡df 1 ˆΣ(ω)≃df 1 Σ(ω1),df ′ 1 ≡df 1 ˆΣ(ω′)≃df 1 Σ(ω′ 1) S≡ 1 1− D ...

work page

[7] [7]

In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents

¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w + Z ω e2iωt χTr[Σ2(Σ+iω 1)−1] 1−χTr[Σ 2(Σ+iω 1)−1]−χiω 1(Ddf1)2/P ¯w⊤Σ(Σ+iω 1)−1 ¯w, respectively. In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents. C. Covariate Shift In this section, we show how the two-point deterministic equivalences allow for direct cal...

work page

[8] [8]

Integrating Over Data Pushing throughΣ 1/2 on both sides, we apply the two-point equivalence (A.2) withM=IandA=Σ 1/2F F ⊤Σ1/2 to obtain: F(iω, iω ′)≃ SW S′ W 1−γ 1 ¯w⊤(iω1 +ΣF F ⊤)−1Σ(iω′ 1 +F F ⊤Σ)−1 ¯w for γ1 ≡ D P tr (ΣF F ⊤)2(iω1 +ΣF F ⊤)−1(iω′ 1 +ΣF F ⊤)−1

work page

[9] [9]

We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D

Integrating over Features We will evaluateγ 1 separately since it concentrates. We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D. Here,M=I,A=Σ,B=F F ⊤. γ1 ≃ D P df2 + D P D N (iω2)(iω′ 2)tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]2 1− D N df2 . We now apply (IV.2) withM=I,A=Σ, andB=F F ⊤ to get: F(iω, iω ′)≃ SS ′ 1−γ 1 " ¯w⊤(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1 ¯w + ¯...

work page 2024

[10] [10]

Deep Learning Scaling is Predictable, Empirically

J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[12] [12]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark,et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Ghorbani, O

B. Ghorbani, O. Firat, M. Freitag, A. Bapna, M. Krikun, X. Garcia, C. Chelba, and C. Cherry, Scaling laws for neural machine translation, arXiv preprint arXiv:2109.07740 (2021)

work page arXiv 2021

[14] [14]

Scaling Laws for Transfer

D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[15] [15]

Scaling Laws and Interpretability of Learning from Repeated Data

D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume,et al., Scaling laws and interpretability of learning from repeated data, arXiv preprint arXiv:2205.10487 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

M. A. Gordon, K. Duh, and J. Kaplan, Data and parameter scaling laws for neural machine translation, inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing(2021) pp. 5915–5922

work page 2021

[17] [17]

Muennighoff, A

N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, and C. A. Raffel, Scaling data-constrained language models, Advances in Neural Information Processing Systems36(2024)

work page 2024

[18] [18]

X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, Scaling vision transformers, inProceedings of the IEEE/CVF conference on computer vision and pattern recognition(2022) pp. 12104–12113

work page 2022

[19] [19]

I. M. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer, Getting ViT in shape: Scaling laws for compute-optimal model design, Advances in Neural Information Processing Systems36(2024)

work page 2024

[20] [20]

Bachmann, S

G. Bachmann, S. Anagnostidis, and T. Hofmann, Scaling MLPs: A tale of inductive bias, Advances in Neural Information Processing Systems36(2024)

work page 2024

[21] [21]

Everett, L

K. Everett, L. Xiao, M. Wortsman, A. A. Alemi, R. Novak, P. J. Liu, I. Gur, J. Sohl-Dickstein, L. P. Kaelbling, J. Lee, et al., Scaling exponents across parameterizations and optimizers, arXiv preprint arXiv:2407.05872 (2024)

work page arXiv 2024

[22] [22]

Lin and L

J. Lin and L. Rosasco, Optimal rates for multi-pass stochastic gradient methods, Journal of Machine Learning Research 18, 1 (2017)

work page 2017

[23] [23]

Pillaud-Vivien, A

L. Pillaud-Vivien, A. Rudi, and F. Bach, Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes, Advances in Neural Information Processing Systems31(2018)

work page 2018

[24] [24]

Ying and M

Y. Ying and M. Pontil, Online gradient descent learning algorithms, Foundations of Computational Mathematics8, 561 (2008)

work page 2008

[25] [25]

Bordelon, A

B. Bordelon, A. Atanasov, and C. Pehlevan, A dynamical model of neural scaling laws, inForty-first International Con- ference on Machine Learning(2024)

work page 2024

[26] [26]

Paquette, C

E. Paquette, C. Paquette, L. Xiao, and J. Pennington, 4 + 3 phases of compute-optimal neural scaling laws, arXiv preprint arXiv:2405.15074 (2024)

work page arXiv 2024

[27] [27]

L. Lin, J. Wu, S. M. Kakade, P. L. Bartlett, and J. D. Lee, Scaling laws in linear regression: Compute, parameters, and data, arXiv preprint arXiv:2406.08466 (2024)

work page arXiv 2024

[28] [28]

Carratino, A

L. Carratino, A. Rudi, and L. Rosasco, Learning with sgd and random features, Advances in neural information processing systems31(2018). 17

work page 2018

[29] [29]

Scaling and renormalization in high-dimensional regression

A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Scaling and renormalization in high-dimensional regression, arXiv preprint arXiv:2405.00592 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Bordelon and C

B. Bordelon and C. Pehlevan, Learning curves for SGD on structured features, inInternational Conference on Learning Representations(2022)

work page 2022

[31] [31]

Zinn-Justin,Quantum field theory and critical phenomena, Vol

J. Zinn-Justin,Quantum field theory and critical phenomena, Vol. 171 (Oxford university press, 2021)

work page 2021

[32] [32]

Rahimi and B

A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems20(2007)

work page 2007

[33] [33]

Potters and J.-P

M. Potters and J.-P. Bouchaud,A first course in random matrix theory: for physicists, engineers and data scientists (Cambridge University Press, 2020)

work page 2020

[34] [34]

Hastie, A

T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, The Annals of Statistics50, 949 (2022)

work page 2022

[35] [35]

Atanasov, J

A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Risk and cross validation in ridge regression with correlated samples, arXiv preprint arXiv:2408.04607 (2024)

work page arXiv 2024

[36] [36]

J. A. Zavatone-Veth and C. Pehlevan, Learning curves for deep structured Gaussian feature models, inAdvances in Neural Information Processing Systems(2023)

work page 2023

[37] [37]

Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

F. Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

work page 2024

[38] [38]

Knowles and J

A. Knowles and J. Yin, Anisotropic local laws for random matrices, Probability Theory and Related Fields169, 257 (2017)

work page 2017

[39] [39]

Louart, Z

C. Louart, Z. Liao, and R. Couillet, A random matrix approach to neural networks, The Annals of Applied Probability 28, 1190 (2018)

work page 2018

[40] [40]

Schr¨ oder, H

D. Schr¨ oder, H. Cui, D. Dmitriev, and B. Loureiro, Deterministic equivalent and error universality of deep random features learning, Journal of Statistical Mechanics: Theory and Experiment2024, 104017 (2024)

work page 2024

[41] [41]

Defilippis, B

L. Defilippis, B. Loureiro, and T. Misiakiewicz, Dimension-free deterministic equivalents for random feature regression, arXiv preprint arXiv:2405.15699 (2024)

work page arXiv 2024

[42] [42]

Misiakiewicz and B

T. Misiakiewicz and B. Saeed, A non-asymptotic theory of kernel ridge regression: deterministic equivalents, test error, and GCV estimator, arXiv preprint arXiv:2403.08938 (2024)

work page arXiv 2024

[43] [43]

Canatar, B

A. Canatar, B. Bordelon, and C. Pehlevan, Spectral bias and task-model alignment explain generalization in kernel regres- sion and infinitely wide neural networks, Nature communications12, 2914 (2021)

work page 2021

[44] [44]

Paquette, K

C. Paquette, K. Lee, F. Pedregosa, and E. Paquette, Sgd in the large: Average-case analysis, asymptotics, and stepsize criticality, inProceedings of Thirty Fourth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 134, edited by M. Belkin and S. Kpotufe (PMLR, 2021) pp. 3548–3626

work page 2021

[45] [45]

Paquette, E

C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, inAdvances in Neural Information Processing Systems, Vol. 35, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 35984–35999

work page 2022

[46] [46]

Paquette, E

C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, Advances in Neural Information Processing Systems35, 35984 (2022)

work page 2022

[47] [47]

Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

A. Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

work page 2018

[48] [48]

Loureiro, C

B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mezard, and L. Zdeborov´ a, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems34, 18137 (2021)

work page 2021

[49] [49]

Patil, J.-H

P. Patil, J.-H. Du, and R. J. Tibshirani, Optimal ridge regularization for out-of-distribution prediction, arXiv (2024), arXiv:2404.01233 [math.ST]

work page arXiv 2024

[50] [50]

Canatar, B

A. Canatar, B. Bordelon, and C. Pehlevan, Out-of-distribution generalization in kernel regression, inAdvances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 12600–12612

work page 2021

[51] [51]

Adlam and J

B. Adlam and J. Pennington, The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization, inInternational Conference on Machine Learning(PMLR, 2020) pp. 74–84. 18 Appendix A: All Two-Point Deterministic Equivalences In this section, we report all variants of the two-point deterministic equivalences. These extend ...

work page 2020

[52] [52]

In the case where M=I, the last two deterministic equivalences are equal

Sanity Check of Two Point Functions We now takeA=Σ,B=WforWa white Wishart and consider the matrix ˆΣ=Σ∗W. In the case where M=I, the last two deterministic equivalences are equal. Further, by takingλ=λ ′ so thatκ=κ ′ we get: ˆΣ( ˆΣ+λ) −2 ≃SΣ(Σ+κ) −2 −SκΣ(Σ+κ) −2 qtr[Σ(Σ+κ) −2] 1−γ =Σ(Σ+κ) −2S 1− −qκdf ′ 1 1−γ = 1 1−γ Σ(Σ+κ) −2S[1−γ+q(df 1 −df 2)] = 1 1−γ ...

work page

[53] [53]

Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′)

Linear Regression Forcing Term We can write the empirical loss under gradient flow in terms of two time variables as: ˆF(t, t′)≡∆w ⊤ t ˆΣ∆wt′ = ¯w⊤e−t ˆΣ ˆΣe−t′ ˆΣ ¯w. Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′) . We now appl...

work page

[54] [54]

Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]

SGD in Linear Regression Here, to match the gradient flow term, we apply two Fourier transforms to the kernel and apply the two-point deterministic equivalence. Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω...

work page

[55] [55]

D P tr[Σ(Σ+iω 1)−1(Σ+iω ′ 1)−1]2 1−γ

work page

[56] [56]

We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w

Linear Random Features Forcing Term The empirical loss under gradient flow in bi-frequency space is given by ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ(F F ⊤ ˆΣ+iω) −1(F F ⊤ ˆΣ+iω ′)−1 ¯w. We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w. Finally, we apply the deterministic equivalence (A.1) ...

work page

[57] [57]

SGD in Linear Random Features We now compute deterministic equivalents for the kernel term, again defining the bi-temporal kernels by: K(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤Σ], ˆK(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω ′)K(ω, ω′), ˆK(t) =D Z ...

work page

[58] [58]

Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

We recognize again that the numerator and denominator depend only onγ 1 ≡ D P df2 F F ⊤Σ, We have already computed equivalents for this in the prior section. Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

work page

[59] [59]

Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated

D P tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1] 1−γ 1(iω1, iω′ 1) . Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated. It remains to apply a final deterministic equivalence (A.4) over the features to the last term to get: tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1]≃ tr[Σ(Σ+iω 2)−1(Σ+iω ′ 2)−1] 1− D N df2(iω2, iω′

work page

[60] [60]

These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

Appendix C: Covariate Shift in Random Features In this section we write down the exact formula for the precise asymptotics of random feature regression when tested under covariate shift. These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

work page

[61] [61]

Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w

Gradient Flow Term We are interested in evaluating the test error out-of-distribution. Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w. a. Data Average We apply (A.1) withM=A=ΣF F ⊤ to obtain. (iω+ ˆΣF F ⊤)−1ΣF F ⊤(iω′ + ˆΣF F ⊤)−1 ≃S W S′ W (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1 +S W SW γ...

work page

[62] [62]

tr[(iω2 +Σ) −1Σ′(iω′ 2 +Σ) −1] tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]. 22 The remaining part of the forcing term that has not yet been evaluated can gain be rewritten via the push-through identity as: (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1(F F ⊤)−1 = (iω1 +ΣF F ⊤)−1Σ′(iω′ 1 +F F ⊤Σ)−1. We now apply (IV.2) withM=I,A=Σ,B=F F ⊤. After some rewriting, we obtain: FOOD...

work page

[63] [63]

The train kernel is (by definition) unaffected by distribution shift

SGD Kernel Term By applying a one-point deterministic equivalence, the OOD test kernel is: KOOD(ω, ω′)≃Tr[Σ ′Σ(Σ+iω 2)−1] + iω2 N Tr[Σ′(Σ+iω 2)−1] Tr[Σ(Σ+iω 2)−1]. The train kernel is (by definition) unaffected by distribution shift

work page