pith. sign in

arxiv: 2502.05074 · v3 · submitted 2025-02-07 · ❄️ cond-mat.dis-nn · cs.LG· stat.ML

Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

Pith reviewed 2026-05-23 04:12 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn cs.LGstat.ML
keywords deterministic equivalencestochastic gradient descenthigh-dimensional asymptoticsrandom matrix theorylinear regressionkernel regressionrandom features
0
0 comments X

The pith

A novel deterministic equivalence for the two-point resolvent function unifies the asymptotic analysis of stochastic gradient descent across high-dimensional linear models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a new identity that replaces the stochastic two-point function of a random matrix resolvent with a deterministic expression that can be computed from macroscopic parameters alone. This identity is applied to obtain exact high-dimensional asymptotics for the training and test performance of linear regression, kernel regression, and linear random feature models trained by stochastic gradient descent. The derivations recover previously known results for these models as special cases while also producing new predictions for their behavior under different noise levels and learning rates. A sympathetic reader would care because the same object now governs the dynamics of several distinct learning settings that previously required separate analyses.

Core claim

The central claim is that the two-point function of the resolvent of a random matrix admits a deterministic equivalent in the high-dimensional limit, and that this equivalent directly yields the performance of stochastic gradient descent for a broad class of linear models including regression, kernel methods, and random features, encompassing both known and previously unknown asymptotic expressions.

What carries the argument

The two-point deterministic equivalence for the random matrix resolvent function, which converts the stochastic two-point correlations arising in SGD dynamics into a closed deterministic form.

If this is right

  • A single derivation now covers the asymptotic performance of SGD for high-dimensional linear regression.
  • The same derivation covers kernel regression and linear random feature models.
  • Both previously known asymptotics and new ones for varying noise and step-size regimes follow from the equivalence.
  • The performance metrics are expressed in terms of the deterministic equivalent without needing to average over individual matrix realizations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The equivalence may allow similar closed-form predictions for other first-order optimizers whose updates can be written in terms of the same resolvent structure.
  • If the two-point object can be generalized to non-Gaussian or structured data, the framework could extend beyond the linear and kernel settings treated here.
  • The deterministic reduction suggests that the leading-order behavior of these models is controlled by the spectrum of the data covariance rather than higher-order statistics of the noise.

Load-bearing premise

The system must be in the high-dimensional asymptotic regime where the two-point resolvent function admits a deterministic equivalent that does not depend on the specific realization of the random matrix.

What would settle it

A direct numerical computation, for large but finite dimension, of the two-point resolvent function under the paper's random matrix model that shows a persistent deviation from the derived deterministic expression as dimension grows.

read the original abstract

We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript derives a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this equivalence, it provides a unified derivation of the asymptotic performance of high-dimensional linear models trained with stochastic gradient descent, covering linear regression, kernel regression, and linear random feature models. The results recover previously known asymptotics and introduce new ones in the high-dimensional regime.

Significance. If the central derivation holds, the work offers a significant technical contribution by supplying a parameter-free two-point resolvent equivalence that unifies the analysis of SGD dynamics across multiple linear models via random matrix theory. This framework could streamline future analyses in high-dimensional statistics and machine learning. The absence of free parameters and the derivation from first principles (as indicated by the axiom ledger) are explicit strengths.

minor comments (2)
  1. [Introduction] The abstract is clear but the introduction would benefit from an explicit statement of the precise high-dimensional scaling assumptions (e.g., the relation between dimension, sample size, and step size) before the main theorem.
  2. [Section 2] Notation for the two-point resolvent function should be introduced with a short display equation in §2 to aid readability for readers unfamiliar with the specific RMT objects.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of its contributions, and recommendation to accept. We appreciate the recognition of the parameter-free two-point resolvent equivalence and its unifying role across linear models trained by SGD.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's central claim is a novel deterministic equivalence for the two-point resolvent function derived via random matrix theory techniques, followed by its application to obtain asymptotics for SGD in linear models. No load-bearing step reduces by construction to a fit, a self-citation chain, or a renamed input; the derivation is presented as independent of the target performance quantities and relies on standard high-dimensional asymptotic assumptions rather than internal redefinitions. The abstract and structure indicate an external mathematical result used to unify known and new expressions, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The work appears to rest on standard high-dimensional random matrix assumptions that are not enumerated here.

pith-pipeline@v0.9.0 · 5603 in / 1122 out tokens · 66547 ms · 2026-05-23T04:12:35.910351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 6 internal anchors

  1. [1]

    explicitly vanishes in the high-dimensionalD→ ∞limit whenζ >0

  2. [2]

    imaginary ridges

    contributes negligibly after sufficient timetforζ= 0. In the first case, maintaining stable dynamics withη= Θ(1) requires choosing a batch size large enough such that χ= Θ(D −ζ) (see Section III D). With this choice, the two terms generated by SGD effects scale as ηχF F ⊤ ˆΣCt ˆΣF F ⊤ = Θ(D−ζ), ηχF F ⊤ ˆΣF F ⊤ Tr[Ct ˆΣ] = Θ(1), justifying neglecting the l...

  3. [3]

    Thus, even in the case of linear regression at finitet, one requires two-point resolvents

    When the dataset is finite,A= ˆΣand the population covarianceΣdoes not generally commute withA, regardless of whetherBis included. Thus, even in the case of linear regression at finitet, one requires two-point resolvents

  4. [4]

    This is why the general finiteN, P, texpressions of Bordelon et

    A random feature model at finite model sizeNand finite dataset sizePwill not haveA=F F ⊤ andB= ˆΣ commute. This is why the general finiteN, P, texpressions of Bordelon et. al [16] required computing two point resolvents that are functions of two frequenciesω, ω ′. At finiteNbut infiniteP,A=M=Σand one-point resolvents are in fact sufficient as in the work ...

  5. [5]

    two-point

    For covariate shift settings, the matrixM=Σ ′ does not generally commute withAandB. Even in the linear regression setting, whereB=I, one still requires a two-point deterministic equivalent in this case [26, 40]. In cases 1. and 2. thet→ ∞limits of the forcing term recover ridgeless regression, whose precise asymptotics does not require two-point equivalen...

  6. [6]

    Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart

    ¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w. Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart. This is found in the standard literature, see e.g. [24]. Our notation convention in this section is thus: df1 ≡df 1 ˆΣ(ω)≃df 1 Σ(ω1),df ′ 1 ≡df 1 ˆΣ(ω′)≃df 1 Σ(ω′ 1) S≡ 1 1− D ...

  7. [7]

    In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents

    ¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w + Z ω e2iωt χTr[Σ2(Σ+iω 1)−1] 1−χTr[Σ 2(Σ+iω 1)−1]−χiω 1(Ddf1)2/P ¯w⊤Σ(Σ+iω 1)−1 ¯w, respectively. In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents. C. Covariate Shift In this section, we show how the two-point deterministic equivalences allow for direct cal...

  8. [8]

    Integrating Over Data Pushing throughΣ 1/2 on both sides, we apply the two-point equivalence (A.2) withM=IandA=Σ 1/2F F ⊤Σ1/2 to obtain: F(iω, iω ′)≃ SW S′ W 1−γ 1 ¯w⊤(iω1 +ΣF F ⊤)−1Σ(iω′ 1 +F F ⊤Σ)−1 ¯w for γ1 ≡ D P tr (ΣF F ⊤)2(iω1 +ΣF F ⊤)−1(iω′ 1 +ΣF F ⊤)−1

  9. [9]

    We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D

    Integrating over Features We will evaluateγ 1 separately since it concentrates. We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D. Here,M=I,A=Σ,B=F F ⊤. γ1 ≃ D P df2 + D P D N (iω2)(iω′ 2)tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]2 1− D N df2 . We now apply (IV.2) withM=I,A=Σ, andB=F F ⊤ to get: F(iω, iω ′)≃ SS ′ 1−γ 1 " ¯w⊤(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1 ¯w + ¯...

  10. [10]

    Deep Learning Scaling is Predictable, Empirically

    J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017)

  11. [11]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

  12. [12]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark,et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)

  13. [13]

    Ghorbani, O

    B. Ghorbani, O. Firat, M. Freitag, A. Bapna, M. Krikun, X. Garcia, C. Chelba, and C. Cherry, Scaling laws for neural machine translation, arXiv preprint arXiv:2109.07740 (2021)

  14. [14]

    Scaling Laws for Transfer

    D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021)

  15. [15]

    Scaling Laws and Interpretability of Learning from Repeated Data

    D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume,et al., Scaling laws and interpretability of learning from repeated data, arXiv preprint arXiv:2205.10487 (2022)

  16. [16]

    M. A. Gordon, K. Duh, and J. Kaplan, Data and parameter scaling laws for neural machine translation, inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing(2021) pp. 5915–5922

  17. [17]

    Muennighoff, A

    N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, and C. A. Raffel, Scaling data-constrained language models, Advances in Neural Information Processing Systems36(2024)

  18. [18]

    X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, Scaling vision transformers, inProceedings of the IEEE/CVF conference on computer vision and pattern recognition(2022) pp. 12104–12113

  19. [19]

    I. M. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer, Getting ViT in shape: Scaling laws for compute-optimal model design, Advances in Neural Information Processing Systems36(2024)

  20. [20]

    Bachmann, S

    G. Bachmann, S. Anagnostidis, and T. Hofmann, Scaling MLPs: A tale of inductive bias, Advances in Neural Information Processing Systems36(2024)

  21. [21]

    Everett, L

    K. Everett, L. Xiao, M. Wortsman, A. A. Alemi, R. Novak, P. J. Liu, I. Gur, J. Sohl-Dickstein, L. P. Kaelbling, J. Lee, et al., Scaling exponents across parameterizations and optimizers, arXiv preprint arXiv:2407.05872 (2024)

  22. [22]

    Lin and L

    J. Lin and L. Rosasco, Optimal rates for multi-pass stochastic gradient methods, Journal of Machine Learning Research 18, 1 (2017)

  23. [23]

    Pillaud-Vivien, A

    L. Pillaud-Vivien, A. Rudi, and F. Bach, Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes, Advances in Neural Information Processing Systems31(2018)

  24. [24]

    Ying and M

    Y. Ying and M. Pontil, Online gradient descent learning algorithms, Foundations of Computational Mathematics8, 561 (2008)

  25. [25]

    Bordelon, A

    B. Bordelon, A. Atanasov, and C. Pehlevan, A dynamical model of neural scaling laws, inForty-first International Con- ference on Machine Learning(2024)

  26. [26]

    Paquette, C

    E. Paquette, C. Paquette, L. Xiao, and J. Pennington, 4 + 3 phases of compute-optimal neural scaling laws, arXiv preprint arXiv:2405.15074 (2024)

  27. [27]

    L. Lin, J. Wu, S. M. Kakade, P. L. Bartlett, and J. D. Lee, Scaling laws in linear regression: Compute, parameters, and data, arXiv preprint arXiv:2406.08466 (2024)

  28. [28]

    Carratino, A

    L. Carratino, A. Rudi, and L. Rosasco, Learning with sgd and random features, Advances in neural information processing systems31(2018). 17

  29. [29]

    Scaling and renormalization in high-dimensional regression

    A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Scaling and renormalization in high-dimensional regression, arXiv preprint arXiv:2405.00592 (2024)

  30. [30]

    Bordelon and C

    B. Bordelon and C. Pehlevan, Learning curves for SGD on structured features, inInternational Conference on Learning Representations(2022)

  31. [31]

    Zinn-Justin,Quantum field theory and critical phenomena, Vol

    J. Zinn-Justin,Quantum field theory and critical phenomena, Vol. 171 (Oxford university press, 2021)

  32. [32]

    Rahimi and B

    A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems20(2007)

  33. [33]

    Potters and J.-P

    M. Potters and J.-P. Bouchaud,A first course in random matrix theory: for physicists, engineers and data scientists (Cambridge University Press, 2020)

  34. [34]

    Hastie, A

    T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, The Annals of Statistics50, 949 (2022)

  35. [35]

    Atanasov, J

    A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Risk and cross validation in ridge regression with correlated samples, arXiv preprint arXiv:2408.04607 (2024)

  36. [36]

    J. A. Zavatone-Veth and C. Pehlevan, Learning curves for deep structured Gaussian feature models, inAdvances in Neural Information Processing Systems(2023)

  37. [37]

    Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

    F. Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)

  38. [38]

    Knowles and J

    A. Knowles and J. Yin, Anisotropic local laws for random matrices, Probability Theory and Related Fields169, 257 (2017)

  39. [39]

    Louart, Z

    C. Louart, Z. Liao, and R. Couillet, A random matrix approach to neural networks, The Annals of Applied Probability 28, 1190 (2018)

  40. [40]

    Schr¨ oder, H

    D. Schr¨ oder, H. Cui, D. Dmitriev, and B. Loureiro, Deterministic equivalent and error universality of deep random features learning, Journal of Statistical Mechanics: Theory and Experiment2024, 104017 (2024)

  41. [41]

    Defilippis, B

    L. Defilippis, B. Loureiro, and T. Misiakiewicz, Dimension-free deterministic equivalents for random feature regression, arXiv preprint arXiv:2405.15699 (2024)

  42. [42]

    Misiakiewicz and B

    T. Misiakiewicz and B. Saeed, A non-asymptotic theory of kernel ridge regression: deterministic equivalents, test error, and GCV estimator, arXiv preprint arXiv:2403.08938 (2024)

  43. [43]

    Canatar, B

    A. Canatar, B. Bordelon, and C. Pehlevan, Spectral bias and task-model alignment explain generalization in kernel regres- sion and infinitely wide neural networks, Nature communications12, 2914 (2021)

  44. [44]

    Paquette, K

    C. Paquette, K. Lee, F. Pedregosa, and E. Paquette, Sgd in the large: Average-case analysis, asymptotics, and stepsize criticality, inProceedings of Thirty Fourth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 134, edited by M. Belkin and S. Kpotufe (PMLR, 2021) pp. 3548–3626

  45. [45]

    Paquette, E

    C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, inAdvances in Neural Information Processing Systems, Vol. 35, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 35984–35999

  46. [46]

    Paquette, E

    C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, Advances in Neural Information Processing Systems35, 35984 (2022)

  47. [47]

    Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

    A. Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)

  48. [48]

    Loureiro, C

    B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mezard, and L. Zdeborov´ a, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems34, 18137 (2021)

  49. [49]

    Patil, J.-H

    P. Patil, J.-H. Du, and R. J. Tibshirani, Optimal ridge regularization for out-of-distribution prediction, arXiv (2024), arXiv:2404.01233 [math.ST]

  50. [50]

    Canatar, B

    A. Canatar, B. Bordelon, and C. Pehlevan, Out-of-distribution generalization in kernel regression, inAdvances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 12600–12612

  51. [51]

    Adlam and J

    B. Adlam and J. Pennington, The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization, inInternational Conference on Machine Learning(PMLR, 2020) pp. 74–84. 18 Appendix A: All Two-Point Deterministic Equivalences In this section, we report all variants of the two-point deterministic equivalences. These extend ...

  52. [52]

    In the case where M=I, the last two deterministic equivalences are equal

    Sanity Check of Two Point Functions We now takeA=Σ,B=WforWa white Wishart and consider the matrix ˆΣ=Σ∗W. In the case where M=I, the last two deterministic equivalences are equal. Further, by takingλ=λ ′ so thatκ=κ ′ we get: ˆΣ( ˆΣ+λ) −2 ≃SΣ(Σ+κ) −2 −SκΣ(Σ+κ) −2 qtr[Σ(Σ+κ) −2] 1−γ =Σ(Σ+κ) −2S 1− −qκdf ′ 1 1−γ = 1 1−γ Σ(Σ+κ) −2S[1−γ+q(df 1 −df 2)] = 1 1−γ ...

  53. [53]

    Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′)

    Linear Regression Forcing Term We can write the empirical loss under gradient flow in terms of two time variables as: ˆF(t, t′)≡∆w ⊤ t ˆΣ∆wt′ = ¯w⊤e−t ˆΣ ˆΣe−t′ ˆΣ ¯w. Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′) . We now appl...

  54. [54]

    Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]

    SGD in Linear Regression Here, to match the gradient flow term, we apply two Fourier transforms to the kernel and apply the two-point deterministic equivalence. Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω...

  55. [55]

    D P tr[Σ(Σ+iω 1)−1(Σ+iω ′ 1)−1]2 1−γ

  56. [56]

    We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w

    Linear Random Features Forcing Term The empirical loss under gradient flow in bi-frequency space is given by ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ(F F ⊤ ˆΣ+iω) −1(F F ⊤ ˆΣ+iω ′)−1 ¯w. We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w. Finally, we apply the deterministic equivalence (A.1) ...

  57. [57]

    SGD in Linear Random Features We now compute deterministic equivalents for the kernel term, again defining the bi-temporal kernels by: K(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤Σ], ˆK(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω ′)K(ω, ω′), ˆK(t) =D Z ...

  58. [58]

    Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

    We recognize again that the numerator and denominator depend only onγ 1 ≡ D P df2 F F ⊤Σ, We have already computed equivalents for this in the prior section. Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′

  59. [59]

    Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated

    D P tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1] 1−γ 1(iω1, iω′ 1) . Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated. It remains to apply a final deterministic equivalence (A.4) over the features to the last term to get: tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1]≃ tr[Σ(Σ+iω 2)−1(Σ+iω ′ 2)−1] 1− D N df2(iω2, iω′

  60. [60]

    These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

    Appendix C: Covariate Shift in Random Features In this section we write down the exact formula for the precise asymptotics of random feature regression when tested under covariate shift. These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text

  61. [61]

    Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w

    Gradient Flow Term We are interested in evaluating the test error out-of-distribution. Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w. a. Data Average We apply (A.1) withM=A=ΣF F ⊤ to obtain. (iω+ ˆΣF F ⊤)−1ΣF F ⊤(iω′ + ˆΣF F ⊤)−1 ≃S W S′ W (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1 +S W SW γ...

  62. [62]

    tr[(iω2 +Σ) −1Σ′(iω′ 2 +Σ) −1] tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]. 22 The remaining part of the forcing term that has not yet been evaluated can gain be rewritten via the push-through identity as: (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1(F F ⊤)−1 = (iω1 +ΣF F ⊤)−1Σ′(iω′ 1 +F F ⊤Σ)−1. We now apply (IV.2) withM=I,A=Σ,B=F F ⊤. After some rewriting, we obtain: FOOD...

  63. [63]

    The train kernel is (by definition) unaffected by distribution shift

    SGD Kernel Term By applying a one-point deterministic equivalence, the OOD test kernel is: KOOD(ω, ω′)≃Tr[Σ ′Σ(Σ+iω 2)−1] + iω2 N Tr[Σ′(Σ+iω 2)−1] Tr[Σ(Σ+iω 2)−1]. The train kernel is (by definition) unaffected by distribution shift