Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Pith reviewed 2026-05-23 04:12 UTC · model grok-4.3
The pith
A novel deterministic equivalence for the two-point resolvent function unifies the asymptotic analysis of stochastic gradient descent across high-dimensional linear models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the two-point function of the resolvent of a random matrix admits a deterministic equivalent in the high-dimensional limit, and that this equivalent directly yields the performance of stochastic gradient descent for a broad class of linear models including regression, kernel methods, and random features, encompassing both known and previously unknown asymptotic expressions.
What carries the argument
The two-point deterministic equivalence for the random matrix resolvent function, which converts the stochastic two-point correlations arising in SGD dynamics into a closed deterministic form.
If this is right
- A single derivation now covers the asymptotic performance of SGD for high-dimensional linear regression.
- The same derivation covers kernel regression and linear random feature models.
- Both previously known asymptotics and new ones for varying noise and step-size regimes follow from the equivalence.
- The performance metrics are expressed in terms of the deterministic equivalent without needing to average over individual matrix realizations.
Where Pith is reading between the lines
- The equivalence may allow similar closed-form predictions for other first-order optimizers whose updates can be written in terms of the same resolvent structure.
- If the two-point object can be generalized to non-Gaussian or structured data, the framework could extend beyond the linear and kernel settings treated here.
- The deterministic reduction suggests that the leading-order behavior of these models is controlled by the spectrum of the data covariance rather than higher-order statistics of the noise.
Load-bearing premise
The system must be in the high-dimensional asymptotic regime where the two-point resolvent function admits a deterministic equivalent that does not depend on the specific realization of the random matrix.
What would settle it
A direct numerical computation, for large but finite dimension, of the two-point resolvent function under the paper's random matrix model that shows a persistent deviation from the derived deterministic expression as dimension grows.
read the original abstract
We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this equivalence, it provides a unified derivation of the asymptotic performance of high-dimensional linear models trained with stochastic gradient descent, covering linear regression, kernel regression, and linear random feature models. The results recover previously known asymptotics and introduce new ones in the high-dimensional regime.
Significance. If the central derivation holds, the work offers a significant technical contribution by supplying a parameter-free two-point resolvent equivalence that unifies the analysis of SGD dynamics across multiple linear models via random matrix theory. This framework could streamline future analyses in high-dimensional statistics and machine learning. The absence of free parameters and the derivation from first principles (as indicated by the axiom ledger) are explicit strengths.
minor comments (2)
- [Introduction] The abstract is clear but the introduction would benefit from an explicit statement of the precise high-dimensional scaling assumptions (e.g., the relation between dimension, sample size, and step size) before the main theorem.
- [Section 2] Notation for the two-point resolvent function should be introduced with a short display equation in §2 to aid readability for readers unfamiliar with the specific RMT objects.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of its contributions, and recommendation to accept. We appreciate the recognition of the parameter-free two-point resolvent equivalence and its unifying role across linear models trained by SGD.
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper's central claim is a novel deterministic equivalence for the two-point resolvent function derived via random matrix theory techniques, followed by its application to obtain asymptotics for SGD in linear models. No load-bearing step reduces by construction to a fit, a self-citation chain, or a renamed input; the derivation is presented as independent of the target performance quantities and relies on standard high-dimensional asymptotic assumptions rather than internal redefinitions. The abstract and structure indicate an external mathematical result used to unify known and new expressions, with no evidence of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
explicitly vanishes in the high-dimensionalD→ ∞limit whenζ >0
-
[2]
contributes negligibly after sufficient timetforζ= 0. In the first case, maintaining stable dynamics withη= Θ(1) requires choosing a batch size large enough such that χ= Θ(D −ζ) (see Section III D). With this choice, the two terms generated by SGD effects scale as ηχF F ⊤ ˆΣCt ˆΣF F ⊤ = Θ(D−ζ), ηχF F ⊤ ˆΣF F ⊤ Tr[Ct ˆΣ] = Θ(1), justifying neglecting the l...
-
[3]
Thus, even in the case of linear regression at finitet, one requires two-point resolvents
When the dataset is finite,A= ˆΣand the population covarianceΣdoes not generally commute withA, regardless of whetherBis included. Thus, even in the case of linear regression at finitet, one requires two-point resolvents
-
[4]
This is why the general finiteN, P, texpressions of Bordelon et
A random feature model at finite model sizeNand finite dataset sizePwill not haveA=F F ⊤ andB= ˆΣ commute. This is why the general finiteN, P, texpressions of Bordelon et. al [16] required computing two point resolvents that are functions of two frequenciesω, ω ′. At finiteNbut infiniteP,A=M=Σand one-point resolvents are in fact sufficient as in the work ...
-
[5]
For covariate shift settings, the matrixM=Σ ′ does not generally commute withAandB. Even in the linear regression setting, whereB=I, one still requires a two-point deterministic equivalent in this case [26, 40]. In cases 1. and 2. thet→ ∞limits of the forcing term recover ridgeless regression, whose precise asymptotics does not require two-point equivalen...
-
[6]
¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w. Here, because ˆΣ=Σ∗WforWa white Wishart, the renormalization of each frequency is given by the multiplication of theS-transform of a white Wishart. This is found in the standard literature, see e.g. [24]. Our notation convention in this section is thus: df1 ≡df 1 ˆΣ(ω)≃df 1 Σ(ω1),df ′ 1 ≡df 1 ˆΣ(ω′)≃df 1 Σ(ω′ 1) S≡ 1 1− D ...
-
[7]
¯w⊤(iω1 +Σ) −1Σ(iω′ 1 +Σ) −1 ¯w + Z ω e2iωt χTr[Σ2(Σ+iω 1)−1] 1−χTr[Σ 2(Σ+iω 1)−1]−χiω 1(Ddf1)2/P ¯w⊤Σ(Σ+iω 1)−1 ¯w, respectively. In Appendix B 2, we compute the SGD terms in terms of double frequency Fourier transforms and two-point equivalents. C. Covariate Shift In this section, we show how the two-point deterministic equivalences allow for direct cal...
-
[8]
Integrating Over Data Pushing throughΣ 1/2 on both sides, we apply the two-point equivalence (A.2) withM=IandA=Σ 1/2F F ⊤Σ1/2 to obtain: F(iω, iω ′)≃ SW S′ W 1−γ 1 ¯w⊤(iω1 +ΣF F ⊤)−1Σ(iω′ 1 +F F ⊤Σ)−1 ¯w for γ1 ≡ D P tr (ΣF F ⊤)2(iω1 +ΣF F ⊤)−1(iω′ 1 +ΣF F ⊤)−1
-
[9]
We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D
Integrating over Features We will evaluateγ 1 separately since it concentrates. We apply Equation (A.3) for aGram Wishartwithq= D/N, q−1 =N/D. Here,M=I,A=Σ,B=F F ⊤. γ1 ≃ D P df2 + D P D N (iω2)(iω′ 2)tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]2 1− D N df2 . We now apply (IV.2) withM=I,A=Σ, andB=F F ⊤ to get: F(iω, iω ′)≃ SS ′ 1−γ 1 " ¯w⊤(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1 ¯w + ¯...
work page 2024
-
[10]
Deep Learning Scaling is Predictable, Empirically
J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[11]
Scaling Laws for Neural Language Models
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[12]
Training Compute-Optimal Large Language Models
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark,et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[13]
B. Ghorbani, O. Firat, M. Freitag, A. Bapna, M. Krikun, X. Garcia, C. Chelba, and C. Cherry, Scaling laws for neural machine translation, arXiv preprint arXiv:2109.07740 (2021)
-
[14]
D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
Scaling Laws and Interpretability of Learning from Repeated Data
D. Hernandez, T. Brown, T. Conerly, N. DasSarma, D. Drain, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume,et al., Scaling laws and interpretability of learning from repeated data, arXiv preprint arXiv:2205.10487 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
M. A. Gordon, K. Duh, and J. Kaplan, Data and parameter scaling laws for neural machine translation, inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing(2021) pp. 5915–5922
work page 2021
-
[17]
N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, and C. A. Raffel, Scaling data-constrained language models, Advances in Neural Information Processing Systems36(2024)
work page 2024
-
[18]
X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, Scaling vision transformers, inProceedings of the IEEE/CVF conference on computer vision and pattern recognition(2022) pp. 12104–12113
work page 2022
-
[19]
I. M. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer, Getting ViT in shape: Scaling laws for compute-optimal model design, Advances in Neural Information Processing Systems36(2024)
work page 2024
-
[20]
G. Bachmann, S. Anagnostidis, and T. Hofmann, Scaling MLPs: A tale of inductive bias, Advances in Neural Information Processing Systems36(2024)
work page 2024
-
[21]
K. Everett, L. Xiao, M. Wortsman, A. A. Alemi, R. Novak, P. J. Liu, I. Gur, J. Sohl-Dickstein, L. P. Kaelbling, J. Lee, et al., Scaling exponents across parameterizations and optimizers, arXiv preprint arXiv:2407.05872 (2024)
- [22]
-
[23]
L. Pillaud-Vivien, A. Rudi, and F. Bach, Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes, Advances in Neural Information Processing Systems31(2018)
work page 2018
-
[24]
Y. Ying and M. Pontil, Online gradient descent learning algorithms, Foundations of Computational Mathematics8, 561 (2008)
work page 2008
-
[25]
B. Bordelon, A. Atanasov, and C. Pehlevan, A dynamical model of neural scaling laws, inForty-first International Con- ference on Machine Learning(2024)
work page 2024
-
[26]
E. Paquette, C. Paquette, L. Xiao, and J. Pennington, 4 + 3 phases of compute-optimal neural scaling laws, arXiv preprint arXiv:2405.15074 (2024)
- [27]
-
[28]
L. Carratino, A. Rudi, and L. Rosasco, Learning with sgd and random features, Advances in neural information processing systems31(2018). 17
work page 2018
-
[29]
Scaling and renormalization in high-dimensional regression
A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Scaling and renormalization in high-dimensional regression, arXiv preprint arXiv:2405.00592 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
B. Bordelon and C. Pehlevan, Learning curves for SGD on structured features, inInternational Conference on Learning Representations(2022)
work page 2022
-
[31]
Zinn-Justin,Quantum field theory and critical phenomena, Vol
J. Zinn-Justin,Quantum field theory and critical phenomena, Vol. 171 (Oxford university press, 2021)
work page 2021
-
[32]
A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems20(2007)
work page 2007
-
[33]
M. Potters and J.-P. Bouchaud,A first course in random matrix theory: for physicists, engineers and data scientists (Cambridge University Press, 2020)
work page 2020
- [34]
-
[35]
A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, Risk and cross validation in ridge regression with correlated samples, arXiv preprint arXiv:2408.04607 (2024)
-
[36]
J. A. Zavatone-Veth and C. Pehlevan, Learning curves for deep structured Gaussian feature models, inAdvances in Neural Information Processing Systems(2023)
work page 2023
-
[37]
F. Bach, High-dimensional analysis of double descent for linear regression with random projections, SIAM Journal on Mathematics of Data Science6, 26 (2024)
work page 2024
-
[38]
A. Knowles and J. Yin, Anisotropic local laws for random matrices, Probability Theory and Related Fields169, 257 (2017)
work page 2017
- [39]
-
[40]
D. Schr¨ oder, H. Cui, D. Dmitriev, and B. Loureiro, Deterministic equivalent and error universality of deep random features learning, Journal of Statistical Mechanics: Theory and Experiment2024, 104017 (2024)
work page 2024
-
[41]
L. Defilippis, B. Loureiro, and T. Misiakiewicz, Dimension-free deterministic equivalents for random feature regression, arXiv preprint arXiv:2405.15699 (2024)
-
[42]
T. Misiakiewicz and B. Saeed, A non-asymptotic theory of kernel ridge regression: deterministic equivalents, test error, and GCV estimator, arXiv preprint arXiv:2403.08938 (2024)
-
[43]
A. Canatar, B. Bordelon, and C. Pehlevan, Spectral bias and task-model alignment explain generalization in kernel regres- sion and infinitely wide neural networks, Nature communications12, 2914 (2021)
work page 2021
-
[44]
C. Paquette, K. Lee, F. Pedregosa, and E. Paquette, Sgd in the large: Average-case analysis, asymptotics, and stepsize criticality, inProceedings of Thirty Fourth Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 134, edited by M. Belkin and S. Kpotufe (PMLR, 2021) pp. 3548–3626
work page 2021
-
[45]
C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, inAdvances in Neural Information Processing Systems, Vol. 35, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 35984–35999
work page 2022
-
[46]
C. Paquette, E. Paquette, B. Adlam, and J. Pennington, Implicit regularization or implicit conditioning? exact risk trajectories of sgd in high dimensions, Advances in Neural Information Processing Systems35, 35984 (2022)
work page 2022
-
[47]
A. Strominger,Lectures on the infrared structure of gravity and gauge theory(Princeton University Press, 2018)
work page 2018
-
[48]
B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mezard, and L. Zdeborov´ a, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems34, 18137 (2021)
work page 2021
-
[49]
P. Patil, J.-H. Du, and R. J. Tibshirani, Optimal ridge regularization for out-of-distribution prediction, arXiv (2024), arXiv:2404.01233 [math.ST]
-
[50]
A. Canatar, B. Bordelon, and C. Pehlevan, Out-of-distribution generalization in kernel regression, inAdvances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 12600–12612
work page 2021
-
[51]
B. Adlam and J. Pennington, The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization, inInternational Conference on Machine Learning(PMLR, 2020) pp. 74–84. 18 Appendix A: All Two-Point Deterministic Equivalences In this section, we report all variants of the two-point deterministic equivalences. These extend ...
work page 2020
-
[52]
In the case where M=I, the last two deterministic equivalences are equal
Sanity Check of Two Point Functions We now takeA=Σ,B=WforWa white Wishart and consider the matrix ˆΣ=Σ∗W. In the case where M=I, the last two deterministic equivalences are equal. Further, by takingλ=λ ′ so thatκ=κ ′ we get: ˆΣ( ˆΣ+λ) −2 ≃SΣ(Σ+κ) −2 −SκΣ(Σ+κ) −2 qtr[Σ(Σ+κ) −2] 1−γ =Σ(Σ+κ) −2S 1− −qκdf ′ 1 1−γ = 1 1−γ Σ(Σ+κ) −2S[1−γ+q(df 1 −df 2)] = 1 1−γ ...
-
[53]
Linear Regression Forcing Term We can write the empirical loss under gradient flow in terms of two time variables as: ˆF(t, t′)≡∆w ⊤ t ˆΣ∆wt′ = ¯w⊤e−t ˆΣ ˆΣe−t′ ˆΣ ¯w. Then, Fourier transforming in each variable separately and restricting to thet=t ′ diagonal yields: ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ( ˆΣ+iω) −1( ˆΣ+iω ′)−1 ¯w| {z } ˆF(ω,ω ′) . We now appl...
-
[54]
SGD in Linear Regression Here, to match the gradient flow term, we apply two Fourier transforms to the kernel and apply the two-point deterministic equivalence. Namely, we consider extendingK, ˆKto: K(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣΣ], ˆK(t, t′)≡Tr[e −t ˆΣ ˆΣe−t′ ˆΣ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω...
-
[55]
D P tr[Σ(Σ+iω 1)−1(Σ+iω ′ 1)−1]2 1−γ
-
[56]
Linear Random Features Forcing Term The empirical loss under gradient flow in bi-frequency space is given by ˆF(t) = Z ω,ω ′ eit(ω+ω ′) ¯w⊤ ˆΣ(F F ⊤ ˆΣ+iω) −1(F F ⊤ ˆΣ+iω ′)−1 ¯w. We now apply the deterministic equivalence (A.4) to obtain: ˆF(ω, ω ′)≃ 1 1−γ 1 ¯w⊤Σ(ΣF F ⊤ +iω 1)−1(ΣF F ⊤ +iω ′ 1)−1 ¯w. Finally, we apply the deterministic equivalence (A.1) ...
-
[57]
SGD in Linear Random Features We now compute deterministic equivalents for the kernel term, again defining the bi-temporal kernels by: K(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤Σ], ˆK(t, t′)≡Tr[e −tF F ⊤ ˆΣF F ⊤ ˆΣe−t′F F ⊤ ˆΣF F ⊤ ˆΣ]. We then perform Fourier transforms separately int, t ′ to obtain: K(t) =D Z ω,ω ′ eit(ω+ω ′)K(ω, ω′), ˆK(t) =D Z ...
-
[58]
Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′
We recognize again that the numerator and denominator depend only onγ 1 ≡ D P df2 F F ⊤Σ, We have already computed equivalents for this in the prior section. Similarly for the train kernel we get: ˆK(ω, ω′)≃Ddf 2 F F ⊤Σ(iω1, iω′ 1) 1−(iω 1)(iω′
-
[59]
Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated
D P tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1] 1−γ 1(iω1, iω′ 1) . Again equivalents for df 2 F F ⊤Σ andγ 1 are already calculated. It remains to apply a final deterministic equivalence (A.4) over the features to the last term to get: tr[ΣF F ⊤(ΣF F ⊤+iω 1)−1(ΣF F ⊤+iω ′ 1)−1]≃ tr[Σ(Σ+iω 2)−1(Σ+iω ′ 2)−1] 1− D N df2(iω2, iω′
-
[60]
Appendix C: Covariate Shift in Random Features In this section we write down the exact formula for the precise asymptotics of random feature regression when tested under covariate shift. These formulas are obtained from a straightforward though slightly tedious application of the two-point deterministic equivalences derived in the text
-
[61]
Gradient Flow Term We are interested in evaluating the test error out-of-distribution. Under gradient flow this can be written as: F(iω, iω ′) = ¯w⊤(iω+ ˆΣF F ⊤)−1Σ′F F ⊤(iω′ + ˆΣF F ⊤)−1 (F F ⊤)−1 ¯w. a. Data Average We apply (A.1) withM=A=ΣF F ⊤ to obtain. (iω+ ˆΣF F ⊤)−1ΣF F ⊤(iω′ + ˆΣF F ⊤)−1 ≃S W S′ W (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1 +S W SW γ...
-
[62]
tr[(iω2 +Σ) −1Σ′(iω′ 2 +Σ) −1] tr[(iω2 +Σ) −1Σ(iω′ 2 +Σ) −1]. 22 The remaining part of the forcing term that has not yet been evaluated can gain be rewritten via the push-through identity as: (iω1 +ΣF F ⊤)−1Σ′F F ⊤(iω′ 1 +ΣF F ⊤)−1(F F ⊤)−1 = (iω1 +ΣF F ⊤)−1Σ′(iω′ 1 +F F ⊤Σ)−1. We now apply (IV.2) withM=I,A=Σ,B=F F ⊤. After some rewriting, we obtain: FOOD...
-
[63]
The train kernel is (by definition) unaffected by distribution shift
SGD Kernel Term By applying a one-point deterministic equivalence, the OOD test kernel is: KOOD(ω, ω′)≃Tr[Σ ′Σ(Σ+iω 2)−1] + iω2 N Tr[Σ′(Σ+iω 2)−1] Tr[Σ(Σ+iω 2)−1]. The train kernel is (by definition) unaffected by distribution shift
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.