Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
A trajectory-regression estimator achieves the minimax optimal rate for Hessian-free covariance estimation from averaged SGD trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish the minimax rate Theta(n^{-(1-alpha)/2}) for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives Omega(n^{-(1-alpha)/2}), and a trajectory-regression estimator which estimates the Hessian by regressing SGD increments on iterates achieves O(n^{-(1-alpha)/2}), matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.
What carries the argument
The trajectory-regression estimator, which estimates the Hessian matrix by regressing observed SGD increments against the sequence of iterates to extract curvature information from the optimization path itself.
If this is right
- Re-tuned block growth improves the online batch-means estimator to O(n^{-(1-alpha)/3}) while keeping O(d^2) memory and no Hessian access.
- A complete error decomposition separates variance, stationarity bias, and nonlinearity bias terms for the batch-means procedure.
- A weighted-averaging variant of the batch-means estimator avoids hard truncation of blocks.
- The minimax rate cannot be improved by any Hessian-free method because the Le Cam lower bound matches the upper bound achieved by regression on the trajectory.
Where Pith is reading between the lines
- The sublinear information accumulation suggests that occasional direct Hessian probes at sparse times could lift the rate beyond the trajectory-only limit.
- The same regression idea on increments may extend to estimating other functionals of the limiting distribution, such as higher moments or bias corrections, using only the SGD path.
Load-bearing premise
The SGD trajectory accumulates information about the Hessian only sublinearly through its natural drift under standard strong-convexity and smoothness conditions on the objective.
What would settle it
A direct numerical check, on a simple quadratic or logistic problem, of whether the operator-norm error of the trajectory-regression covariance estimator decays exactly as n to the power of minus (1-alpha) over 2 when alpha is fixed away from zero would confirm or refute the matching upper and lower bounds.
read the original abstract
We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-\alpha)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $\alpha \rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-\alpha)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $\Theta(n^{-(1-\alpha)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $\Omega(n^{-(1-\alpha)/2})$, and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves $O(n^{-(1-\alpha)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies online covariance matrix estimation for Polyak-Ruppert averaged SGD. It shows that re-tuning the block-growth parameter improves the online batch-means estimator from O(n^{-(1-α)/4}) to O(n^{-(1-α)/3}) while preserving O(d²) memory and no Hessian access; provides a complete error decomposition into variance, stationarity bias, and nonlinearity bias; introduces a weighted-averaging variant; and establishes the minimax rate Θ(n^{-(1-α)/2}) via a Le Cam lower bound of Ω(n^{-(1-α)/2}) matched by an O(n^{-(1-α)/2}) upper bound from a trajectory-regression estimator that recovers the Hessian by regressing SGD increments on iterates. The bottleneck is identified as sublinear accumulation of Hessian information from the SGD drift.
Significance. If the claims hold, the results tighten practical rates for Hessian-free online covariance estimation from SGD trajectories and clarify the information-theoretic limit. The explicit three-component error decomposition and the matching minimax bounds (with the lower bound via Le Cam and upper bound via explicit regression) are strengths that would advance the literature on averaged SGD statistics.
major comments (2)
- [Abstract] Abstract: The trajectory-regression upper bound claims O(n^{-(1-α)/2}) by regressing increments Δx_t on iterates x_t to estimate the Hessian. This step implicitly requires the empirical Gram matrix formed by the trajectory to be well-conditioned at the required rate. Under only the standard strong-convexity/smoothness/noise assumptions that support the Le Cam lower bound Ω(n^{-(1-α)/2}), the averaged trajectory can be highly correlated, so the smallest eigenvalue of the design matrix may grow too slowly and inflate regression variance beyond the claimed rate. An explicit non-asymptotic lower bound on λ_min of the trajectory Gram matrix (or a proof that the regression error is absorbed into the O term) is needed to confirm the upper bound matches the lower bound.
- [Error decomposition and batch-means analysis] The per-block bias analysis section: The claim that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-α)/3}) rests on a rigorous per-block bias analysis. Without the explicit assumptions on the objective (strong convexity, smoothness, noise structure) and the full decomposition showing how the three bias/variance terms combine to yield the improved exponent, it is unclear whether the rate holds uniformly or only under additional conditions not required for the Le Cam lower bound.
minor comments (1)
- [Abstract] The abstract mentions a weighted-averaging variant that avoids hard truncation but provides no rate or implementation details; a brief statement of its convergence properties would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify two areas where additional clarity would strengthen the manuscript. We address each point below and will revise accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The trajectory-regression upper bound claims O(n^{-(1-α)/2}) by regressing increments Δx_t on iterates x_t to estimate the Hessian. This step implicitly requires the empirical Gram matrix formed by the trajectory to be well-conditioned at the required rate. Under only the standard strong-convexity/smoothness/noise assumptions that support the Le Cam lower bound Ω(n^{-(1-α)/2}), the averaged trajectory can be highly correlated, so the smallest eigenvalue of the design matrix may grow too slowly and inflate regression variance beyond the claimed rate. An explicit non-asymptotic lower bound on λ_min of the trajectory Gram matrix (or a proof that the regression error is absorbed into the O term) is needed to confirm the upper bound matches the lower bound.
Authors: We appreciate the referee's observation on the design-matrix conditioning. Under the paper's standard assumptions (μ-strong convexity, L-smoothness, and sub-Gaussian noise), the persistent stochastic noise in the SGD increments ensures that the trajectory Gram matrix accumulates sufficient excitation; our existing upper-bound proof already shows that any resulting regression variance remains absorbed in the O(n^{-(1-α)/2}) term. To make the argument fully explicit and non-asymptotic, we will insert a supporting lemma that lower-bounds λ_min of the empirical Gram matrix and verifies that the regression error does not exceed the claimed rate. revision: yes
-
Referee: [Error decomposition and batch-means analysis] The per-block bias analysis section: The claim that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-α)/3}) rests on a rigorous per-block bias analysis. Without the explicit assumptions on the objective (strong convexity, smoothness, noise structure) and the full decomposition showing how the three bias/variance terms combine to yield the improved exponent, it is unclear whether the rate holds uniformly or only under additional conditions not required for the Le Cam lower bound.
Authors: The per-block bias analysis is carried out under precisely the same assumptions used for the Le Cam lower bound: μ-strong convexity, L-smoothness, and sub-Gaussian noise with bounded variance. The manuscript already contains the complete three-component decomposition (variance, stationarity bias, nonlinearity bias) and shows how re-tuning the block-growth parameter balances these terms to obtain the improved O(n^{-(1-α)/3}) rate. For improved readability we will restate the assumptions at the opening of the batch-means section and add a short paragraph summarizing the combination of the three error terms. revision: yes
Circularity Check
No circularity: independent Le Cam lower bound and explicit regression upper bound
full rationale
The paper's central minimax claim rests on a Le Cam lower bound establishing Ω(n^{-(1-α)/2}) and a separately constructed trajectory-regression estimator (regressing SGD increments on iterates to recover the Hessian) that achieves the matching O(n^{-(1-α)/2}) upper bound. Neither step reduces by the paper's own equations to a fitted quantity, self-definition, or self-citation chain; the batch-means improvement cites Zhu, Chen and Wu (2023) by different authors as external baseline. The derivation is self-contained against the stated assumptions on strong convexity, smoothness, and noise, with no load-bearing step that collapses to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- block-growth parameter
axioms (1)
- domain assumption Standard SGD assumptions (strong convexity, smoothness, noise structure) hold so that the stated rates and Le Cam lower bound apply.
Reference graph
Works this paper leans on
-
[1]
(2024).Learning Theory from First Principles
BACH, F. (2024).Learning Theory from First Principles. MIT Press
work page 2024
-
[2]
CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical Inference for Model Parameters in Stochastic Gradient Descent.Annals of Statistics48251–273
work page 2020
-
[3]
COVER, T. M. and THOMAS, J. A. (2006).Elements of Information Theory, 2nd ed. Wiley
work page 2006
-
[4]
FANG, Y., XU, J. and YANG, L. (2018). Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator.Journal of Machine Learning Research193053–3073
work page 2018
- [5]
- [6]
-
[7]
arXiv preprint arXiv:2212.01259 , year=
LUO, Y., HUO, X. and MEI, Y. (2022). Covariance Estimators for the ROOT-SGD Algorithm in Online Learning. arXiv preprint arXiv:2212.01259
-
[8]
MOULINES, E. and BACH, F. (2011). Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. InAdvances in Neural Information Processing Systems (NeurIPS)
work page 2011
-
[9]
PAULIN, D. (2015). Concentration Inequalities for Markov Chains by Marton Couplings and Spectral Methods. Electronic Journal of Probability201–32. ONLINE COV ARIANCE ESTIMATION IN A VERAGED SGD25
work page 2015
-
[10]
POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of Stochastic Approximation by Averaging.SIAM Journal on Control and Optimization30838–855
work page 1992
-
[11]
arXiv preprint arXiv:2308.01481 , year =
ROY, A. and BALASUBRAMANIAN, K. (2023). Online Covariance Estimation for Stochastic Gradient Descent under Markovian Sampling.arXiv preprint arXiv:2308.01481
-
[12]
RUPPERT, D. (1988). Efficient Estimations from a Slowly Convergent Robbins–Monro Process Technical Report No. 781, School of Operations Research and Industrial Engineering, Cornell University
work page 1988
-
[13]
SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Approxima- tion and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applica- tions to TD Learning. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2405.16644
- [14]
- [15]
-
[16]
TROPP, J. A. (2015).An Introduction to Matrix Concentration Inequalities8(1–2). Foundations and Trends in Machine Learning
work page 2015
-
[17]
(2018).High-Dimensional Probability: An Introduction with Applications in Data Science
VERSHYNIN, R. (2018).High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press
work page 2018
-
[18]
YU, B. (1997). Assouad, Fano, and Le Cam.Festschrift for Lucien Le Cam423–435
work page 1997
- [19]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.