Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

Xiaoming Huo; Yijin Ni

arxiv: 2604.10814 · v1 · submitted 2026-04-12 · 💻 cs.LG · math.ST· stat.TH

Online Covariance Estimation in Averaged SGD: Improved Batch-Mean Rates and Minimax Optimality via Trajectory Regression

Yijin Ni , Xiaoming Huo This is my paper

Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3

classification 💻 cs.LG math.STstat.TH

keywords covariance estimationstochastic gradient descentaveraged SGDminimax ratesonline estimationHessian estimationtrajectory regressionLe Cam lower bound

0 comments

The pith

A trajectory-regression estimator achieves the minimax optimal rate for Hessian-free covariance estimation from averaged SGD trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the best possible rate for estimating the covariance matrix of the limiting distribution of Polyak-Ruppert averaged SGD, without direct access to the Hessian, is Theta of n to the power of minus (1 minus alpha) over 2. It improves earlier batch-means estimators from O of n to the minus (1-alpha) over 4 up to O of n to the minus (1-alpha) over 3 by re-tuning block growth, and then shows this is still slower than optimal. A new estimator that regresses the SGD increments directly onto the iterates recovers an estimate of the Hessian and reaches the faster optimal rate, matching a matching lower bound derived via Le Cam's method. This matters for online uncertainty quantification in large-scale stochastic optimization because it works with only the existing trajectory, O of d squared memory, and no extra derivative evaluations. The core bottleneck identified is that the SGD path itself accumulates information about the curvature only sublinearly.

Core claim

We establish the minimax rate Theta(n^{-(1-alpha)/2}) for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives Omega(n^{-(1-alpha)/2}), and a trajectory-regression estimator which estimates the Hessian by regressing SGD increments on iterates achieves O(n^{-(1-alpha)/2}), matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

What carries the argument

The trajectory-regression estimator, which estimates the Hessian matrix by regressing observed SGD increments against the sequence of iterates to extract curvature information from the optimization path itself.

If this is right

Re-tuned block growth improves the online batch-means estimator to O(n^{-(1-alpha)/3}) while keeping O(d^2) memory and no Hessian access.
A complete error decomposition separates variance, stationarity bias, and nonlinearity bias terms for the batch-means procedure.
A weighted-averaging variant of the batch-means estimator avoids hard truncation of blocks.
The minimax rate cannot be improved by any Hessian-free method because the Le Cam lower bound matches the upper bound achieved by regression on the trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sublinear information accumulation suggests that occasional direct Hessian probes at sparse times could lift the rate beyond the trajectory-only limit.
The same regression idea on increments may extend to estimating other functionals of the limiting distribution, such as higher moments or bias corrections, using only the SGD path.

Load-bearing premise

The SGD trajectory accumulates information about the Hessian only sublinearly through its natural drift under standard strong-convexity and smoothness conditions on the objective.

What would settle it

A direct numerical check, on a simple quadratic or logistic problem, of whether the operator-norm error of the trajectory-regression covariance estimator decays exactly as n to the power of minus (1-alpha) over 2 when alpha is fixed away from zero would confirm or refute the matching upper and lower bounds.

read the original abstract

We study online covariance matrix estimation for Polyak--Ruppert averaged stochastic gradient descent (SGD). The online batch-means estimator of Zhu, Chen and Wu (2023) achieves an operator-norm convergence rate of $O(n^{-(1-\alpha)/4})$, which yields $O(n^{-1/8})$ at the optimal learning-rate exponent $\alpha \rightarrow 1/2^+$. A rigorous per-block bias analysis reveals that re-tuning the block-growth parameter improves the batch-means rate to $O(n^{-(1-\alpha)/3})$, achieving $O(n^{-1/6})$. The modified estimator requires no Hessian access and preserves $O(d^2)$ memory. We provide a complete error decomposition into variance, stationarity bias, and nonlinearity bias components. A weighted-averaging variant that avoids hard truncation is also discussed. We establish the minimax rate $\Theta(n^{-(1-\alpha)/2})$ for Hessian-free covariance estimation from the SGD trajectory: a Le Cam lower bound gives $\Omega(n^{-(1-\alpha)/2})$, and a trajectory-regression estimator--which estimates the Hessian by regressing SGD increments on iterates--achieves $O(n^{-(1-\alpha)/2})$, matching the lower bound. The construction reveals that the bottleneck is the sublinear accumulation of information about the Hessian from the SGD drift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper improves the batch-means rate for online covariance estimation in averaged SGD and claims a matching minimax rate via trajectory regression, but the upper bound may rest on unshown conditioning of the path Gram matrix.

read the letter

The main things to know are that re-tuning the block-growth parameter lifts the batch-means operator-norm rate from O(n^{-(1-α)/4}) to O(n^{-(1-α)/3}), and that a new regression estimator on the SGD trajectory is said to hit the information-theoretic lower bound of Θ(n^{-(1-α)/2}). Both are presented as advances over the 2023 batch-means work, with an explicit error split into variance, stationarity bias, and nonlinearity bias, plus a weighted-averaging option that avoids truncation. The Le Cam lower bound is standard and correctly identifies the sublinear Hessian information from the drift as the bottleneck. The regression construction itself is a fresh idea for Hessian-free covariance recovery. These pieces are useful for anyone who wants sharper post-processing of large SGD runs without extra Hessian calls or O(d^2) memory blow-up. The citation pattern is clean and the assumptions are the usual strong-convexity-plus-smoothness-plus-noise package. The soft spot is the claimed upper bound. Regressing increments on iterates to recover the Hessian requires the empirical Gram matrix along the trajectory to be sufficiently well-conditioned at the right scale. The abstract gives no non-asymptotic lower bound on its smallest eigenvalue, and under the same assumptions that support the lower bound the averaged path can stay correlated enough to inflate the regression variance. If the full proofs absorb that error into the O(n^{-(1-α)/2}) term, the matching rate holds; otherwise the minimax claim is optimistic. This is a real but fixable gap rather than a fatal one. The paper is aimed at people working on statistical inference after stochastic optimization. A reader interested in online covariance or minimax rates in averaged SGD will find concrete rate improvements and a new estimator worth checking. It deserves a serious referee to verify the regression analysis and the conditioning step.

Referee Report

2 major / 1 minor

Summary. The paper studies online covariance matrix estimation for Polyak-Ruppert averaged SGD. It shows that re-tuning the block-growth parameter improves the online batch-means estimator from O(n^{-(1-α)/4}) to O(n^{-(1-α)/3}) while preserving O(d²) memory and no Hessian access; provides a complete error decomposition into variance, stationarity bias, and nonlinearity bias; introduces a weighted-averaging variant; and establishes the minimax rate Θ(n^{-(1-α)/2}) via a Le Cam lower bound of Ω(n^{-(1-α)/2}) matched by an O(n^{-(1-α)/2}) upper bound from a trajectory-regression estimator that recovers the Hessian by regressing SGD increments on iterates. The bottleneck is identified as sublinear accumulation of Hessian information from the SGD drift.

Significance. If the claims hold, the results tighten practical rates for Hessian-free online covariance estimation from SGD trajectories and clarify the information-theoretic limit. The explicit three-component error decomposition and the matching minimax bounds (with the lower bound via Le Cam and upper bound via explicit regression) are strengths that would advance the literature on averaged SGD statistics.

major comments (2)

[Abstract] Abstract: The trajectory-regression upper bound claims O(n^{-(1-α)/2}) by regressing increments Δx_t on iterates x_t to estimate the Hessian. This step implicitly requires the empirical Gram matrix formed by the trajectory to be well-conditioned at the required rate. Under only the standard strong-convexity/smoothness/noise assumptions that support the Le Cam lower bound Ω(n^{-(1-α)/2}), the averaged trajectory can be highly correlated, so the smallest eigenvalue of the design matrix may grow too slowly and inflate regression variance beyond the claimed rate. An explicit non-asymptotic lower bound on λ_min of the trajectory Gram matrix (or a proof that the regression error is absorbed into the O term) is needed to confirm the upper bound matches the lower bound.
[Error decomposition and batch-means analysis] The per-block bias analysis section: The claim that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-α)/3}) rests on a rigorous per-block bias analysis. Without the explicit assumptions on the objective (strong convexity, smoothness, noise structure) and the full decomposition showing how the three bias/variance terms combine to yield the improved exponent, it is unclear whether the rate holds uniformly or only under additional conditions not required for the Le Cam lower bound.

minor comments (1)

[Abstract] The abstract mentions a weighted-averaging variant that avoids hard truncation but provides no rate or implementation details; a brief statement of its convergence properties would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify two areas where additional clarity would strengthen the manuscript. We address each point below and will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The trajectory-regression upper bound claims O(n^{-(1-α)/2}) by regressing increments Δx_t on iterates x_t to estimate the Hessian. This step implicitly requires the empirical Gram matrix formed by the trajectory to be well-conditioned at the required rate. Under only the standard strong-convexity/smoothness/noise assumptions that support the Le Cam lower bound Ω(n^{-(1-α)/2}), the averaged trajectory can be highly correlated, so the smallest eigenvalue of the design matrix may grow too slowly and inflate regression variance beyond the claimed rate. An explicit non-asymptotic lower bound on λ_min of the trajectory Gram matrix (or a proof that the regression error is absorbed into the O term) is needed to confirm the upper bound matches the lower bound.

Authors: We appreciate the referee's observation on the design-matrix conditioning. Under the paper's standard assumptions (μ-strong convexity, L-smoothness, and sub-Gaussian noise), the persistent stochastic noise in the SGD increments ensures that the trajectory Gram matrix accumulates sufficient excitation; our existing upper-bound proof already shows that any resulting regression variance remains absorbed in the O(n^{-(1-α)/2}) term. To make the argument fully explicit and non-asymptotic, we will insert a supporting lemma that lower-bounds λ_min of the empirical Gram matrix and verifies that the regression error does not exceed the claimed rate. revision: yes
Referee: [Error decomposition and batch-means analysis] The per-block bias analysis section: The claim that re-tuning the block-growth parameter improves the batch-means rate to O(n^{-(1-α)/3}) rests on a rigorous per-block bias analysis. Without the explicit assumptions on the objective (strong convexity, smoothness, noise structure) and the full decomposition showing how the three bias/variance terms combine to yield the improved exponent, it is unclear whether the rate holds uniformly or only under additional conditions not required for the Le Cam lower bound.

Authors: The per-block bias analysis is carried out under precisely the same assumptions used for the Le Cam lower bound: μ-strong convexity, L-smoothness, and sub-Gaussian noise with bounded variance. The manuscript already contains the complete three-component decomposition (variance, stationarity bias, nonlinearity bias) and shows how re-tuning the block-growth parameter balances these terms to obtain the improved O(n^{-(1-α)/3}) rate. For improved readability we will restate the assumptions at the opening of the batch-means section and add a short paragraph summarizing the combination of the three error terms. revision: yes

Circularity Check

0 steps flagged

No circularity: independent Le Cam lower bound and explicit regression upper bound

full rationale

The paper's central minimax claim rests on a Le Cam lower bound establishing Ω(n^{-(1-α)/2}) and a separately constructed trajectory-regression estimator (regressing SGD increments on iterates to recover the Hessian) that achieves the matching O(n^{-(1-α)/2}) upper bound. Neither step reduces by the paper's own equations to a fitted quantity, self-definition, or self-citation chain; the batch-means improvement cites Zhu, Chen and Wu (2023) by different authors as external baseline. The derivation is self-contained against the stated assumptions on strong convexity, smoothness, and noise, with no load-bearing step that collapses to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The claims rest on standard domain assumptions for SGD analysis (strong convexity, smoothness, bounded noise moments) that are not enumerated in the abstract, plus the implicit modeling choice that the Hessian can be recovered by linear regression on trajectory increments.

free parameters (1)

block-growth parameter
Re-tuned from prior work to achieve the improved (1-α)/3 exponent; its specific functional form is not given in the abstract.

axioms (1)

domain assumption Standard SGD assumptions (strong convexity, smoothness, noise structure) hold so that the stated rates and Le Cam lower bound apply.
Required for both the batch-means analysis and the minimax result; not listed explicitly in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1503 out tokens · 67169 ms · 2026-05-10T15:28:43.031675+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

(2024).Learning Theory from First Principles

BACH, F. (2024).Learning Theory from First Principles. MIT Press

work page 2024
[2]

D., TONG, X

CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical Inference for Model Parameters in Stochastic Gradient Descent.Annals of Statistics48251–273

work page 2020
[3]

COVER, T. M. and THOMAS, J. A. (2006).Elements of Information Theory, 2nd ed. Wiley

work page 2006
[4]

and YANG, L

FANG, Y., XU, J. and YANG, L. (2018). Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator.Journal of Machine Learning Research193053–3073

work page 2018
[5]

and NA, S

JIANG, L., ROY, A., BALASUBRAMANIAN, K., DAVIS, D., DRUSVYATSKIY, D. and NA, S. (2025). Online Covariance Estimation in Nonsmooth Stochastic Approximation. InProceedings of the 38th Conference on Learning Theory (COLT). arXiv:2502.05305

work page arXiv 2025
[6]

LEE, S., LIAO, Y., SEO, M. H. and SHIN, Y. (2022). Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling. InProceedings of the AAAI Conference on Artificial Intelligence. arXiv:2106.03156

work page arXiv 2022
[7]

arXiv preprint arXiv:2212.01259 , year=

LUO, Y., HUO, X. and MEI, Y. (2022). Covariance Estimators for the ROOT-SGD Algorithm in Online Learning. arXiv preprint arXiv:2212.01259

work page arXiv 2022
[8]

and BACH, F

MOULINES, E. and BACH, F. (2011). Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2011
[9]

PAULIN, D. (2015). Concentration Inequalities for Markov Chains by Marton Couplings and Spectral Methods. Electronic Journal of Probability201–32. ONLINE COV ARIANCE ESTIMATION IN A VERAGED SGD25

work page 2015
[10]

POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of Stochastic Approximation by Averaging.SIAM Journal on Control and Optimization30838–855

work page 1992
[11]

arXiv preprint arXiv:2308.01481 , year =

ROY, A. and BALASUBRAMANIAN, K. (2023). Online Covariance Estimation for Stochastic Gradient Descent under Markovian Sampling.arXiv preprint arXiv:2308.01481

work page arXiv 2023
[12]

RUPPERT, D. (1988). Efficient Estimations from a Slowly Convergent Robbins–Monro Process Technical Report No. 781, School of Operations Research and Industrial Engineering, Cornell University

work page 1988
[13]

and NAUMOV, A

SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Approxima- tion and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applica- tions to TD Learning. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2405.16644

work page arXiv 2024
[14]

Singh, A

SINGH, R., SHUKLA, A. and VATS, D. (2025). On the Utility of Equal Batch Sizes for Inference in Stochastic Gradient Descent.Journal of Machine Learning Research26. arXiv:2303.07706

work page arXiv 2025
[15]

SU, W. J. and ZHU, Y. (2023). HiGrad: Uncertainty Quantification for Online Learning and Stochastic Approxi- mation.Journal of Machine Learning Research24. arXiv:1802.04876

work page arXiv 2023
[16]

TROPP, J. A. (2015).An Introduction to Matrix Concentration Inequalities8(1–2). Foundations and Trends in Machine Learning

work page 2015
[17]

(2018).High-Dimensional Probability: An Introduction with Applications in Data Science

VERSHYNIN, R. (2018).High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press

work page 2018
[18]

YU, B. (1997). Assouad, Fano, and Le Cam.Festschrift for Lucien Le Cam423–435

work page 1997
[19]

and WU, W

ZHU, W., CHEN, X. and WU, W. B. (2023). Online Covariance Matrix Estimation in Stochastic Gradient De- scent.Journal of the American Statistical Association118393–404

work page 2023

[1] [1]

(2024).Learning Theory from First Principles

BACH, F. (2024).Learning Theory from First Principles. MIT Press

work page 2024

[2] [2]

D., TONG, X

CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical Inference for Model Parameters in Stochastic Gradient Descent.Annals of Statistics48251–273

work page 2020

[3] [3]

COVER, T. M. and THOMAS, J. A. (2006).Elements of Information Theory, 2nd ed. Wiley

work page 2006

[4] [4]

and YANG, L

FANG, Y., XU, J. and YANG, L. (2018). Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator.Journal of Machine Learning Research193053–3073

work page 2018

[5] [5]

and NA, S

JIANG, L., ROY, A., BALASUBRAMANIAN, K., DAVIS, D., DRUSVYATSKIY, D. and NA, S. (2025). Online Covariance Estimation in Nonsmooth Stochastic Approximation. InProceedings of the 38th Conference on Learning Theory (COLT). arXiv:2502.05305

work page arXiv 2025

[6] [6]

LEE, S., LIAO, Y., SEO, M. H. and SHIN, Y. (2022). Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling. InProceedings of the AAAI Conference on Artificial Intelligence. arXiv:2106.03156

work page arXiv 2022

[7] [7]

arXiv preprint arXiv:2212.01259 , year=

LUO, Y., HUO, X. and MEI, Y. (2022). Covariance Estimators for the ROOT-SGD Algorithm in Online Learning. arXiv preprint arXiv:2212.01259

work page arXiv 2022

[8] [8]

and BACH, F

MOULINES, E. and BACH, F. (2011). Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2011

[9] [9]

PAULIN, D. (2015). Concentration Inequalities for Markov Chains by Marton Couplings and Spectral Methods. Electronic Journal of Probability201–32. ONLINE COV ARIANCE ESTIMATION IN A VERAGED SGD25

work page 2015

[10] [10]

POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of Stochastic Approximation by Averaging.SIAM Journal on Control and Optimization30838–855

work page 1992

[11] [11]

arXiv preprint arXiv:2308.01481 , year =

ROY, A. and BALASUBRAMANIAN, K. (2023). Online Covariance Estimation for Stochastic Gradient Descent under Markovian Sampling.arXiv preprint arXiv:2308.01481

work page arXiv 2023

[12] [12]

RUPPERT, D. (1988). Efficient Estimations from a Slowly Convergent Robbins–Monro Process Technical Report No. 781, School of Operations Research and Industrial Engineering, Cornell University

work page 1988

[13] [13]

and NAUMOV, A

SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Approxima- tion and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applica- tions to TD Learning. InAdvances in Neural Information Processing Systems (NeurIPS). arXiv:2405.16644

work page arXiv 2024

[14] [14]

Singh, A

SINGH, R., SHUKLA, A. and VATS, D. (2025). On the Utility of Equal Batch Sizes for Inference in Stochastic Gradient Descent.Journal of Machine Learning Research26. arXiv:2303.07706

work page arXiv 2025

[15] [15]

SU, W. J. and ZHU, Y. (2023). HiGrad: Uncertainty Quantification for Online Learning and Stochastic Approxi- mation.Journal of Machine Learning Research24. arXiv:1802.04876

work page arXiv 2023

[16] [16]

TROPP, J. A. (2015).An Introduction to Matrix Concentration Inequalities8(1–2). Foundations and Trends in Machine Learning

work page 2015

[17] [17]

(2018).High-Dimensional Probability: An Introduction with Applications in Data Science

VERSHYNIN, R. (2018).High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press

work page 2018

[18] [18]

YU, B. (1997). Assouad, Fano, and Le Cam.Festschrift for Lucien Le Cam423–435

work page 1997

[19] [19]

and WU, W

ZHU, W., CHEN, X. and WU, W. B. (2023). Online Covariance Matrix Estimation in Stochastic Gradient De- scent.Journal of the American Statistical Association118393–404

work page 2023