pith. machine review for the scientific record

arXiv: 2604.07323 · v1 · submitted 2026-04-08 · 📊 stat.ML · cs.LG · math.PR

Recognition: no theorem link

Gaussian Approximation for Asynchronous Q-learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.PR
keywords asynchronous Q-learning · high-dimensional CLT · martingale differences · Polyak-Ruppert averaging · convergence rates · geometric ergodicity · reinforcement learning · stochastic approximation

The pith

Averaged asynchronous Q-learning satisfies a high-dimensional central limit theorem with error rate up to n^{-1/6} log^4(nSA) over hyper-rectangles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit rates showing how closely the Polyak-Ruppert averaged iterates of asynchronous Q-learning match a multivariate normal distribution in high dimensions. The result applies to the algorithm run with polynomial stepsizes between k^{-1} and k^{-1/2}, and it holds uniformly over all axis-aligned hyper-rectangles in the space of Q-values. A sympathetic reader would care because the rate supplies a concrete error bound that can be used to justify asymptotic normality for downstream tasks such as constructing confidence sets for value functions or policies. The proof proceeds by establishing a new high-dimensional central limit theorem for sums of martingale differences that matches the structure of the Q-learning updates.
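To fix ideas, here is a minimal sketch of the algorithm class the paper analyzes: asynchronous Q-learning with a polynomial stepsize k^{-ω} and a running Polyak-Ruppert average. The toy MDP, the uniform behavior policy, and all constants below are invented for illustration; only the update rule and the averaging mirror the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: P[s, a] is a next-state distribution,
# R[s, a] an expected reward. Sizes and constants are invented;
# omega must lie in (1/2, 1] for the paper's regime.
S, A, gamma, omega = 3, 2, 0.9, 0.7
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))  # Polyak-Ruppert average of the iterates
s = 0
for k in range(1, 5001):
    a = int(rng.integers(A))                # behavior policy: uniform
    s_next = int(rng.choice(S, p=P[s, a]))
    target = R[s, a] + gamma * Q[s_next].max()
    # Asynchronous update: only the visited (s, a) entry moves,
    # with polynomial stepsize k^{-omega}.
    Q[s, a] += k ** (-omega) * (target - Q[s, a])
    Q_bar += (Q - Q_bar) / k                # running average of iterates
    s = s_next
```

The CLT in the paper concerns the distribution of `Q_bar`, not of the noisier last iterate `Q`.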

Core claim

Under the assumption that the sequence of state-action-next-state triples forms a uniformly geometrically ergodic Markov chain, the Polyak-Ruppert averaged iterates of asynchronous Q-learning with stepsize k^{-ω} for ω in (1/2,1] obey a high-dimensional central limit theorem whose approximation error, measured in the supremum norm over hyper-rectangles, is of order n^{-1/6} log^4(nSA). The authors obtain this bound by proving a general high-dimensional CLT for martingale difference sequences and then verifying that the Q-learning noise satisfies the required conditions; they additionally derive bounds on high-order moments of the algorithm's final (non-averaged) iterate.
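Schematically, with $\bar{Q}_n$ the averaged iterate, $Q^{\star}$ the optimal Q-function, $\Sigma$ the asymptotic covariance, and $\mathcal{R}$ the class of axis-aligned hyper-rectangles in $\mathbb{R}^{SA}$ (exact centering, scaling, and constants are as defined in the paper), the claim reads:

```latex
\sup_{B \in \mathcal{R}} \left| \mathbb{P}\!\left( \sqrt{n}\,\bigl(\bar{Q}_n - Q^{\star}\bigr) \in B \right)
  - \mathbb{P}\!\left( \mathcal{N}(0, \Sigma) \in B \right) \right|
  \;\lesssim\; n^{-1/6} \log^{4}(n S A).
```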

What carries the argument

High-dimensional central limit theorem for sums of martingale differences, applied to control the distance between the distribution of the averaged Q-learning iterates and the normal distribution.

If this is right

  • The averaged Q-values may be treated as asymptotically normal, permitting construction of approximate confidence regions whose coverage error vanishes at rate n^{-1/6} up to logarithmic factors.
  • Explicit logarithmic factors in the bound make the result usable for finite-sample guidance when the number of states and actions is large.
  • Moment bounds on the last iterate supply auxiliary control that can be combined with the CLT for non-asymptotic tail estimates.
  • The new martingale CLT stands as a reusable tool for other stochastic approximation schemes whose noise satisfies similar dependence and moment conditions.
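As an illustration of the first point above, a per-entry confidence interval follows directly from the normal approximation. This sketch assumes a plug-in estimate `sigma_hat` of the entry's asymptotic standard deviation; how to estimate that quantity is not specified by this summary.

```python
from math import sqrt

def q_confidence_interval(q_bar_entry, sigma_hat, n, alpha=0.05):
    """Normal-approximation CI for one averaged Q-value.

    sigma_hat is a hypothetical plug-in estimate of the entry's
    asymptotic standard deviation; the CLT justifies the normal
    quantile only up to the n^{-1/6} log^4(nSA) coverage error.
    """
    if alpha != 0.05:
        raise NotImplementedError("only the 95% quantile is hard-coded")
    z = 1.959963984540054                  # standard normal 97.5% quantile
    half_width = z * sigma_hat / sqrt(n)
    return q_bar_entry - half_width, q_bar_entry + half_width

lo, hi = q_confidence_interval(2.0, sigma_hat=0.5, n=10_000)
```

Joint regions over many (s, a) entries would instead use the hyper-rectangle bound directly, with quantiles of the maximum coordinate of the limiting Gaussian.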

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rate suggests that simple averaging suffices for asymptotic inference in large tabular reinforcement-learning problems provided geometric ergodicity holds.
  • Similar arguments could be tested on other single-time-scale algorithms such as SARSA or fitted value iteration under comparable mixing assumptions.
  • Empirical checks on synthetic chains with tunable mixing rates would reveal whether the log^4 factor is sharp or can be improved.

Load-bearing premise

The sequence of observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain.
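In the standard formulation (notation ours), uniform geometric ergodicity of the chain $Z_k = (s_k, a_k, s_{k+1})$ with transition kernel $P$ and stationary law $\pi$ asks for constants $C < \infty$ and $\rho \in (0,1)$ such that

```latex
\sup_{z} \, \bigl\| P^{k}(z, \cdot) - \pi \bigr\|_{\mathrm{TV}} \;\le\; C\,\rho^{k} \qquad \text{for all } k \ge 0,
```

with the supremum over initial states making the bound uniform.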

What would settle it

Simulate asynchronous Q-learning on a Markov chain whose mixing time grows faster than any geometric rate and check whether the supremum error over hyper-rectangles fails to decay at the claimed n^{-1/6} order.
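A minimal one-dimensional version of that protocol: replicate a Polyak-Ruppert-style average on a sticky two-state chain (mixing slows as `p_stay` approaches 1), standardize across replications, and measure the sup distance to the normal CDF, the 1-d analogue of the hyper-rectangle metric. The chain, sample sizes, and replication counts are invented; this only illustrates the shape of the experiment, not the paper's setup.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def ks_to_normal(samples):
    """Sup distance between the empirical CDF of the standardized
    samples and the standard normal CDF."""
    z = np.sort((samples - samples.mean()) / samples.std())
    phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    ecdf = np.arange(1, z.size + 1) / z.size
    return float(np.max(np.abs(ecdf - phi)))

def averaged_reward(n, p_stay=0.9):
    """Sample mean of the state indicator along a sticky two-state
    chain; mixing slows as p_stay -> 1."""
    s, total = 0, 0.0
    for _ in range(n):
        if rng.random() > p_stay:
            s = 1 - s
        total += s
    return total / n

for n in (100, 1000):
    reps = np.array([averaged_reward(n) for _ in range(300)])
    print(n, round(ks_to_normal(reps), 3))
```

Sweeping `p_stay` toward 1 (or replacing the chain with one whose mixing time grows with n) and tracking how the sup distance decays in n is the check the paper's claim invites.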

read the original abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega},\, \omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nS A)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm's last iterate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript derives finite-sample rates in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of asynchronous Q-learning using polynomial stepsizes k^{-ω} with ω ∈ (1/2, 1]. Under the assumption that the sequence (s_k, a_k, s_{k+1}) forms a uniformly geometrically ergodic Markov chain, the authors establish a Gaussian approximation rate of order up to n^{-1/6} log^4(n S A) over hyper-rectangles, where n is the number of samples. The argument proceeds by proving a new high-dimensional CLT for sums of martingale differences (of independent interest) together with high-order moment bounds on the algorithm's last iterate.

Significance. If the central claims hold, the paper supplies explicit finite-sample guarantees for asynchronous Q-learning in high dimensions, which is a meaningful advance for reinforcement learning theory. The new martingale-difference CLT and the moment bounds for the last iterate are technically substantive and may find use in related stochastic approximation settings. The explicit handling of the polynomial stepsize regime and the logarithmic dependence on the state-action space size are strengths.

minor comments (4)
  1. Abstract: the qualifier 'up to' in the rate n^{-1/6} log^4(nSA) should be accompanied by a brief statement of the precise range of ω for which this rate is attained, to avoid ambiguity for readers.
  2. Notation: the definition of the hyper-rectangle class and the precise normalization of the averaged iterates should be restated in the main theorem statement (rather than only in the preliminaries) for immediate readability.
  3. Proof of the moment bounds: several intermediate inequalities in the high-order moment analysis would benefit from one additional line of justification to make the passage from the last-iterate bound to the averaged process fully transparent.
  4. References: the introduction would be strengthened by citing the most recent high-dimensional CLT results for synchronous Q-learning or TD learning to better situate the novelty of the asynchronous martingale CLT.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the positive overall assessment. We appreciate the recommendation for minor revision and will incorporate any suggested changes accordingly.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives its finite-sample Gaussian approximation rate for Polyak-Ruppert averaged asynchronous Q-learning via a new high-dimensional CLT for martingale differences (of independent interest) together with high-order moment bounds on the last iterate. Both components are developed from standard probabilistic tools under the external assumption that the state-action-next-state sequence forms a uniformly geometrically ergodic Markov chain. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about the Markov chain; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The sequence of state-action-next-state triples forms a uniformly geometrically ergodic Markov chain
    Invoked to control mixing and obtain the stated CLT rate for the averaged iterates.

pith-pipeline@v0.9.0 · 5478 in / 1169 out tokens · 56676 ms · 2026-05-10T17:07:22.714612+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 15 canonical work pages · 2 internal anchors
