Gaussian Approximation for Asynchronous Q-learning
Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3
The pith
Averaged asynchronous Q-learning satisfies a high-dimensional central limit theorem over hyper-rectangles, with Gaussian approximation error of order up to n^{-1/6} log^4(nSA).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumption that the sequence of state-action-next-state triples forms a uniformly geometrically ergodic Markov chain, the Polyak-Ruppert averaged iterates of asynchronous Q-learning with stepsize k^{-ω}, ω ∈ (1/2, 1], obey a high-dimensional central limit theorem whose approximation error, measured in the supremum over hyper-rectangles, is of order up to n^{-1/6} log^4(nSA), where n is the number of samples and S and A are the numbers of states and actions. The authors obtain this bound by proving a general high-dimensional CLT for sums of martingale differences and then verifying that the Q-learning noise satisfies its conditions; they additionally derive bounds on high-order moments of the algorithm's final (non-averaged) iterate.
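The update scheme named in the claim can be sketched in a few lines. The uniform behaviour policy, the environment callback, and all names below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def averaged_async_q_learning(sample_step, S, A, gamma, n, omega=0.7, seed=0):
    """One run of asynchronous Q-learning with Polyak-Ruppert averaging.

    sample_step(rng, s, a) -> (reward, next_state) draws one transition;
    the stepsize is alpha_k = k^{-omega} with omega in (1/2, 1], as in the
    claim.  The uniform behaviour policy is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    Q_bar = np.zeros((S, A))          # running Polyak-Ruppert average
    s = 0
    for k in range(1, n + 1):
        a = int(rng.integers(A))      # uniform exploration over actions
        r, s_next = sample_step(rng, s, a)
        alpha = k ** (-omega)
        # asynchronous: only the visited (s, a) entry is updated
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        Q_bar += (Q - Q_bar) / k      # \bar{Q}_n = (1/n) * sum_{k<=n} Q_k
        s = s_next
    return Q_bar
```

As a sanity check, on a one-state, one-action MDP with unit reward and discount 1/2, the averaged iterate settles near the fixed point 1/(1 - 1/2) = 2.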
What carries the argument
High-dimensional central limit theorem for sums of martingale differences, applied to control the distance between the distribution of the averaged Q-learning iterates and the normal distribution.
If this is right
- The averaged Q-values may be treated as asymptotically normal, permitting construction of approximate confidence regions whose coverage error vanishes at rate n^{-1/6}.
- Explicit logarithmic factors in the bound make the result usable for finite-sample guidance when the number of states and actions is large.
- Moment bounds on the last iterate supply auxiliary control that can be combined with the CLT for non-asymptotic tail estimates.
- The new martingale CLT stands as a reusable tool for other stochastic approximation schemes whose noise satisfies similar dependence and moment conditions.
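The first consequence above, approximate confidence regions from asymptotic normality, can be illustrated with a naive entrywise construction. The multi-run variance estimate and the Bonferroni correction over the S*A entries are our simplifications for the sketch, not the paper's procedure:

```python
import numpy as np
from statistics import NormalDist

def hyperrectangle_ci(q_bar_iterates, alpha=0.05):
    """Entrywise normal confidence intervals from repeated averaged iterates.

    q_bar_iterates: array of shape (m, S, A) holding m independent runs of
    the averaged Q-estimate.  Each entry is treated as approximately
    Gaussian (the CLT heuristic), and a Bonferroni correction over the
    S*A entries turns the union of intervals into a hyper-rectangle.
    """
    m, S, A = q_bar_iterates.shape
    mean = q_bar_iterates.mean(axis=0)
    se = q_bar_iterates.std(axis=0, ddof=1) / np.sqrt(m)
    z = NormalDist().inv_cdf(1 - alpha / (2 * S * A))  # Bonferroni quantile
    return mean - z * se, mean + z * se
```

Smaller alpha widens every side of the rectangle, since the Gaussian quantile grows as the per-entry error budget shrinks.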
Where Pith is reading between the lines
- The rate suggests that simple averaging suffices for asymptotic inference in large tabular reinforcement-learning problems provided geometric ergodicity holds.
- Similar arguments could be tested on other single-time-scale algorithms such as SARSA or fitted value iteration under comparable mixing assumptions.
- Empirical checks on synthetic chains with tunable mixing rates would reveal whether the log^4 factor is sharp or can be improved.
Load-bearing premise
The sequence of observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain.
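For a small finite chain, this premise is easy to probe numerically: uniform geometric ergodicity asks that the worst-case total-variation distance to stationarity decay like C·ρ^k for some ρ < 1, which shows up as a straight line on a log plot. The helper below is a generic sketch of that diagnostic, not anything from the paper:

```python
import numpy as np

def tv_mixing_curve(P, n_steps=20):
    """Worst-start total-variation distance to stationarity after k steps.

    P is a (S, S) row-stochastic transition matrix.  Uniform geometric
    ergodicity requires sup_s TV(P^k(s, .), pi) <= C * rho^k with rho < 1.
    """
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                  # stationary distribution
    Pk = np.eye(P.shape[0])
    curve = []
    for _ in range(n_steps):
        Pk = Pk @ P
        # TV distance for each start state, then worst case over starts
        curve.append(0.5 * np.abs(Pk - pi).sum(axis=1).max())
    return np.array(curve)
```

For the two-state chain with rows (0.9, 0.1) and (0.2, 0.8), the curve contracts by exactly the second eigenvalue 0.7 at every step.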
What would settle it
Simulate asynchronous Q-learning on a Markov chain whose mixing time grows faster than any geometric rate and check whether the supremum error over hyper-rectangles fails to decay at the claimed n^{-1/6} order.
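That experiment can be sketched end to end on a toy two-state "sticky" chain whose mixing slows as a stay-probability approaches 1, with a compact averaged Q-learning loop inlined so the sketch is self-contained. The environment, the reward, and the spread statistic are all our own illustrative choices:

```python
import numpy as np

def run_averaged_q(p_stay, n, gamma=0.9, omega=0.7, seed=0):
    """Averaged asynchronous Q-learning on a two-state 'sticky' chain.

    The state stays put with probability p_stay, so mixing slows as
    p_stay -> 1; reward 1 for matching the action to the state.  Toy
    environment of our own design, used only to probe the premise.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2))
    Q_bar = np.zeros((2, 2))
    s = 0
    for k in range(1, n + 1):
        a = int(rng.integers(2))
        s_next = s if rng.random() < p_stay else 1 - s
        r = float(s == a)
        Q[s, a] += k ** (-omega) * (r + gamma * Q[s_next].max() - Q[s, a])
        Q_bar += (Q - Q_bar) / k      # Polyak-Ruppert average
        s = s_next
    return Q_bar

def sup_spread(p_stay, n, runs=10):
    """Sup-norm spread of averaged iterates across independent runs; under
    the claimed CLT this should shrink with n when mixing is geometric."""
    qs = np.stack([run_averaged_q(p_stay, n, seed=r) for r in range(runs)])
    return float(np.abs(qs - qs.mean(axis=0)).max())
```

Plotting sup_spread against n for p_stay near 1/2 versus p_stay near 1 would show whether the sup-norm fluctuation still behaves as the claimed rate predicts when the mixing assumption is stressed.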
read the original abstract
In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega},\, \omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nS A)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm's last iterate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives finite-sample rates in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of asynchronous Q-learning using polynomial stepsizes k^{-ω} with ω ∈ (1/2, 1]. Under the assumption that the sequence (s_k, a_k, s_{k+1}) forms a uniformly geometrically ergodic Markov chain, the authors establish a Gaussian approximation rate of order up to n^{-1/6} log^4(n S A) over hyper-rectangles, where n is the number of samples. The argument proceeds by proving a new high-dimensional CLT for sums of martingale differences (of independent interest) together with high-order moment bounds on the algorithm's last iterate.
Significance. If the central claims hold, the paper supplies explicit finite-sample guarantees for asynchronous Q-learning in high dimensions, which is a meaningful advance for reinforcement learning theory. The new martingale-difference CLT and the moment bounds for the last iterate are technically substantive and may find use in related stochastic approximation settings. The explicit handling of the polynomial stepsize regime and the logarithmic dependence on the state-action space size are strengths.
minor comments (4)
- Abstract: the qualifier 'up to' in the rate n^{-1/6} log^4(nSA) should be accompanied by a brief statement of the precise range of ω for which this rate is attained, to avoid ambiguity for readers.
- Notation: the definition of the hyper-rectangle class and the precise normalization of the averaged iterates should be restated in the main theorem statement (rather than only in the preliminaries) for immediate readability.
- Proof of the moment bounds: several intermediate inequalities in the high-order moment analysis would benefit from one additional line of justification to make the passage from the last-iterate bound to the averaged process fully transparent.
- References: the introduction would be strengthened by citing the most recent high-dimensional CLT results for synchronous Q-learning or TD learning to better situate the novelty of the asynchronous martingale CLT.
Simulated Author's Rebuttal
We thank the referee for the careful reading of our manuscript and the positive overall assessment. We appreciate the recommendation for minor revision and will incorporate any suggested changes accordingly.
Circularity Check
No significant circularity
full rationale
The paper derives its finite-sample Gaussian approximation rate for Polyak-Ruppert averaged asynchronous Q-learning via a new high-dimensional CLT for martingale differences (of independent interest) together with high-order moment bounds on the last iterate. Both components are developed from standard probabilistic tools under the external assumption that the state-action-next-state sequence forms a uniformly geometrically ergodic Markov chain. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the argument is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the sequence of observed state-action-next-state triples forms a uniformly geometrically ergodic Markov chain.