pith. machine review for the scientific record

arXiv: 2604.07323 · v1 · submitted 2026-04-08 · 📊 stat.ML · cs.LG · math.PR

Recognition: no theorem link

Gaussian Approximation for Asynchronous Q-learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.PR
keywords asynchronous Q-learning · high-dimensional CLT · martingale differences · Polyak-Ruppert averaging · convergence rates · geometric ergodicity · reinforcement learning · stochastic approximation

The pith

Averaged asynchronous Q-learning satisfies a high-dimensional central limit theorem with error rate up to n^{-1/6} log^4(nSA) over hyper-rectangles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit rates showing how closely the Polyak-Ruppert averaged iterates of asynchronous Q-learning match a multivariate normal distribution in high dimensions. The result applies to the algorithm run with polynomial stepsizes between k^{-1} and k^{-1/2}, and it holds uniformly over all axis-aligned hyper-rectangles in the space of Q-values. A sympathetic reader would care because the rate supplies a concrete error bound that can be used to justify asymptotic normality for downstream tasks such as constructing confidence sets for value functions or policies. The proof proceeds by establishing a new high-dimensional central limit theorem for sums of martingale differences that matches the structure of the Q-learning updates.
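To fix ideas, here is a minimal sketch of the algorithm class the paper analyzes: asynchronous Q-learning with a polynomial stepsize k^{-ω} and a running Polyak-Ruppert average. The toy MDP, the uniform behavior policy, and all constants below are invented for illustration; only the update rule and the averaging mirror the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: P[s, a] is a next-state distribution,
# R[s, a] an expected reward. Sizes and constants are invented;
# omega must lie in (1/2, 1] for the paper's regime.
S, A, gamma, omega = 3, 2, 0.9, 0.7
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))  # Polyak-Ruppert average of the iterates
s = 0
for k in range(1, 5001):
    a = int(rng.integers(A))                # behavior policy: uniform
    s_next = int(rng.choice(S, p=P[s, a]))
    target = R[s, a] + gamma * Q[s_next].max()
    # Asynchronous update: only the visited (s, a) entry moves,
    # with polynomial stepsize k^{-omega}.
    Q[s, a] += k ** (-omega) * (target - Q[s, a])
    Q_bar += (Q - Q_bar) / k                # running average of iterates
    s = s_next
```

The CLT in the paper concerns the distribution of `Q_bar`, not of the noisier last iterate `Q`.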

Core claim

Under the assumption that the sequence of state-action-next-state triples forms a uniformly geometrically ergodic Markov chain, the Polyak-Ruppert averaged iterates of asynchronous Q-learning with stepsize k^{-ω} for ω in (1/2,1] obey a high-dimensional central limit theorem whose approximation error, measured in the supremum norm over hyper-rectangles, is of order n^{-1/6} log^4(nSA). The authors obtain this bound by proving a general high-dimensional CLT for martingale difference sequences and then verifying that the Q-learning noise satisfies the required conditions; they additionally derive bounds on high-order moments of the algorithm's final (non-averaged) iterate.
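Schematically, with $\bar{Q}_n$ the averaged iterate, $Q^{\star}$ the optimal Q-function, $\Sigma$ the asymptotic covariance, and $\mathcal{R}$ the class of axis-aligned hyper-rectangles in $\mathbb{R}^{SA}$ (exact centering, scaling, and constants are as defined in the paper), the claim reads:

```latex
\sup_{B \in \mathcal{R}} \left| \mathbb{P}\!\left( \sqrt{n}\,\bigl(\bar{Q}_n - Q^{\star}\bigr) \in B \right)
  - \mathbb{P}\!\left( \mathcal{N}(0, \Sigma) \in B \right) \right|
  \;\lesssim\; n^{-1/6} \log^{4}(n S A).
```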

What carries the argument

High-dimensional central limit theorem for sums of martingale differences, applied to control the distance between the distribution of the averaged Q-learning iterates and the normal distribution.

If this is right

  • The averaged Q-values may be treated as asymptotically normal, permitting construction of approximate confidence regions whose coverage error vanishes at rate n^{-1/6} up to logarithmic factors.
  • Explicit logarithmic factors in the bound make the result usable for finite-sample guidance when the number of states and actions is large.
  • Moment bounds on the last iterate supply auxiliary control that can be combined with the CLT for non-asymptotic tail estimates.
  • The new martingale CLT stands as a reusable tool for other stochastic approximation schemes whose noise satisfies similar dependence and moment conditions.
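As an illustration of the first point above, a per-entry confidence interval follows directly from the normal approximation. This sketch assumes a plug-in estimate `sigma_hat` of the entry's asymptotic standard deviation; how to estimate that quantity is not specified by this summary.

```python
from math import sqrt

def q_confidence_interval(q_bar_entry, sigma_hat, n, alpha=0.05):
    """Normal-approximation CI for one averaged Q-value.

    sigma_hat is a hypothetical plug-in estimate of the entry's
    asymptotic standard deviation; the CLT justifies the normal
    quantile only up to the n^{-1/6} log^4(nSA) coverage error.
    """
    if alpha != 0.05:
        raise NotImplementedError("only the 95% quantile is hard-coded")
    z = 1.959963984540054                  # standard normal 97.5% quantile
    half_width = z * sigma_hat / sqrt(n)
    return q_bar_entry - half_width, q_bar_entry + half_width

lo, hi = q_confidence_interval(2.0, sigma_hat=0.5, n=10_000)
```

Joint regions over many (s, a) entries would instead use the hyper-rectangle bound directly, with quantiles of the maximum coordinate of the limiting Gaussian.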

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rate suggests that simple averaging suffices for asymptotic inference in large tabular reinforcement-learning problems provided geometric ergodicity holds.
  • Similar arguments could be tested on other single-time-scale algorithms such as SARSA or fitted value iteration under comparable mixing assumptions.
  • Empirical checks on synthetic chains with tunable mixing rates would reveal whether the log^4 factor is sharp or can be improved.

Load-bearing premise

The sequence of observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain.
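In the standard formulation (notation ours), uniform geometric ergodicity of the chain $Z_k = (s_k, a_k, s_{k+1})$ with transition kernel $P$ and stationary law $\pi$ asks for constants $C < \infty$ and $\rho \in (0,1)$ such that

```latex
\sup_{z} \, \bigl\| P^{k}(z, \cdot) - \pi \bigr\|_{\mathrm{TV}} \;\le\; C\,\rho^{k} \qquad \text{for all } k \ge 0,
```

with the supremum over initial states making the bound uniform.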

What would settle it

Simulate asynchronous Q-learning on a Markov chain whose mixing time grows faster than any geometric rate and check whether the supremum error over hyper-rectangles fails to decay at the claimed n^{-1/6} order.
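A minimal one-dimensional version of that protocol: replicate a Polyak-Ruppert-style average on a sticky two-state chain (mixing slows as `p_stay` approaches 1), standardize across replications, and measure the sup distance to the normal CDF, the 1-d analogue of the hyper-rectangle metric. The chain, sample sizes, and replication counts are invented; this only illustrates the shape of the experiment, not the paper's setup.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def ks_to_normal(samples):
    """Sup distance between the empirical CDF of the standardized
    samples and the standard normal CDF."""
    z = np.sort((samples - samples.mean()) / samples.std())
    phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    ecdf = np.arange(1, z.size + 1) / z.size
    return float(np.max(np.abs(ecdf - phi)))

def averaged_reward(n, p_stay=0.9):
    """Sample mean of the state indicator along a sticky two-state
    chain; mixing slows as p_stay -> 1."""
    s, total = 0, 0.0
    for _ in range(n):
        if rng.random() > p_stay:
            s = 1 - s
        total += s
    return total / n

for n in (100, 1000):
    reps = np.array([averaged_reward(n) for _ in range(300)])
    print(n, round(ks_to_normal(reps), 3))
```

Sweeping `p_stay` toward 1 (or replacing the chain with one whose mixing time grows with n) and tracking how the sup distance decays in n is the check the paper's claim invites.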

read the original abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega},\, \omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nS A)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm's last iterate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript derives finite-sample rates in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of asynchronous Q-learning using polynomial stepsizes k^{-ω} with ω ∈ (1/2, 1]. Under the assumption that the sequence (s_k, a_k, s_{k+1}) forms a uniformly geometrically ergodic Markov chain, the authors establish a Gaussian approximation rate of order up to n^{-1/6} log^4(n S A) over hyper-rectangles, where n is the number of samples. The argument proceeds by proving a new high-dimensional CLT for sums of martingale differences (of independent interest) together with high-order moment bounds on the algorithm's last iterate.

Significance. If the central claims hold, the paper supplies explicit finite-sample guarantees for asynchronous Q-learning in high dimensions, which is a meaningful advance for reinforcement learning theory. The new martingale-difference CLT and the moment bounds for the last iterate are technically substantive and may find use in related stochastic approximation settings. The explicit handling of the polynomial stepsize regime and the logarithmic dependence on the state-action space size are strengths.

minor comments (4)
  1. Abstract: the qualifier 'up to' in the rate n^{-1/6} log^4(nSA) should be accompanied by a brief statement of the precise range of ω for which this rate is attained, to avoid ambiguity for readers.
  2. Notation: the definition of the hyper-rectangle class and the precise normalization of the averaged iterates should be restated in the main theorem statement (rather than only in the preliminaries) for immediate readability.
  3. Proof of the moment bounds: several intermediate inequalities in the high-order moment analysis would benefit from one additional line of justification to make the passage from the last-iterate bound to the averaged process fully transparent.
  4. References: the introduction would be strengthened by citing the most recent high-dimensional CLT results for synchronous Q-learning or TD learning to better situate the novelty of the asynchronous martingale CLT.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the positive overall assessment. We appreciate the recommendation for minor revision and will incorporate any suggested changes accordingly.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives its finite-sample Gaussian approximation rate for Polyak-Ruppert averaged asynchronous Q-learning via a new high-dimensional CLT for martingale differences (of independent interest) together with high-order moment bounds on the last iterate. Both components are developed from standard probabilistic tools under the external assumption that the state-action-next-state sequence forms a uniformly geometrically ergodic Markov chain. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the argument is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about the Markov chain; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The sequence of state-action-next-state triples forms a uniformly geometrically ergodic Markov chain
    Invoked to control mixing and obtain the stated CLT rate for the averaged iterates.

pith-pipeline@v0.9.0 · 5478 in / 1169 out tokens · 56676 ms · 2026-05-10T17:07:22.714612+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 15 canonical work pages · 2 internal anchors
