On Gaussian approximation for entropy-regularized Q-learning with function approximation

arxiv: 2605.17678 · v1 · pith:7BXAU7YPnew · submitted 2026-05-17 · 📊 stat.ML · cs.LG

On Gaussian approximation for entropy-regularized Q-learning with function approximation

Artemy Rubtsov , Rahul Singh , Eric Moulines , Alexey Naumov , Sergey Samsonov This is my paper

Pith reviewed 2026-05-19 22:08 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords Gaussian approximationQ-learningentropy regularizationfunction approximationcentral limit theoremPolyak-Ruppert averagingMarkov chainsreinforcement learning

0 comments p. Extension

pith:7BXAU7YP Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{7BXAU7YP}

Prints a linked pith:7BXAU7YP badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Entropy-regularized Q-learning with linear function approximation yields a Gaussian approximation bound of order n to the minus one-fourth for Polyak-Ruppert averaged iterates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives high-dimensional central limit theorem rates for the averaged iterates produced by entropy-regularized asynchronous Q-learning that uses linear function approximation and a polynomial step-size schedule. Under the assumption that observed state-action-next-state triples form a uniformly geometrically ergodic Markov chain, together with regularity conditions on the projected soft Bellman equation, it obtains a Gaussian approximation bound in convex distance at rate n to the minus one-fourth, up to polylog factors. A sympathetic reader would care because the result supplies explicit non-asymptotic control on the statistical fluctuations of the learning process in high dimensions, which in turn informs uncertainty quantification and finite-sample behavior for this family of algorithms. The derivation proceeds by linearizing the soft Bellman recursion and isolating a leading martingale term whose Gaussian approximation is controlled separately. The work also records high-order moment bounds on the last iterate as an auxiliary result.

Core claim

We establish a Gaussian approximation bound in the convex distance with rate of order n^{-1/4}, up to polylogarithmic factors in n, for the Polyak-Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and polynomial step-size k^{-ω} for ω in (1/2,1). The bound holds when the sequence of observed triples forms a uniformly geometrically ergodic Markov chain and suitable regularity conditions are satisfied for the projected soft Bellman equation. The proof combines a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term.

What carries the argument

Linearization of the soft Bellman recursion together with Gaussian approximation applied to its leading martingale term.

If this is right

High-dimensional central limit theorem holds for the averaged iterates, giving explicit control on fluctuations in convex distance.
High-order moment bounds are available for the algorithm's final iterate.
The result applies to asynchronous updates driven by a single geometrically ergodic Markov chain of observations.
Polynomial step sizes in the open interval (1/2,1) are covered by the same rate.
The Gaussian approximation extends existing CLT statements from non-regularized to entropy-regularized Q-learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linearization-plus-martingale technique could be tested on nonlinear function approximators provided the projected operator remains contractive.
Finite-sample concentration inequalities derived from the moment bounds might be combined with the CLT to produce practical confidence intervals for value estimates.
The n^{-1/4} rate suggests that variance-reduction or multiple-trajectory averaging could improve the exponent in future refinements.
The ergodicity assumption points to a natural next question: how the mixing time of the underlying Markov chain enters the implicit constants.

Load-bearing premise

The observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain and the projected soft Bellman equation must satisfy suitable regularity conditions.

What would settle it

Run the algorithm on a simple finite-state MDP with known transition kernel, collect the averaged iterates after n steps, and test whether their empirical distribution deviates from the predicted Gaussian by more than the claimed n^{-1/4} rate in convex distance.

read the original abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak--Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2,1)$. Assuming that the sequence of observed triples $(s_k,a_k,s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, and under suitable regularity conditions for the projected soft Bellman equation, we establish a Gaussian approximation bound in the convex distance with rate of order $n^{-1/4}$, up to polylogarithmic factors in $n$, where $n$ is the number of samples used by the algorithm. To obtain this result, we combine a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term. Finally, we derive high-order moment bounds for the algorithm's last iterate, which might be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives an n^{-1/4} Gaussian approximation rate in convex distance for averaged entropy-regularized Q-learning iterates under linear FA, polynomial stepsizes, and geometric ergodicity, via linearization plus martingale CLT.

read the letter

The main takeaway is a concrete convergence rate for the Polyak-Ruppert averaged iterates of this specific algorithm family. They linearize the projected soft Bellman operator, treat the leading term as a martingale, and apply Gaussian approximation tools to reach the n^{-1/4} bound (up to logs) in convex distance. They also supply high-order moment bounds on the last iterate as a side result. That combination is the actual new piece; it is a direct extension of existing CLT techniques rather than a broad new framework, but it fills in the entropy-regularized asynchronous case with linear approximation and the stated stepsize range. The assumptions are standard: uniform geometric ergodicity of the observed state-action-next-state triples plus regularity on the projected equation. The work is technically grounded and the citation pattern looks appropriate for this corner of stochastic approximation. The soft spot is exactly the one flagged in the stress-test note. The linearization remainder must be shown o(n^{-1/4}) uniformly in the convex metric; the abstract only gestures at “suitable regularity conditions,” and geometric ergodicity alone does not automatically control second-order terms from the entropy nonlinearity or projection bias. If the full proof does not close that gap tightly, the rate claim weakens. This paper is for people who work on rates for regularized RL with function approximation. A reader already comfortable with martingale CLTs and soft Bellman operators will extract the most value. It is honest, incremental progress and deserves a serious referee so the remainder control can be checked in detail.

Referee Report

2 major / 2 minor

Summary. The paper derives rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of entropy-regularized asynchronous Q-learning with linear function approximation and polynomial stepsize k^{-ω} for ω ∈ (1/2,1). Under the assumption that observed triples (s_k, a_k, s_{k+1}) form a uniformly geometrically ergodic Markov chain and suitable regularity conditions on the projected soft Bellman equation, it establishes a Gaussian approximation bound in the convex distance with rate n^{-1/4} up to polylog factors in n. The proof combines linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term, and additionally derives high-order moment bounds for the algorithm's last iterate.

Significance. If the central claim holds, the result supplies a non-asymptotic Gaussian approximation result for averaged iterates in entropy-regularized Q-learning under linear function approximation, which is relevant for uncertainty quantification in high-dimensional RL settings. The explicit n^{-1/4} rate in convex distance (up to logs) and the separate high-order moment bounds for the last iterate are strengths that may be of independent interest. The approach of linearizing the projected soft Bellman operator and applying martingale CLT tools is technically standard but is applied here to this specific regularized asynchronous setting.

major comments (2)

[§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.
[Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.

minor comments (2)

[Abstract] The abstract and introduction should clarify whether the polylog factors are explicit or hidden in the O(·) notation.
[§2] Notation for the convex distance metric (Definition 2.3) could include a brief reminder of its relation to the usual Wasserstein or Kolmogorov distances for reader convenience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate explicit verifications where needed.

read point-by-point responses

Referee: [§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.

Authors: We agree that an explicit uniform bound on the second-order remainder is required for a fully rigorous justification. While the regularity conditions on the projected soft Bellman equation and uniform geometric ergodicity are intended to control the relevant Lipschitz and smoothness properties, we will add a dedicated lemma in the revised §4.2 that derives a concrete O(·) bound on the quadratic remainder term in convex distance. This will use the entropy nonlinearity's bounded Hessian and the projection bias control under the Markov chain assumptions to confirm the remainder is strictly o(n^{-1/4}) with high probability. revision: yes
Referee: [Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.

Authors: We acknowledge the need for a more explicit error propagation argument. In the revised proof of Theorem 3.1 in §5, we will insert an intermediate proposition that decomposes the total convex-distance error, bounds the martingale approximation separately, and then shows that the accumulated remainder from linearization is o(n^{-1/4}) (up to polylog n factors) by leveraging the polynomial stepsize decay and the ergodicity to control the recursion. This will make the combination of linearization and Gaussian approximation fully rigorous under the given assumptions. revision: yes

Circularity Check

0 steps flagged

Derivation relies on standard martingale CLT and ergodicity; no reduction to inputs by construction

full rationale

The paper establishes the n^{-1/4} Gaussian approximation in convex distance for averaged iterates by linearizing the projected soft Bellman operator and applying a martingale central limit theorem to the leading term, under the external assumptions of uniform geometric ergodicity of the Markov chain and suitable regularity conditions on the projected equation. These steps invoke standard probabilistic tools (martingale CLT, mixing bounds) that are independent of the target bound and are not obtained by fitting parameters to the result itself or by self-referential definitions. No load-bearing step reduces the claimed rate to a fitted quantity, a renamed empirical pattern, or a self-citation chain whose validity depends on the present paper. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result depends on standard domain assumptions about Markov chain mixing and Bellman equation regularity; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption The sequence of observed triples (s_k,a_k,s_{k+1}) forms a uniformly geometrically ergodic Markov chain
Invoked to control the dependence in the stochastic process for the CLT.
domain assumption Suitable regularity conditions for the projected soft Bellman equation
Required to justify the linearization step and Gaussian approximation of the martingale term.

pith-pipeline@v0.9.0 · 5714 in / 1369 out tokens · 45572 ms · 2026-05-19T22:08:20.724234+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

Residual algorithms: Reinforcement learning with function approximation

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. InMachine learning proceedings 1995, pages 30–37. Elsevier, 1995

work page 1995
[2]

The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

Keith Ball. The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

work page 1993
[3]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

work page 1996
[4]

Gaussian approximation for two-timescale linear stochastic approximation

Bogdan Butyrin, Artemy Rubtsov, Alexey Naumov, Vladimir V Ulyanov, and Sergey Samsonov. Gaussian approximation for two-timescale linear stochastic approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36627–36635, 2026

work page 2026
[5]

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022
[6]

Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

work page 2013
[7]

Springer Series in Operations Research and Financial Engineering

Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier.Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

work page 2018
[8]

Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

work page 2025
[9]

Tight high probability bounds for linear stochastic approximation with fixed stepsize

Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai. Tight high probability bounds for linear stochastic approximation with fixed stepsize. InAdvances in Neural Information Processing Systems, volume 34, pages 30063–30074. Curran Associates, Inc., 2021

work page 2021
[10]

Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

work page 2018
[11]

Reinforcement learning with deep energy-based policies

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning, pages 1352–1361. PMLR, 2017

work page 2017
[12]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018
[13]

Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018
[14]

Kaledin, E

M. Kaledin, E. Moulines, A. Naumov, V . Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. InConference On Learning Theory, 2020

work page 2020
[15]

Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

Gen Li, Weichen Wu, Yuejie Chi, Cong Ma, Alessandro Rinaldo, and Yuting Wei. Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024. 12

work page 2024
[16]

A statistical analysis of polyak-ruppert averaged q-learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

work page 2023
[17]

Central Limit Theorems for Asynchronous Averaged Q-Learning

Xingtu Liu. Central limit theorems for asynchronous averaged q-learning.arXiv preprint arXiv:2509.18964, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

An analysis of reinforcement learning with function approximation

Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

work page 2008
[19]

Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

work page 2015
[20]

Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

Ivan Nourdin, Giovanni Peccati, and Xiaochuan Yang. Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

work page 2020
[21]

Osekowski.Sharp Martingale and Semimartingale Inequalities

A. Osekowski.Sharp Martingale and Semimartingale Inequalities. Monografie Matematyczne 72. Birkh¨auser Basel, 1 edition, 2012

work page 2012
[22]

Concentration inequalities for markov chains by marton couplings and spectral methods

Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic journal of probability, 20:79, 2015

work page 2015
[23]

Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

work page 1992
[24]

Finite-time analysis of asynchronous stochastic approximation and q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and q-learning. InConference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3185–3205. PMLR, 2020

work page 2020
[25]

Gaussian Approximation for Asynchronous Q-learning

Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, and Alexey Naumov. Gaussian approximation for asynchronous q-learning.arXiv preprint arXiv:2604.07323, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Efficient estimations from a slowly convergent Robbins-Monro process

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

work page 1988
[27]

Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for polyak-ruppert averaged linear stochastic approximation with applications to td learning.Advances in Neural Information Processing Systems, 37:12408–12460, 2024

work page 2024
[28]

Statistical inference for linear stochastic approximation with Markovian noise

Sergey Samsonov, Marina Sheshukova, Eric Moulines, and Alexey Naumov. Statistical inference for linear stochastic approximation with Markovian noise. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[29]

Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

work page 2022
[30]

Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Marina Sheshukova, Sergey Samsonov, Denis Belomestny, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for stochastic gradient descent.arXiv preprint arXiv:2502.06719, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

Vladimir Spokoiny and Mayya Zhilova. Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

work page 2015
[32]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018
[33]

Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

work page 1994
[34]

Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

Martin J Wainwright. Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

work page arXiv 1906
[35]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning.Machine Learning, 8:279–292, 1992

work page 1992
[36]

arXiv preprint arXiv:2410.16106 , year=

Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation.arXiv preprint arXiv:2410.16106, 2024

work page arXiv 2024
[37]

Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

Weichen Wu, Yuting Wei, and Alessandro Rinaldo. Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

work page arXiv 2025
[38]

A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

Chuhan Xie and Zhihua Zhang. A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

work page 2022
[39]

Maximum entropy inverse reinforcement learning

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

work page 2008
[40]

Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019

Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019. 14 A Notations Table 1: Notation used throughout the paper Notation Meaning Markov Decision Process MMarkov Decision Process (MDP) SState space,|S|=S AAction space,|A|=A rReward functi...

work page 2019

[1] [1]

Residual algorithms: Reinforcement learning with function approximation

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. InMachine learning proceedings 1995, pages 30–37. Elsevier, 1995

work page 1995

[2] [2]

The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

Keith Ball. The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

work page 1993

[3] [3]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

work page 1996

[4] [4]

Gaussian approximation for two-timescale linear stochastic approximation

Bogdan Butyrin, Artemy Rubtsov, Alexey Naumov, Vladimir V Ulyanov, and Sergey Samsonov. Gaussian approximation for two-timescale linear stochastic approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36627–36635, 2026

work page 2026

[5] [5]

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

work page 2022

[6] [6]

Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

work page 2013

[7] [7]

Springer Series in Operations Research and Financial Engineering

Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier.Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

work page 2018

[8] [8]

Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

work page 2025

[9] [9]

Tight high probability bounds for linear stochastic approximation with fixed stepsize

Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai. Tight high probability bounds for linear stochastic approximation with fixed stepsize. InAdvances in Neural Information Processing Systems, volume 34, pages 30063–30074. Curran Associates, Inc., 2021

work page 2021

[10] [10]

Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

work page 2018

[11] [11]

Reinforcement learning with deep energy-based policies

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning, pages 1352–1361. PMLR, 2017

work page 2017

[12] [12]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018

[13] [13]

Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018

[14] [14]

Kaledin, E

M. Kaledin, E. Moulines, A. Naumov, V . Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. InConference On Learning Theory, 2020

work page 2020

[15] [15]

Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

Gen Li, Weichen Wu, Yuejie Chi, Cong Ma, Alessandro Rinaldo, and Yuting Wei. Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024. 12

work page 2024

[16] [16]

A statistical analysis of polyak-ruppert averaged q-learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

work page 2023

[17] [17]

Central Limit Theorems for Asynchronous Averaged Q-Learning

Xingtu Liu. Central limit theorems for asynchronous averaged q-learning.arXiv preprint arXiv:2509.18964, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

An analysis of reinforcement learning with function approximation

Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

work page 2008

[19] [19]

Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

work page 2015

[20] [20]

Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

Ivan Nourdin, Giovanni Peccati, and Xiaochuan Yang. Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

work page 2020

[21] [21]

Osekowski.Sharp Martingale and Semimartingale Inequalities

A. Osekowski.Sharp Martingale and Semimartingale Inequalities. Monografie Matematyczne 72. Birkh¨auser Basel, 1 edition, 2012

work page 2012

[22] [22]

Concentration inequalities for markov chains by marton couplings and spectral methods

Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic journal of probability, 20:79, 2015

work page 2015

[23] [23]

Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

work page 1992

[24] [24]

Finite-time analysis of asynchronous stochastic approximation and q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and q-learning. InConference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3185–3205. PMLR, 2020

work page 2020

[25] [25]

Gaussian Approximation for Asynchronous Q-learning

Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, and Alexey Naumov. Gaussian approximation for asynchronous q-learning.arXiv preprint arXiv:2604.07323, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[26] [26]

Efficient estimations from a slowly convergent Robbins-Monro process

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

work page 1988

[27] [27]

Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for polyak-ruppert averaged linear stochastic approximation with applications to td learning.Advances in Neural Information Processing Systems, 37:12408–12460, 2024

work page 2024

[28] [28]

Statistical inference for linear stochastic approximation with Markovian noise

Sergey Samsonov, Marina Sheshukova, Eric Moulines, and Alexey Naumov. Statistical inference for linear stochastic approximation with Markovian noise. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026

[29] [29]

Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

work page 2022

[30] [30]

Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Marina Sheshukova, Sergey Samsonov, Denis Belomestny, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for stochastic gradient descent.arXiv preprint arXiv:2502.06719, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

Vladimir Spokoiny and Mayya Zhilova. Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

work page 2015

[32] [32]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

work page 2018

[33] [33]

Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

work page 1994

[34] [34]

Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

Martin J Wainwright. Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

work page arXiv 1906

[35] [35]

Christopher J. C. H. Watkins and Peter Dayan. Q-learning.Machine Learning, 8:279–292, 1992

work page 1992

[36] [36]

arXiv preprint arXiv:2410.16106 , year=

Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation.arXiv preprint arXiv:2410.16106, 2024

work page arXiv 2024

[37] [37]

Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

Weichen Wu, Yuting Wei, and Alessandro Rinaldo. Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

work page arXiv 2025

[38] [38]

A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

Chuhan Xie and Zhihua Zhang. A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

work page 2022

[39] [39]

Maximum entropy inverse reinforcement learning

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

work page 2008

[40] [40]

Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019

Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019. 14 A Notations Table 1: Notation used throughout the paper Notation Meaning Markov Decision Process MMarkov Decision Process (MDP) SState space,|S|=S AAction space,|A|=A rReward functi...

work page 2019