On Gaussian approximation for entropy-regularized Q-learning with function approximation
Pith reviewed 2026-05-19 22:08 UTC · model grok-4.3
pith:7BXAU7YP Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{7BXAU7YP}
Prints a linked pith:7BXAU7YP badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Entropy-regularized Q-learning with linear function approximation yields a Gaussian approximation bound of order n to the minus one-fourth for Polyak-Ruppert averaged iterates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish a Gaussian approximation bound in the convex distance with rate of order n^{-1/4}, up to polylogarithmic factors in n, for the Polyak-Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and polynomial step-size k^{-ω} for ω in (1/2,1). The bound holds when the sequence of observed triples forms a uniformly geometrically ergodic Markov chain and suitable regularity conditions are satisfied for the projected soft Bellman equation. The proof combines a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term.
What carries the argument
Linearization of the soft Bellman recursion together with Gaussian approximation applied to its leading martingale term.
If this is right
- High-dimensional central limit theorem holds for the averaged iterates, giving explicit control on fluctuations in convex distance.
- High-order moment bounds are available for the algorithm's final iterate.
- The result applies to asynchronous updates driven by a single geometrically ergodic Markov chain of observations.
- Polynomial step sizes in the open interval (1/2,1) are covered by the same rate.
- The Gaussian approximation extends existing CLT statements from non-regularized to entropy-regularized Q-learning.
Where Pith is reading between the lines
- The same linearization-plus-martingale technique could be tested on nonlinear function approximators provided the projected operator remains contractive.
- Finite-sample concentration inequalities derived from the moment bounds might be combined with the CLT to produce practical confidence intervals for value estimates.
- The n^{-1/4} rate suggests that variance-reduction or multiple-trajectory averaging could improve the exponent in future refinements.
- The ergodicity assumption points to a natural next question: how the mixing time of the underlying Markov chain enters the implicit constants.
Load-bearing premise
The observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain and the projected soft Bellman equation must satisfy suitable regularity conditions.
What would settle it
Run the algorithm on a simple finite-state MDP with known transition kernel, collect the averaged iterates after n steps, and test whether their empirical distribution deviates from the predicted Gaussian by more than the claimed n^{-1/4} rate in convex distance.
read the original abstract
In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak--Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2,1)$. Assuming that the sequence of observed triples $(s_k,a_k,s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, and under suitable regularity conditions for the projected soft Bellman equation, we establish a Gaussian approximation bound in the convex distance with rate of order $n^{-1/4}$, up to polylogarithmic factors in $n$, where $n$ is the number of samples used by the algorithm. To obtain this result, we combine a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term. Finally, we derive high-order moment bounds for the algorithm's last iterate, which might be of independent interest.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of entropy-regularized asynchronous Q-learning with linear function approximation and polynomial stepsize k^{-ω} for ω ∈ (1/2,1). Under the assumption that observed triples (s_k, a_k, s_{k+1}) form a uniformly geometrically ergodic Markov chain and suitable regularity conditions on the projected soft Bellman equation, it establishes a Gaussian approximation bound in the convex distance with rate n^{-1/4} up to polylog factors in n. The proof combines linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term, and additionally derives high-order moment bounds for the algorithm's last iterate.
Significance. If the central claim holds, the result supplies a non-asymptotic Gaussian approximation result for averaged iterates in entropy-regularized Q-learning under linear function approximation, which is relevant for uncertainty quantification in high-dimensional RL settings. The explicit n^{-1/4} rate in convex distance (up to logs) and the separate high-order moment bounds for the last iterate are strengths that may be of independent interest. The approach of linearizing the projected soft Bellman operator and applying martingale CLT tools is technically standard but is applied here to this specific regularized asynchronous setting.
major comments (2)
- [§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.
- [Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.
minor comments (2)
- [Abstract] The abstract and introduction should clarify whether the polylog factors are explicit or hidden in the O(·) notation.
- [§2] Notation for the convex distance metric (Definition 2.3) could include a brief reminder of its relation to the usual Wasserstein or Kolmogorov distances for reader convenience.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate explicit verifications where needed.
read point-by-point responses
-
Referee: [§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.
Authors: We agree that an explicit uniform bound on the second-order remainder is required for a fully rigorous justification. While the regularity conditions on the projected soft Bellman equation and uniform geometric ergodicity are intended to control the relevant Lipschitz and smoothness properties, we will add a dedicated lemma in the revised §4.2 that derives a concrete O(·) bound on the quadratic remainder term in convex distance. This will use the entropy nonlinearity's bounded Hessian and the projection bias control under the Markov chain assumptions to confirm the remainder is strictly o(n^{-1/4}) with high probability. revision: yes
-
Referee: [Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.
Authors: We acknowledge the need for a more explicit error propagation argument. In the revised proof of Theorem 3.1 in §5, we will insert an intermediate proposition that decomposes the total convex-distance error, bounds the martingale approximation separately, and then shows that the accumulated remainder from linearization is o(n^{-1/4}) (up to polylog n factors) by leveraging the polynomial stepsize decay and the ergodicity to control the recursion. This will make the combination of linearization and Gaussian approximation fully rigorous under the given assumptions. revision: yes
Circularity Check
Derivation relies on standard martingale CLT and ergodicity; no reduction to inputs by construction
full rationale
The paper establishes the n^{-1/4} Gaussian approximation in convex distance for averaged iterates by linearizing the projected soft Bellman operator and applying a martingale central limit theorem to the leading term, under the external assumptions of uniform geometric ergodicity of the Markov chain and suitable regularity conditions on the projected equation. These steps invoke standard probabilistic tools (martingale CLT, mixing bounds) that are independent of the target bound and are not obtained by fitting parameters to the result itself or by self-referential definitions. No load-bearing step reduces the claimed rate to a fitted quantity, a renamed empirical pattern, or a self-citation chain whose validity depends on the present paper. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The sequence of observed triples (s_k,a_k,s_{k+1}) forms a uniformly geometrically ergodic Markov chain
- domain assumption Suitable regularity conditions for the projected soft Bellman equation
Reference graph
Works this paper leans on
-
[1]
Residual algorithms: Reinforcement learning with function approximation
Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. InMachine learning proceedings 1995, pages 30–37. Elsevier, 1995
work page 1995
-
[2]
Keith Ball. The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993
work page 1993
-
[3]
Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996
work page 1996
-
[4]
Gaussian approximation for two-timescale linear stochastic approximation
Bogdan Butyrin, Artemy Rubtsov, Alexey Naumov, Vladimir V Ulyanov, and Sergey Samsonov. Gaussian approximation for two-timescale linear stochastic approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36627–36635, 2026
work page 2026
-
[5]
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022
work page 2022
-
[6]
Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013
work page 2013
-
[7]
Springer Series in Operations Research and Financial Engineering
Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier.Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018
work page 2018
-
[8]
Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025
work page 2025
-
[9]
Tight high probability bounds for linear stochastic approximation with fixed stepsize
Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai. Tight high probability bounds for linear stochastic approximation with fixed stepsize. InAdvances in Neural Information Processing Systems, volume 34, pages 30063–30074. Curran Associates, Inc., 2021
work page 2021
-
[10]
Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018
work page 2018
-
[11]
Reinforcement learning with deep energy-based policies
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning, pages 1352–1361. PMLR, 2017
work page 2017
-
[12]
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018
work page 2018
-
[13]
Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018
work page 2018
-
[14]
M. Kaledin, E. Moulines, A. Naumov, V . Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. InConference On Learning Theory, 2020
work page 2020
-
[15]
Gen Li, Weichen Wu, Yuejie Chi, Cong Ma, Alessandro Rinaldo, and Yuting Wei. Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024. 12
work page 2024
-
[16]
A statistical analysis of polyak-ruppert averaged q-learning
Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023
work page 2023
-
[17]
Central Limit Theorems for Asynchronous Averaged Q-Learning
Xingtu Liu. Central limit theorems for asynchronous averaged q-learning.arXiv preprint arXiv:2509.18964, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
An analysis of reinforcement learning with function approximation
Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008
work page 2008
-
[19]
Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015
V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015
work page 2015
-
[20]
Ivan Nourdin, Giovanni Peccati, and Xiaochuan Yang. Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022
work page 2020
-
[21]
Osekowski.Sharp Martingale and Semimartingale Inequalities
A. Osekowski.Sharp Martingale and Semimartingale Inequalities. Monografie Matematyczne 72. Birkh¨auser Basel, 1 edition, 2012
work page 2012
-
[22]
Concentration inequalities for markov chains by marton couplings and spectral methods
Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic journal of probability, 20:79, 2015
work page 2015
-
[23]
Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992
work page 1992
-
[24]
Finite-time analysis of asynchronous stochastic approximation and q-learning
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and q-learning. InConference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3185–3205. PMLR, 2020
work page 2020
-
[25]
Gaussian Approximation for Asynchronous Q-learning
Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, and Alexey Naumov. Gaussian approximation for asynchronous q-learning.arXiv preprint arXiv:2604.07323, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[26]
Efficient estimations from a slowly convergent Robbins-Monro process
David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988
work page 1988
-
[27]
Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for polyak-ruppert averaged linear stochastic approximation with applications to td learning.Advances in Neural Information Processing Systems, 37:12408–12460, 2024
work page 2024
-
[28]
Statistical inference for linear stochastic approximation with Markovian noise
Sergey Samsonov, Marina Sheshukova, Eric Moulines, and Alexey Naumov. Statistical inference for linear stochastic approximation with Markovian noise. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026
work page 2026
-
[29]
Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022
work page 2022
-
[30]
Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent
Marina Sheshukova, Sergey Samsonov, Denis Belomestny, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for stochastic gradient descent.arXiv preprint arXiv:2502.06719, 2025. 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Vladimir Spokoiny and Mayya Zhilova. Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015
work page 2015
-
[32]
Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018
work page 2018
-
[33]
Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994
John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994
work page 1994
-
[34]
Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019
Martin J Wainwright. Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019
-
[35]
Christopher J. C. H. Watkins and Peter Dayan. Q-learning.Machine Learning, 8:279–292, 1992
work page 1992
-
[36]
arXiv preprint arXiv:2410.16106 , year=
Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation.arXiv preprint arXiv:2410.16106, 2024
-
[37]
Weichen Wu, Yuting Wei, and Alessandro Rinaldo. Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025
-
[38]
Chuhan Xie and Zhihua Zhang. A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022
work page 2022
-
[39]
Maximum entropy inverse reinforcement learning
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008
work page 2008
-
[40]
Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019. 14 A Notations Table 1: Notation used throughout the paper Notation Meaning Markov Decision Process MMarkov Decision Process (MDP) SState space,|S|=S AAction space,|A|=A rReward functi...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.