pith. sign in

arxiv: 2605.17678 · v1 · pith:7BXAU7YPnew · submitted 2026-05-17 · 📊 stat.ML · cs.LG

On Gaussian approximation for entropy-regularized Q-learning with function approximation

Pith reviewed 2026-05-19 22:08 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords Gaussian approximationQ-learningentropy regularizationfunction approximationcentral limit theoremPolyak-Ruppert averagingMarkov chainsreinforcement learning
0
0 comments X p. Extension
pith:7BXAU7YP Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{7BXAU7YP}

Prints a linked pith:7BXAU7YP badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Entropy-regularized Q-learning with linear function approximation yields a Gaussian approximation bound of order n to the minus one-fourth for Polyak-Ruppert averaged iterates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives high-dimensional central limit theorem rates for the averaged iterates produced by entropy-regularized asynchronous Q-learning that uses linear function approximation and a polynomial step-size schedule. Under the assumption that observed state-action-next-state triples form a uniformly geometrically ergodic Markov chain, together with regularity conditions on the projected soft Bellman equation, it obtains a Gaussian approximation bound in convex distance at rate n to the minus one-fourth, up to polylog factors. A sympathetic reader would care because the result supplies explicit non-asymptotic control on the statistical fluctuations of the learning process in high dimensions, which in turn informs uncertainty quantification and finite-sample behavior for this family of algorithms. The derivation proceeds by linearizing the soft Bellman recursion and isolating a leading martingale term whose Gaussian approximation is controlled separately. The work also records high-order moment bounds on the last iterate as an auxiliary result.

Core claim

We establish a Gaussian approximation bound in the convex distance with rate of order n^{-1/4}, up to polylogarithmic factors in n, for the Polyak-Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and polynomial step-size k^{-ω} for ω in (1/2,1). The bound holds when the sequence of observed triples forms a uniformly geometrically ergodic Markov chain and suitable regularity conditions are satisfied for the projected soft Bellman equation. The proof combines a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term.

What carries the argument

Linearization of the soft Bellman recursion together with Gaussian approximation applied to its leading martingale term.

If this is right

  • High-dimensional central limit theorem holds for the averaged iterates, giving explicit control on fluctuations in convex distance.
  • High-order moment bounds are available for the algorithm's final iterate.
  • The result applies to asynchronous updates driven by a single geometrically ergodic Markov chain of observations.
  • Polynomial step sizes in the open interval (1/2,1) are covered by the same rate.
  • The Gaussian approximation extends existing CLT statements from non-regularized to entropy-regularized Q-learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linearization-plus-martingale technique could be tested on nonlinear function approximators provided the projected operator remains contractive.
  • Finite-sample concentration inequalities derived from the moment bounds might be combined with the CLT to produce practical confidence intervals for value estimates.
  • The n^{-1/4} rate suggests that variance-reduction or multiple-trajectory averaging could improve the exponent in future refinements.
  • The ergodicity assumption points to a natural next question: how the mixing time of the underlying Markov chain enters the implicit constants.

Load-bearing premise

The observed state-action-next-state triples must form a uniformly geometrically ergodic Markov chain and the projected soft Bellman equation must satisfy suitable regularity conditions.

What would settle it

Run the algorithm on a simple finite-state MDP with known transition kernel, collect the averaged iterates after n steps, and test whether their empirical distribution deviates from the predicted Gaussian by more than the claimed n^{-1/4} rate in convex distance.

read the original abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak--Ruppert averaged iterates generated by entropy-regularized asynchronous Q-learning with linear function approximation and a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2,1)$. Assuming that the sequence of observed triples $(s_k,a_k,s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, and under suitable regularity conditions for the projected soft Bellman equation, we establish a Gaussian approximation bound in the convex distance with rate of order $n^{-1/4}$, up to polylogarithmic factors in $n$, where $n$ is the number of samples used by the algorithm. To obtain this result, we combine a linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term. Finally, we derive high-order moment bounds for the algorithm's last iterate, which might be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates of entropy-regularized asynchronous Q-learning with linear function approximation and polynomial stepsize k^{-ω} for ω ∈ (1/2,1). Under the assumption that observed triples (s_k, a_k, s_{k+1}) form a uniformly geometrically ergodic Markov chain and suitable regularity conditions on the projected soft Bellman equation, it establishes a Gaussian approximation bound in the convex distance with rate n^{-1/4} up to polylog factors in n. The proof combines linearization of the soft Bellman recursion with a Gaussian approximation for the leading martingale term, and additionally derives high-order moment bounds for the algorithm's last iterate.

Significance. If the central claim holds, the result supplies a non-asymptotic Gaussian approximation result for averaged iterates in entropy-regularized Q-learning under linear function approximation, which is relevant for uncertainty quantification in high-dimensional RL settings. The explicit n^{-1/4} rate in convex distance (up to logs) and the separate high-order moment bounds for the last iterate are strengths that may be of independent interest. The approach of linearizing the projected soft Bellman operator and applying martingale CLT tools is technically standard but is applied here to this specific regularized asynchronous setting.

major comments (2)
  1. [§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.
  2. [Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.
minor comments (2)
  1. [Abstract] The abstract and introduction should clarify whether the polylog factors are explicit or hidden in the O(·) notation.
  2. [§2] Notation for the convex distance metric (Definition 2.3) could include a brief reminder of its relation to the usual Wasserstein or Kolmogorov distances for reader convenience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate explicit verifications where needed.

read point-by-point responses
  1. Referee: [§4.2] §4.2, the linearization step around the projected soft Bellman fixed point: the argument claims the remainder after first-order expansion is o(n^{-1/4}) in the convex metric, but the stated regularity conditions on the projected equation only control the first-order term; a uniform bound on the second-order remainder (arising from the entropy nonlinearity or projection bias) must be verified explicitly to ensure it does not reach order n^{-1/4} or larger under the geometric ergodicity assumption alone.

    Authors: We agree that an explicit uniform bound on the second-order remainder is required for a fully rigorous justification. While the regularity conditions on the projected soft Bellman equation and uniform geometric ergodicity are intended to control the relevant Lipschitz and smoothness properties, we will add a dedicated lemma in the revised §4.2 that derives a concrete O(·) bound on the quadratic remainder term in convex distance. This will use the entropy nonlinearity's bounded Hessian and the projection bias control under the Markov chain assumptions to confirm the remainder is strictly o(n^{-1/4}) with high probability. revision: yes

  2. Referee: [Theorem 3.1, §5] Theorem 3.1 and its proof in §5: the Gaussian approximation for the martingale term is combined with the linearization, but the paper does not provide a concrete verification that the convex-distance error from the remainder is strictly smaller than the target rate; without this, the claimed bound does not follow from the stated assumptions.

    Authors: We acknowledge the need for a more explicit error propagation argument. In the revised proof of Theorem 3.1 in §5, we will insert an intermediate proposition that decomposes the total convex-distance error, bounds the martingale approximation separately, and then shows that the accumulated remainder from linearization is o(n^{-1/4}) (up to polylog n factors) by leveraging the polynomial stepsize decay and the ergodicity to control the recursion. This will make the combination of linearization and Gaussian approximation fully rigorous under the given assumptions. revision: yes

Circularity Check

0 steps flagged

Derivation relies on standard martingale CLT and ergodicity; no reduction to inputs by construction

full rationale

The paper establishes the n^{-1/4} Gaussian approximation in convex distance for averaged iterates by linearizing the projected soft Bellman operator and applying a martingale central limit theorem to the leading term, under the external assumptions of uniform geometric ergodicity of the Markov chain and suitable regularity conditions on the projected equation. These steps invoke standard probabilistic tools (martingale CLT, mixing bounds) that are independent of the target bound and are not obtained by fitting parameters to the result itself or by self-referential definitions. No load-bearing step reduces the claimed rate to a fitted quantity, a renamed empirical pattern, or a self-citation chain whose validity depends on the present paper. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result depends on standard domain assumptions about Markov chain mixing and Bellman equation regularity; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The sequence of observed triples (s_k,a_k,s_{k+1}) forms a uniformly geometrically ergodic Markov chain
    Invoked to control the dependence in the stochastic process for the CLT.
  • domain assumption Suitable regularity conditions for the projected soft Bellman equation
    Required to justify the linearization step and Gaussian approximation of the martingale term.

pith-pipeline@v0.9.0 · 5714 in / 1369 out tokens · 45572 ms · 2026-05-19T22:08:20.724234+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    Residual algorithms: Reinforcement learning with function approximation

    Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. InMachine learning proceedings 1995, pages 30–37. Elsevier, 1995

  2. [2]

    The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

    Keith Ball. The reverse isoperimetric problem for gaussian measure.Discrete & Computational Geometry, 10(4):411–420, 1993

  3. [3]

    Bertsekas and John N

    Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

  4. [4]

    Gaussian approximation for two-timescale linear stochastic approximation

    Bogdan Butyrin, Artemy Rubtsov, Alexey Naumov, Vladimir V Ulyanov, and Sergey Samsonov. Gaussian approximation for two-timescale linear stochastic approximation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36627–36635, 2026

  5. [5]

    Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

    Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

  6. [6]

    Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

    Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors.The Annals of Statistics, 41(6):2786– 2819, 2013

  7. [7]

    Springer Series in Operations Research and Financial Engineering

    Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier.Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

  8. [8]

    Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

    Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation.Mathematics of Operations Research, 50(2):935–964, 2025

  9. [9]

    Tight high probability bounds for linear stochastic approximation with fixed stepsize

    Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai. Tight high probability bounds for linear stochastic approximation with fixed stepsize. InAdvances in Neural Information Processing Systems, volume 34, pages 30063–30074. Curran Associates, Inc., 2021

  10. [10]

    Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

    Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research, 19(78):1–21, 2018

  11. [11]

    Reinforcement learning with deep energy-based policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning, pages 1352–1361. PMLR, 2017

  12. [12]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

  13. [13]

    Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

    Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

  14. [14]

    Kaledin, E

    M. Kaledin, E. Moulines, A. Naumov, V . Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with markovian noise. InConference On Learning Theory, 2020

  15. [15]

    Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024

    Gen Li, Weichen Wu, Yuejie Chi, Cong Ma, Alessandro Rinaldo, and Yuting Wei. Is q-learning minimax optimal? a tight sample complexity analysis.Operations Research, 72(1):222–236, 2024. 12

  16. [16]

    A statistical analysis of polyak-ruppert averaged q-learning

    Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

  17. [17]

    Central Limit Theorems for Asynchronous Averaged Q-Learning

    Xingtu Liu. Central limit theorems for asynchronous averaged q-learning.arXiv preprint arXiv:2509.18964, 2025

  18. [18]

    An analysis of reinforcement learning with function approximation

    Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

  19. [19]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  20. [20]

    Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

    Ivan Nourdin, Giovanni Peccati, and Xiaochuan Yang. Multivariate normal approximation on the wiener space: new bounds in the convex distance.Journal of Theoretical Probability, 35(3):2020–2037, 2022

  21. [21]

    Osekowski.Sharp Martingale and Semimartingale Inequalities

    A. Osekowski.Sharp Martingale and Semimartingale Inequalities. Monografie Matematyczne 72. Birkh¨auser Basel, 1 edition, 2012

  22. [22]

    Concentration inequalities for markov chains by marton couplings and spectral methods

    Daniel Paulin. Concentration inequalities for markov chains by marton couplings and spectral methods. Electronic journal of probability, 20:79, 2015

  23. [23]

    Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

    Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

  24. [24]

    Finite-time analysis of asynchronous stochastic approximation and q-learning

    Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and q-learning. InConference on Learning Theory (COLT), volume 125 ofProceedings of Machine Learning Research, pages 3185–3205. PMLR, 2020

  25. [25]

    Gaussian Approximation for Asynchronous Q-learning

    Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, and Alexey Naumov. Gaussian approximation for asynchronous q-learning.arXiv preprint arXiv:2604.07323, 2026

  26. [26]

    Efficient estimations from a slowly convergent Robbins-Monro process

    David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

  27. [27]

    Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for polyak-ruppert averaged linear stochastic approximation with applications to td learning.Advances in Neural Information Processing Systems, 37:12408–12460, 2024

  28. [28]

    Statistical inference for linear stochastic approximation with Markovian noise

    Sergey Samsonov, Marina Sheshukova, Eric Moulines, and Alexey Naumov. Statistical inference for linear stochastic approximation with Markovian noise. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  29. [29]

    Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

    Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli, 28(3):1548–1576, 2022

  30. [30]

    Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

    Marina Sheshukova, Sergey Samsonov, Denis Belomestny, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaussian approximation and multiplier bootstrap for stochastic gradient descent.arXiv preprint arXiv:2502.06719, 2025. 13

  31. [31]

    Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

    Vladimir Spokoiny and Mayya Zhilova. Bootstrap confidence sets under model misspecification.The Annals of Statistics, 43(6):2653 – 2675, 2015

  32. [32]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

  33. [33]

    Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

    John N Tsitsiklis. Asynchronous stochastic approximation and q-learning.Machine learning, 16(3):185– 202, 1994

  34. [34]

    Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

    Martin J Wainwright. Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

  35. [35]

    Christopher J. C. H. Watkins and Peter Dayan. Q-learning.Machine Learning, 8:279–292, 1992

  36. [36]

    arXiv preprint arXiv:2410.16106 , year=

    Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation.arXiv preprint arXiv:2410.16106, 2024

  37. [37]

    Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

    Weichen Wu, Yuting Wei, and Alessandro Rinaldo. Uncertainty quantification for Markov chains with application to temporal difference learning.arXiv preprint arXiv:2502.13822, 2025

  38. [38]

    A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

    Chuhan Xie and Zhihua Zhang. A statistical online inference approach in averaged stochastic approxi- mation.Advances in Neural Information Processing Systems, 35:8998–9009, 2022

  39. [39]

    Maximum entropy inverse reinforcement learning

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. InAaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

  40. [40]

    Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019

    Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for sarsa with linear function approximation.Advances in neural information processing systems, 32, 2019. 14 A Notations Table 1: Notation used throughout the paper Notation Meaning Markov Decision Process MMarkov Decision Process (MDP) SState space,|S|=S AAction space,|A|=A rReward functi...