From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes
Pith reviewed 2026-05-22 17:30 UTC · model grok-4.3
The pith
Adaptive stepsizes let asynchronous average-reward Q-learning converge at a 1/k rate to the optimal Q-function.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under appropriate assumptions the iterates of this Q-learning algorithm converge at rate Õ(1/k) in the mean-square sense to the optimal Q-function in the span seminorm. Adaptive stepsizes are necessary; without them the algorithm fails to reach the correct fixed point. The stepsizes act as local clocks and can be viewed as implicit importance sampling that offsets the bias from asynchronous sampling. The proof proceeds by recasting the resulting non-Markovian stochastic approximation as a time-inhomogeneous Markovian process and then applying time-varying bounds together with Markov-chain concentration inequalities.
What carries the argument
Adaptive stepsizes viewed as implicit importance sampling that counteracts bias from asynchronous updates
If this is right
- Last-iterate convergence holds rather than only Cesaro or setwise convergence.
- Finite-time rates become available for practical asynchronous average-reward reinforcement learning.
- Each state-action pair effectively maintains its own clock through the adaptive rule.
- The same reformulation technique can be reused for other stochastic approximation schemes that employ adaptive stepsizes.
Where Pith is reading between the lines
- The same adaptive-stepsize mechanism could be tested in other asynchronous reinforcement-learning methods such as SARSA or actor-critic variants.
- Empirical measurement of the importance-sampling effect in high-asynchrony environments would directly test the paper's bias-cancellation argument.
- Designing new adaptive rules that achieve faster than 1/k rates while preserving the fixed-point property appears feasible with the same analytical tools.
Load-bearing premise
Adaptive stepsizes fully counteract the bias from asynchronous updates so that the time-inhomogeneous Markovian reformulation still has the desired optimal Q-function as its fixed point.
What would settle it
Running the same algorithm with non-adaptive stepsizes and checking whether the mean-square distance to the optimal Q-function in the span seminorm stops decreasing or converges to the wrong limit.
Figures
read the original abstract
This work presents the first finite-time analysis for the last-iterate convergence of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this $Q$-learning algorithm converge at a rate of $\tilde{\mathcal{O}}(1/k)$ (in the mean-square sense) to the optimal $Q$-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal $Q$-function, also at a rate of $\tilde{\mathcal{O}}(1/k)$. To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each $Q$-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to deliver the first finite-time analysis of last-iterate convergence for asynchronous average-reward Q-learning. It asserts that adaptive stepsizes, interpreted as local clocks and implicit importance sampling, enable Õ(1/k) mean-square convergence to the optimal Q-function in the span seminorm. Adding a centering step yields the same rate for pointwise convergence to the centered optimal Q-function. The proof strategy relies on a time-inhomogeneous Markovian reformulation of the resulting non-Markovian stochastic approximation together with almost-sure time-varying bounds, conditioning, and Markov-chain concentration inequalities. The authors also claim that adaptive stepsizes are necessary, as the algorithm fails to converge to the correct target without them.
Significance. If the stated rates and necessity result are rigorously established, the work would advance finite-time analysis of average-reward RL under asynchronous updates, a setting common in practice. The reformulation technique for handling adaptive stepsizes in stochastic approximation could apply more broadly. The necessity claim provides a useful supporting result that clarifies when such algorithms succeed.
major comments (1)
- Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for acknowledging the potential significance of the finite-time last-iterate analysis for asynchronous average-reward Q-learning. We address the major comment below.
read point-by-point responses
-
Referee: [—] Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.
Authors: The abstract is intentionally concise and provides a high-level summary of the contributions, rates, and proof strategy, consistent with standard practice. The full manuscript contains the complete list of assumptions (Section 2), explicit error bounds (Theorems 1 and 2), and the detailed development of the time-inhomogeneous Markovian reformulation together with the almost-sure time-varying bounds, conditioning arguments, and Markov-chain concentration inequalities (Sections 3–5). These sections supply the technical details needed to verify the arguments. We can revise the abstract to include a brief statement of the key assumptions and a high-level outline of the proof approach if the referee finds this helpful. revision: partial
Circularity Check
No significant circularity identified
full rationale
The abstract presents a finite-time analysis of average-reward Q-learning using adaptive stepsizes, reformulating the non-Markovian process as time-inhomogeneous Markovian SA and applying standard Markov-chain concentration inequalities plus almost-sure bounds. No equation or claim reduces the claimed Õ(1/k) rate in the span seminorm to a fitted parameter, self-defined target, or load-bearing self-citation; the necessity of adaptivity is stated as a supporting result rather than an unverified premise, and the derivation remains self-contained against external tools such as Markov chain theory.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption appropriate assumptions on the underlying MDP and adaptive stepsizes
Reference graph
Works this paper leans on
-
[1]
Abounadi,J.,Bertsekas,D.,andBorkar,V.S.(2001).LearningalgorithmsforMarkovdecisionprocesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698
work page 2001
-
[2]
Abounadi,J.,Bertsekas,D.P.,andBorkar,V.(2002). Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22
work page 2002
-
[3]
InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1
Agrawal,P.andAgrawal,S.(2025).OptimisticQ-learningforaveragerewardandepisodicreinforcement learning. InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1. PMLR
work page 2025
-
[4]
Azar, M. G., Gómez, V., and Kappen, H. J. (2012). Dynamic policy programming.The Journal of Machine Learning Research, 13(1):3207–3245
work page 2012
-
[5]
Beck, C. L. and Srikant, R. (2012). Error bounds for constant stepsizeQ-learning.Systems & control letters, 61(12):1203–1208
work page 2012
-
[6]
Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant stepsize Q-learning. In2013 American Control Conference, pages 1926–1931. IEEE
work page 2013
-
[7]
(2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22
Benveniste, A., Métivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22. Springer Science & Business Media
work page 2012
-
[8]
Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific
work page 1996
-
[9]
Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR
work page 2018
-
[10]
Blaser, E. and Zhang, S. (2024). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise.Preprint arXiv:2409.19546
-
[11]
Borkar, V. S. (1998). Asynchronous stochastic approximations.SIAM Journal on Control and Opti- mization, 36(3):840–851
work page 1998
-
[12]
Borkar, V. S. (2008).Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univer- sity Press. 22
work page 2008
-
[13]
Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469
work page 2000
-
[14]
Bottou,L.,Curtis,F.E.,andNocedal,J.(2018).Optimizationmethodsforlarge-scalemachinelearning. Siam Review, 60(2):223–311
work page 2018
-
[15]
Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Conver- gence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219
work page 2024
-
[16]
D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901
work page 2020
-
[17]
Bucklew, J. A. and Bucklew, J. (2004).Introduction to Rare Event Simulation, volume 5. Springer
work page 2004
-
[18]
Chandak, S. and Borkar, V. S. (2023). A concentration bound for TD(0) with function approximation. Preprint arXiv:2312.10424
-
[19]
Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-timescale stochastic approximation with arbitrary norm contractions and Markovian noise.Preprint arXiv:2503.18391
-
[20]
Chen,Z.,Clarke,J.-P.,andMaguluri,S.T.(2023). Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101
work page 2023
-
[21]
T., Shakkottai, S., and Shanmugam, K
Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33
work page 2020
-
[22]
T., Shakkottai, S., and Shanmugam, K
Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite- sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367
work page 2024
-
[23]
Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approxi- mation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298–1352
-
[24]
T., Clarke, J.-P., and Maguluri, S
Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinearstochasticapproximationwithapplicationsinreinforcementlearning.Automatica,146:110623
work page 2022
-
[25]
Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm Lyapunov stability: From deterministic to stochastic iterative algorithms.Preprint arXiv:2502.14208
-
[26]
Dai, J. G. and Gluzman, M. (2022). Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67
work page 2022
-
[27]
Devraj, A. M. and Meyn, S. (2017). ZapQ-learning. InAdvances in Neural Information Processing Systems, pages 2235–2244
work page 2017
-
[28]
Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probabilityboundsforlinearstochasticapproximationwithfixedstepsize.AdvancesinNeuralInformation Processing Systems, 34:30063–30074
work page 2021
-
[29]
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. InInternational Conference on Machine Learning, pages 1407–1416. 23
work page 2018
-
[30]
Even-Dar, E., Mansour, Y., and Bartlett, P. (2003). Learning rates forQ-learning.Journal of machine learning Research, 5(1)
work page 2003
-
[31]
Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347–349
work page 2006
-
[32]
U., Khodadadian, S., and Maguluri, S
Haque, S. U., Khodadadian, S., and Maguluri, S. T. (2023). Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise.Preprint arXiv:2401.00364
- [33]
-
[34]
Harchol-Balter,M.(2013).PerformanceModelingandDesignofComputerSystems: QueueingTheory in Action. Cambridge University Press
work page 2013
-
[35]
Huo, D., Chen, Y., and Xie, Q. (2023). Bias and extrapolation in Markovian linear stochastic approx- imation with constant stepsizes. InAbstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 81–82
work page 2023
-
[36]
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710
work page 1994
-
[37]
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600
work page 2010
-
[38]
Jin, Y., Gummadi, R., Zhou, Z., and Blanchet, J. (2024). FeasibleQ-learning for average reward reinforcementlearning. InInternationalConferenceonArtificialIntelligenceandStatistics,pages1630–
work page 2024
-
[39]
Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. InInternational Conference on Machine Learning, pages 5055–5064. PMLR
work page 2021
- [40]
-
[41]
Kloek,T.andVanDijk,H.K.(1978).Bayesianestimatesofequationsystemparameters: Anapplication of integration by Monte Carlo.Econometrica: Journal of the Econometric Society, pages 1–19
work page 1978
-
[42]
Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media
work page 2012
-
[43]
(2020).First-Order and Stochastic Optimization Methods for Machine Learning
Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer
work page 2020
- [44]
-
[45]
Lee, D. (2024). Final iteration convergence bound ofQ-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772
work page 2024
-
[46]
Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc
work page 2017
-
[47]
Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity ofQ-learning. InInternational Conference on Machine Learning, pages 6296–6306. PMLR. 24
work page 2021
-
[48]
IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236
Li,G.,Cai,C.,Chen,Y.,Wei,Y.,andChi,Y.(2024a). IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236
-
[49]
Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronousQ-learning: Sharper analysis and variance reduction.Advances in neural information processing systems, 33:7031– 7043
work page 2020
-
[50]
Li,T.,Wu,F.,andLan,G.(2024b). Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research
-
[51]
Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of Polyak- Ruppert averagedQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR
work page 2023
-
[52]
Liu, B., Xie, Q., and Modiano, E. (2022). RL-QN: A reinforcement learning framework for optimal controlofqueueingsystems.ACMTransactionsonModelingandPerformanceEvaluationofComputing Systems, 7(1):1–35
work page 2022
-
[53]
Mahadevan,S.(1996). Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195
work page 1996
-
[54]
Ananalysisofreinforcementlearningwithfunction approximation
Melo,F.S.,Meyn,S.P.,andRibeiro,M.I.(2008). Ananalysisofreinforcementlearningwithfunction approximation. InProceedingsofthe25thinternationalconferenceonMachinelearning,pages664–671
work page 2008
-
[55]
Meyn, S. (2024). The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control
work page 2024
-
[56]
Meyn, S. P. and Tweedie, R. L. (2012).Markov Chains and Stochastic Stability. Springer Science & Business Media
work page 2012
-
[57]
Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A.A.,Veness,J.,Bellemare,M.G.,Graves,A.,Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning.nature, 518(7540):529–533
work page 2015
-
[58]
Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochas- tic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. InConference on Learning Theory, pages 2947–2997. PMLR
work page 2020
-
[59]
Mou,W.,Pananjady,A.,Wainwright,M.J.,andBartlett,P.L.(2024). Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153
work page 2024
-
[60]
Nanda, P. and Chen, Z. (2025). A minimal-assumption analysis of Q-learning with time-varying policies.Preprint arXiv:1910.02140
-
[61]
Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods.Electronic Journal of Probability, 20:1–32
work page 2015
-
[62]
Puterman, M. L. (2014).Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons
work page 2014
- [63]
-
[64]
Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR
work page 2020
-
[65]
Robbins, H. and Monro, S. (1951). A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407
work page 1951
-
[66]
Shalev-Shwartz, S. et al. (2012). Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194
work page 2012
-
[67]
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144
work page 2018
-
[68]
Singh, B., Kumar, R., and Singh, V. P. (2022). Reinforcement learning in robotic applications: a comprehensive survey.Artificial Intelligence Review, 55(2):945–990
work page 2022
-
[69]
Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830
work page 2019
-
[70]
Reinforcementlearningforcost-aware Markovdecisionprocesses
Suttle,W.,Zhang,K.,Yang,Z.,Liu,J.,andKraemer,D.(2021). Reinforcementlearningforcost-aware Markovdecisionprocesses. InInternationalConferenceonMachineLearning,pages9989–9999.PMLR
work page 2021
-
[71]
Sutton, R. S. and Barto, A. G. (2018).Reinforcement Learning: An Introduction. MIT press
work page 2018
-
[72]
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation andQ-learning.Machine learning, 16(3):185–202
work page 1994
-
[73]
Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning.Automatica, 35(11):1799–1808
work page 1999
-
[74]
Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.Preprint arXiv:1905.06265
work page internal anchor Pith review Pith/arXiv arXiv 1905
- [75]
-
[76]
Wan, Y., Naik, A., and Sutton, R. (2021a). Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769
-
[77]
Learningandplanninginaverage-rewardMarkovdecision processes
Wan,Y.,Naik,A.,andSutton,R.S.(2021b). Learningandplanninginaverage-rewardMarkovdecision processes. InInternational Conference on Machine Learning, pages 10653–10662. PMLR
- [78]
-
[79]
Wang, S., Blanchet, J., and Glynn, P. (2024). Optimal sample complexity for average reward Markov decision processes. InThe Twelfth International Conference on Learning Representations
work page 2024
-
[80]
Watkins, C. J. and Dayan, P. (1992).Q-learning.Machine learning, 8(3-4):279–292
work page 1992
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.