From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Phalguni Nanda; Zaiwei Chen

arxiv: 2504.18743 · v2 · submitted 2025-04-25 · 💻 cs.LG · math.PR· stat.ML

From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Zaiwei Chen , Phalguni Nanda This is my paper

Pith reviewed 2026-05-22 17:30 UTC · model grok-4.3

classification 💻 cs.LG math.PRstat.ML

keywords Q-learningaverage-rewardfinite-time analysisasynchronous updatesadaptive stepsizesstochastic approximationreinforcement learning

0 comments

The pith

Adaptive stepsizes let asynchronous average-reward Q-learning converge at a 1/k rate to the optimal Q-function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper provides the first finite-time analysis showing last-iterate convergence for average-reward Q-learning when updates occur asynchronously across state-action pairs. It proves that the algorithm's iterates converge in mean square at a rate of roughly 1 over the iteration count, measured against the optimal Q-function in the span seminorm. Adding a centering step produces pointwise convergence to the centered optimal values at the same rate. These guarantees matter because they apply to realistic implementations where different state-action pairs are updated at different times, and they demonstrate that fixed or non-adaptive stepsizes cause the iterates to miss the correct target entirely.

Core claim

Under appropriate assumptions the iterates of this Q-learning algorithm converge at rate Õ(1/k) in the mean-square sense to the optimal Q-function in the span seminorm. Adaptive stepsizes are necessary; without them the algorithm fails to reach the correct fixed point. The stepsizes act as local clocks and can be viewed as implicit importance sampling that offsets the bias from asynchronous sampling. The proof proceeds by recasting the resulting non-Markovian stochastic approximation as a time-inhomogeneous Markovian process and then applying time-varying bounds together with Markov-chain concentration inequalities.

What carries the argument

Adaptive stepsizes viewed as implicit importance sampling that counteracts bias from asynchronous updates

If this is right

Last-iterate convergence holds rather than only Cesaro or setwise convergence.
Finite-time rates become available for practical asynchronous average-reward reinforcement learning.
Each state-action pair effectively maintains its own clock through the adaptive rule.
The same reformulation technique can be reused for other stochastic approximation schemes that employ adaptive stepsizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive-stepsize mechanism could be tested in other asynchronous reinforcement-learning methods such as SARSA or actor-critic variants.
Empirical measurement of the importance-sampling effect in high-asynchrony environments would directly test the paper's bias-cancellation argument.
Designing new adaptive rules that achieve faster than 1/k rates while preserving the fixed-point property appears feasible with the same analytical tools.

Load-bearing premise

Adaptive stepsizes fully counteract the bias from asynchronous updates so that the time-inhomogeneous Markovian reformulation still has the desired optimal Q-function as its fixed point.

What would settle it

Running the same algorithm with non-adaptive stepsizes and checking whether the mean-square distance to the optimal Q-function in the span seminorm stops decreasing or converges to the wrong limit.

Figures

Figures reproduced from arXiv: 2504.18743 by Phalguni Nanda, Zaiwei Chen.

**Figure 4.** Figure 4: Discounted [PITH_FULL_IMAGE:figures/full_fig_p063_4.png] view at source ↗

**Figure 6.** Figure 6: Adaptive Stepsize αk(s, a) = α(k+h) 1−z Nk(s,a)+h quickly because it yields larger steps, but the resulting variance takes longer to decay. Conversely, a larger z requires more iterations to diminish the initial bias, but exhibits a faster decay of the variance. 65 [PITH_FULL_IMAGE:figures/full_fig_p065_6.png] view at source ↗

read the original abstract

This work presents the first finite-time analysis for the last-iterate convergence of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this $Q$-learning algorithm converge at a rate of $\tilde{\mathcal{O}}(1/k)$ (in the mean-square sense) to the optimal $Q$-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal $Q$-function, also at a rate of $\tilde{\mathcal{O}}(1/k)$. To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each $Q$-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This abstract claims the first finite-time last-iterate rates for asynchronous average-reward Q-learning with adaptive stepsizes at Õ(1/k) mean-square in the span seminorm, plus a necessity result for adaptivity.

read the letter

The main thing to know is that the paper positions itself as delivering the first finite-time last-iterate analysis for asynchronous average-reward Q-learning that uses adaptive stepsizes as local clocks. It reports Õ(1/k) mean-square convergence to the optimal Q-function in the span seminorm, plus pointwise convergence after an added centering step, and argues that adaptivity is necessary because the algorithm otherwise misses the target fixed point.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to deliver the first finite-time analysis of last-iterate convergence for asynchronous average-reward Q-learning. It asserts that adaptive stepsizes, interpreted as local clocks and implicit importance sampling, enable Õ(1/k) mean-square convergence to the optimal Q-function in the span seminorm. Adding a centering step yields the same rate for pointwise convergence to the centered optimal Q-function. The proof strategy relies on a time-inhomogeneous Markovian reformulation of the resulting non-Markovian stochastic approximation together with almost-sure time-varying bounds, conditioning, and Markov-chain concentration inequalities. The authors also claim that adaptive stepsizes are necessary, as the algorithm fails to converge to the correct target without them.

Significance. If the stated rates and necessity result are rigorously established, the work would advance finite-time analysis of average-reward RL under asynchronous updates, a setting common in practice. The reformulation technique for handling adaptive stepsizes in stochastic approximation could apply more broadly. The necessity claim provides a useful supporting result that clarifies when such algorithms succeed.

major comments (1)

Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for acknowledging the potential significance of the finite-time last-iterate analysis for asynchronous average-reward Q-learning. We address the major comment below.

read point-by-point responses

Referee: [—] Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.

Authors: The abstract is intentionally concise and provides a high-level summary of the contributions, rates, and proof strategy, consistent with standard practice. The full manuscript contains the complete list of assumptions (Section 2), explicit error bounds (Theorems 1 and 2), and the detailed development of the time-inhomogeneous Markovian reformulation together with the almost-sure time-varying bounds, conditioning arguments, and Markov-chain concentration inequalities (Sections 3–5). These sections supply the technical details needed to verify the arguments. We can revise the abstract to include a brief statement of the key assumptions and a high-level outline of the proof approach if the referee finds this helpful. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract presents a finite-time analysis of average-reward Q-learning using adaptive stepsizes, reformulating the non-Markovian process as time-inhomogeneous Markovian SA and applying standard Markov-chain concentration inequalities plus almost-sure bounds. No equation or claim reduces the claimed Õ(1/k) rate in the span seminorm to a fitted parameter, self-defined target, or load-bearing self-citation; the necessity of adaptivity is stated as a supporting result rather than an unverified premise, and the derivation remains self-contained against external tools such as Markov chain theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on unspecified 'appropriate assumptions' that are not detailed in the abstract; these likely include ergodicity conditions on the MDP and regularity of the adaptive stepsizes, but no explicit free parameters, new entities, or ad-hoc axioms are visible.

axioms (1)

domain assumption appropriate assumptions on the underlying MDP and adaptive stepsizes
Invoked in the abstract to guarantee convergence; exact statement not provided.

pith-pipeline@v0.9.0 · 5816 in / 1259 out tokens · 59238 ms · 2026-05-22T17:30:54.478014+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 1 internal anchor

[1]

Abounadi,J.,Bertsekas,D.,andBorkar,V.S.(2001).LearningalgorithmsforMarkovdecisionprocesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698

work page 2001
[2]

Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

Abounadi,J.,Bertsekas,D.P.,andBorkar,V.(2002). Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

work page 2002
[3]

InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1

Agrawal,P.andAgrawal,S.(2025).OptimisticQ-learningforaveragerewardandepisodicreinforcement learning. InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1. PMLR

work page 2025
[4]

G., Gómez, V., and Kappen, H

Azar, M. G., Gómez, V., and Kappen, H. J. (2012). Dynamic policy programming.The Journal of Machine Learning Research, 13(1):3207–3245

work page 2012
[5]

Beck, C. L. and Srikant, R. (2012). Error bounds for constant stepsizeQ-learning.Systems & control letters, 61(12):1203–1208

work page 2012
[6]

Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant stepsize Q-learning. In2013 American Control Conference, pages 1926–1931. IEEE

work page 2013
[7]

(2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22

Benveniste, A., Métivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22. Springer Science & Business Media

work page 2012
[8]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific

work page 1996
[9]

Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR

work page 2018
[10]

and Zhang, S

Blaser, E. and Zhang, S. (2024). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise.Preprint arXiv:2409.19546

work page arXiv 2024
[11]

Borkar, V. S. (1998). Asynchronous stochastic approximations.SIAM Journal on Control and Opti- mization, 36(3):840–851

work page 1998
[12]

Borkar, V. S. (2008).Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univer- sity Press. 22

work page 2008
[13]

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469

work page 2000
[14]

Siam Review, 60(2):223–311

Bottou,L.,Curtis,F.E.,andNocedal,J.(2018).Optimizationmethodsforlarge-scalemachinelearning. Siam Review, 60(2):223–311

work page 2018
[15]

and Cominetti, R

Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Conver- gence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219

work page 2024
[16]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901

work page 2020
[17]

Bucklew, J. A. and Bucklew, J. (2004).Introduction to Rare Event Simulation, volume 5. Springer

work page 2004
[18]

and Borkar, V

Chandak, S. and Borkar, V. S. (2023). A concentration bound for TD(0) with function approximation. Preprint arXiv:2312.10424

work page arXiv 2023
[19]

Chandak, S

Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-timescale stochastic approximation with arbitrary norm contractions and Markovian noise.Preprint arXiv:2503.18391

work page arXiv 2025
[20]

Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

Chen,Z.,Clarke,J.-P.,andMaguluri,S.T.(2023). Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

work page 2023
[21]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33

work page 2020
[22]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite- sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367

work page 2024
[23]

T., and Zubeldia, M

Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approxi- mation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298–1352

work page
[24]

T., Clarke, J.-P., and Maguluri, S

Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinearstochasticapproximationwithapplicationsinreinforcementlearning.Automatica,146:110623

work page 2022
[25]

U., and Maguluri, S

Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm Lyapunov stability: From deterministic to stochastic iterative algorithms.Preprint arXiv:2502.14208

work page arXiv
[26]

Dai, J. G. and Gluzman, M. (2022). Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67

work page 2022
[27]

Devraj, A. M. and Meyn, S. (2017). ZapQ-learning. InAdvances in Neural Information Processing Systems, pages 2235–2244

work page 2017
[28]

Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probabilityboundsforlinearstochasticapproximationwithfixedstepsize.AdvancesinNeuralInformation Processing Systems, 34:30063–30074

work page 2021
[29]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. InInternational Conference on Machine Learning, pages 1407–1416. 23

work page 2018
[30]

Even-Dar, E., Mansour, Y., and Bartlett, P. (2003). Learning rates forQ-learning.Journal of machine learning Research, 5(1)

work page 2003
[31]

Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347–349

work page 2006
[32]

U., Khodadadian, S., and Maguluri, S

Haque, S. U., Khodadadian, S., and Maguluri, S. T. (2023). Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise.Preprint arXiv:2401.00364

work page arXiv 2023
[33]

Haque, S. U. and Maguluri, S. T. (2024). Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.Preprint arXiv:2410.21704

work page arXiv 2024
[34]

Cambridge University Press

Harchol-Balter,M.(2013).PerformanceModelingandDesignofComputerSystems: QueueingTheory in Action. Cambridge University Press

work page 2013
[35]

Huo, D., Chen, Y., and Xie, Q. (2023). Bias and extrapolation in Markovian linear stochastic approx- imation with constant stepsizes. InAbstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 81–82

work page 2023
[36]

I., and Singh, S

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710

work page 1994
[37]

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600

work page 2010
[38]

Jin, Y., Gummadi, R., Zhou, Z., and Blanchet, J. (2024). FeasibleQ-learning for average reward reinforcementlearning. InInternationalConferenceonArtificialIntelligenceandStatistics,pages1630–

work page 2024
[39]

and Sidford, A

Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. InInternational Conference on Machine Learning, pages 5055–5064. PMLR

work page 2021
[40]

Kara, A. D. and Yuksel, S. (2023).Q-learning for continuous state and action MDPs under average cost criteria.Preprint arXiv:2308.07591

work page arXiv 2023
[41]

Kloek,T.andVanDijk,H.K.(1978).Bayesianestimatesofequationsystemparameters: Anapplication of integration by Monte Carlo.Econometrica: Journal of the Econometric Society, pages 1–19

work page 1978
[42]

Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media

work page 2012
[43]

(2020).First-Order and Stochastic Optimization Methods for Machine Learning

Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer

work page 2020
[44]

Lauand, C. K. and Meyn, S. (2024). Revisiting stepsize assumptions in stochastic approximation. Preprint arXiv:2405.17834

work page arXiv 2024
[45]

Lee, D. (2024). Final iteration convergence bound ofQ-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772

work page 2024
[46]

Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc

work page 2017
[47]

Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity ofQ-learning. InInternational Conference on Machine Learning, pages 6296–6306. PMLR. 24

work page 2021
[48]

IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

Li,G.,Cai,C.,Chen,Y.,Wei,Y.,andChi,Y.(2024a). IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

work page
[49]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronousQ-learning: Sharper analysis and variance reduction.Advances in neural information processing systems, 33:7031– 7043

work page 2020
[50]

Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

Li,T.,Wu,F.,andLan,G.(2024b). Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

work page
[51]

Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of Polyak- Ruppert averagedQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR

work page 2023
[52]

Liu, B., Xie, Q., and Modiano, E. (2022). RL-QN: A reinforcement learning framework for optimal controlofqueueingsystems.ACMTransactionsonModelingandPerformanceEvaluationofComputing Systems, 7(1):1–35

work page 2022
[53]

Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

Mahadevan,S.(1996). Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

work page 1996
[54]

Ananalysisofreinforcementlearningwithfunction approximation

Melo,F.S.,Meyn,S.P.,andRibeiro,M.I.(2008). Ananalysisofreinforcementlearningwithfunction approximation. InProceedingsofthe25thinternationalconferenceonMachinelearning,pages664–671

work page 2008
[55]

Meyn, S. (2024). The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control

work page 2024
[56]

Meyn, S. P. and Tweedie, R. L. (2012).Markov Chains and Stochastic Stability. Springer Science & Business Media

work page 2012
[57]

K., Ostrovski, G., et al

Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A.A.,Veness,J.,Bellemare,M.G.,Graves,A.,Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning.nature, 518(7540):529–533

work page 2015
[58]

J., Wainwright, M

Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochas- tic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. InConference on Learning Theory, pages 2947–2997. PMLR

work page 2020
[59]

Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

Mou,W.,Pananjady,A.,Wainwright,M.J.,andBartlett,P.L.(2024). Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

work page 2024
[60]

and Chen, Z

Nanda, P. and Chen, Z. (2025). A minimal-assumption analysis of Q-learning with time-varying policies.Preprint arXiv:1910.02140

work page arXiv 2025
[61]

Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods.Electronic Journal of Probability, 20:1–32

work page 2015
[62]

Puterman, M. L. (2014).Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons

work page 2014
[63]

Qian, X., Xie, Z., Liu, X., and Zhang, S. (2024). Almost sure convergence rates and concentration of stochastic approximation and reinforcement learning with Markovian noise.Preprint arXiv:2411.13711. 25

work page arXiv 2024
[64]

and Wierman, A

Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR

work page 2020
[65]

and Monro, S

Robbins, H. and Monro, S. (1951). A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407

work page 1951
[66]

Shalev-Shwartz, S. et al. (2012). Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194

work page 2012
[67]

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144

work page 2018
[68]

Singh, B., Kumar, R., and Singh, V. P. (2022). Reinforcement learning in robotic applications: a comprehensive survey.Artificial Intelligence Review, 55(2):945–990

work page 2022
[69]

and Ying, L

Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830

work page 2019
[70]

Reinforcementlearningforcost-aware Markovdecisionprocesses

Suttle,W.,Zhang,K.,Yang,Z.,Liu,J.,andKraemer,D.(2021). Reinforcementlearningforcost-aware Markovdecisionprocesses. InInternationalConferenceonMachineLearning,pages9989–9999.PMLR

work page 2021
[71]

Sutton, R. S. and Barto, A. G. (2018).Reinforcement Learning: An Introduction. MIT press

work page 2018
[72]

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation andQ-learning.Machine learning, 16(3):185–202

work page 1994
[73]

Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning.Automatica, 35(11):1799–1808

work page 1999
[74]

Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.Preprint arXiv:1905.06265

work page internal anchor Pith review Pith/arXiv arXiv 1905
[75]

Wainwright, M. J. (2019b). Variance-reducedQ-learning is minimax optimal.Preprint arXiv:1906.04697

work page arXiv 1906
[76]

Wan, Y., Naik, A., and Sutton, R. (2021a). Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769

work page
[77]

Learningandplanninginaverage-rewardMarkovdecision processes

Wan,Y.,Naik,A.,andSutton,R.S.(2021b). Learningandplanninginaverage-rewardMarkovdecision processes. InInternational Conference on Machine Learning, pages 10653–10662. PMLR

work page
[78]

Wan, Y., Yu, H., and Sutton, R. S. (2024). On convergence of average-rewardQ-learning in weakly communicating Markov decision processes.Preprint arXiv:2408.16262

work page arXiv 2024
[79]

Wang, S., Blanchet, J., and Glynn, P. (2024). Optimal sample complexity for average reward Markov decision processes. InThe Twelfth International Conference on Learning Representations

work page 2024
[80]

Watkins, C. J. and Dayan, P. (1992).Q-learning.Machine learning, 8(3-4):279–292

work page 1992

Showing first 80 references.

[1] [1]

Abounadi,J.,Bertsekas,D.,andBorkar,V.S.(2001).LearningalgorithmsforMarkovdecisionprocesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698

work page 2001

[2] [2]

Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

Abounadi,J.,Bertsekas,D.P.,andBorkar,V.(2002). Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

work page 2002

[3] [3]

InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1

Agrawal,P.andAgrawal,S.(2025).OptimisticQ-learningforaveragerewardandepisodicreinforcement learning. InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1. PMLR

work page 2025

[4] [4]

G., Gómez, V., and Kappen, H

Azar, M. G., Gómez, V., and Kappen, H. J. (2012). Dynamic policy programming.The Journal of Machine Learning Research, 13(1):3207–3245

work page 2012

[5] [5]

Beck, C. L. and Srikant, R. (2012). Error bounds for constant stepsizeQ-learning.Systems & control letters, 61(12):1203–1208

work page 2012

[6] [6]

Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant stepsize Q-learning. In2013 American Control Conference, pages 1926–1931. IEEE

work page 2013

[7] [7]

(2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22

Benveniste, A., Métivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22. Springer Science & Business Media

work page 2012

[8] [8]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific

work page 1996

[9] [9]

Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR

work page 2018

[10] [10]

and Zhang, S

Blaser, E. and Zhang, S. (2024). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise.Preprint arXiv:2409.19546

work page arXiv 2024

[11] [11]

Borkar, V. S. (1998). Asynchronous stochastic approximations.SIAM Journal on Control and Opti- mization, 36(3):840–851

work page 1998

[12] [12]

Borkar, V. S. (2008).Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univer- sity Press. 22

work page 2008

[13] [13]

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469

work page 2000

[14] [14]

Siam Review, 60(2):223–311

Bottou,L.,Curtis,F.E.,andNocedal,J.(2018).Optimizationmethodsforlarge-scalemachinelearning. Siam Review, 60(2):223–311

work page 2018

[15] [15]

and Cominetti, R

Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Conver- gence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219

work page 2024

[16] [16]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901

work page 2020

[17] [17]

Bucklew, J. A. and Bucklew, J. (2004).Introduction to Rare Event Simulation, volume 5. Springer

work page 2004

[18] [18]

and Borkar, V

Chandak, S. and Borkar, V. S. (2023). A concentration bound for TD(0) with function approximation. Preprint arXiv:2312.10424

work page arXiv 2023

[19] [19]

Chandak, S

Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-timescale stochastic approximation with arbitrary norm contractions and Markovian noise.Preprint arXiv:2503.18391

work page arXiv 2025

[20] [20]

Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

Chen,Z.,Clarke,J.-P.,andMaguluri,S.T.(2023). Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

work page 2023

[21] [21]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33

work page 2020

[22] [22]

T., Shakkottai, S., and Shanmugam, K

Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite- sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367

work page 2024

[23] [23]

T., and Zubeldia, M

Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approxi- mation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298–1352

work page

[24] [24]

T., Clarke, J.-P., and Maguluri, S

Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinearstochasticapproximationwithapplicationsinreinforcementlearning.Automatica,146:110623

work page 2022

[25] [25]

U., and Maguluri, S

Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm Lyapunov stability: From deterministic to stochastic iterative algorithms.Preprint arXiv:2502.14208

work page arXiv

[26] [26]

Dai, J. G. and Gluzman, M. (2022). Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67

work page 2022

[27] [27]

Devraj, A. M. and Meyn, S. (2017). ZapQ-learning. InAdvances in Neural Information Processing Systems, pages 2235–2244

work page 2017

[28] [28]

Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probabilityboundsforlinearstochasticapproximationwithfixedstepsize.AdvancesinNeuralInformation Processing Systems, 34:30063–30074

work page 2021

[29] [29]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. InInternational Conference on Machine Learning, pages 1407–1416. 23

work page 2018

[30] [30]

Even-Dar, E., Mansour, Y., and Bartlett, P. (2003). Learning rates forQ-learning.Journal of machine learning Research, 5(1)

work page 2003

[31] [31]

Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347–349

work page 2006

[32] [32]

U., Khodadadian, S., and Maguluri, S

Haque, S. U., Khodadadian, S., and Maguluri, S. T. (2023). Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise.Preprint arXiv:2401.00364

work page arXiv 2023

[33] [33]

Haque, S. U. and Maguluri, S. T. (2024). Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.Preprint arXiv:2410.21704

work page arXiv 2024

[34] [34]

Cambridge University Press

Harchol-Balter,M.(2013).PerformanceModelingandDesignofComputerSystems: QueueingTheory in Action. Cambridge University Press

work page 2013

[35] [35]

Huo, D., Chen, Y., and Xie, Q. (2023). Bias and extrapolation in Markovian linear stochastic approx- imation with constant stepsizes. InAbstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 81–82

work page 2023

[36] [36]

I., and Singh, S

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710

work page 1994

[37] [37]

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600

work page 2010

[38] [38]

Jin, Y., Gummadi, R., Zhou, Z., and Blanchet, J. (2024). FeasibleQ-learning for average reward reinforcementlearning. InInternationalConferenceonArtificialIntelligenceandStatistics,pages1630–

work page 2024

[39] [39]

and Sidford, A

Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. InInternational Conference on Machine Learning, pages 5055–5064. PMLR

work page 2021

[40] [40]

Kara, A. D. and Yuksel, S. (2023).Q-learning for continuous state and action MDPs under average cost criteria.Preprint arXiv:2308.07591

work page arXiv 2023

[41] [41]

Kloek,T.andVanDijk,H.K.(1978).Bayesianestimatesofequationsystemparameters: Anapplication of integration by Monte Carlo.Econometrica: Journal of the Econometric Society, pages 1–19

work page 1978

[42] [42]

Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media

work page 2012

[43] [43]

(2020).First-Order and Stochastic Optimization Methods for Machine Learning

Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer

work page 2020

[44] [44]

Lauand, C. K. and Meyn, S. (2024). Revisiting stepsize assumptions in stochastic approximation. Preprint arXiv:2405.17834

work page arXiv 2024

[45] [45]

Lee, D. (2024). Final iteration convergence bound ofQ-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772

work page 2024

[46] [46]

Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc

work page 2017

[47] [47]

Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity ofQ-learning. InInternational Conference on Machine Learning, pages 6296–6306. PMLR. 24

work page 2021

[48] [48]

IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

Li,G.,Cai,C.,Chen,Y.,Wei,Y.,andChi,Y.(2024a). IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

work page

[49] [49]

Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronousQ-learning: Sharper analysis and variance reduction.Advances in neural information processing systems, 33:7031– 7043

work page 2020

[50] [50]

Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

Li,T.,Wu,F.,andLan,G.(2024b). Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

work page

[51] [51]

Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of Polyak- Ruppert averagedQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR

work page 2023

[52] [52]

Liu, B., Xie, Q., and Modiano, E. (2022). RL-QN: A reinforcement learning framework for optimal controlofqueueingsystems.ACMTransactionsonModelingandPerformanceEvaluationofComputing Systems, 7(1):1–35

work page 2022

[53] [53]

Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

Mahadevan,S.(1996). Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

work page 1996

[54] [54]

Ananalysisofreinforcementlearningwithfunction approximation

Melo,F.S.,Meyn,S.P.,andRibeiro,M.I.(2008). Ananalysisofreinforcementlearningwithfunction approximation. InProceedingsofthe25thinternationalconferenceonMachinelearning,pages664–671

work page 2008

[55] [55]

Meyn, S. (2024). The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control

work page 2024

[56] [56]

Meyn, S. P. and Tweedie, R. L. (2012).Markov Chains and Stochastic Stability. Springer Science & Business Media

work page 2012

[57] [57]

K., Ostrovski, G., et al

Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A.A.,Veness,J.,Bellemare,M.G.,Graves,A.,Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning.nature, 518(7540):529–533

work page 2015

[58] [58]

J., Wainwright, M

Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochas- tic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. InConference on Learning Theory, pages 2947–2997. PMLR

work page 2020

[59] [59]

Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

Mou,W.,Pananjady,A.,Wainwright,M.J.,andBartlett,P.L.(2024). Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

work page 2024

[60] [60]

and Chen, Z

Nanda, P. and Chen, Z. (2025). A minimal-assumption analysis of Q-learning with time-varying policies.Preprint arXiv:1910.02140

work page arXiv 2025

[61] [61]

Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods.Electronic Journal of Probability, 20:1–32

work page 2015

[62] [62]

Puterman, M. L. (2014).Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons

work page 2014

[63] [63]

Qian, X., Xie, Z., Liu, X., and Zhang, S. (2024). Almost sure convergence rates and concentration of stochastic approximation and reinforcement learning with Markovian noise.Preprint arXiv:2411.13711. 25

work page arXiv 2024

[64] [64]

and Wierman, A

Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR

work page 2020

[65] [65]

and Monro, S

Robbins, H. and Monro, S. (1951). A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407

work page 1951

[66] [66]

Shalev-Shwartz, S. et al. (2012). Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194

work page 2012

[67] [67]

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144

work page 2018

[68] [68]

Singh, B., Kumar, R., and Singh, V. P. (2022). Reinforcement learning in robotic applications: a comprehensive survey.Artificial Intelligence Review, 55(2):945–990

work page 2022

[69] [69]

and Ying, L

Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830

work page 2019

[70] [70]

Reinforcementlearningforcost-aware Markovdecisionprocesses

Suttle,W.,Zhang,K.,Yang,Z.,Liu,J.,andKraemer,D.(2021). Reinforcementlearningforcost-aware Markovdecisionprocesses. InInternationalConferenceonMachineLearning,pages9989–9999.PMLR

work page 2021

[71] [71]

Sutton, R. S. and Barto, A. G. (2018).Reinforcement Learning: An Introduction. MIT press

work page 2018

[72] [72]

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation andQ-learning.Machine learning, 16(3):185–202

work page 1994

[73] [73]

Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning.Automatica, 35(11):1799–1808

work page 1999

[74] [74]

Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.Preprint arXiv:1905.06265

work page internal anchor Pith review Pith/arXiv arXiv 1905

[75] [75]

Wainwright, M. J. (2019b). Variance-reducedQ-learning is minimax optimal.Preprint arXiv:1906.04697

work page arXiv 1906

[76] [76]

Wan, Y., Naik, A., and Sutton, R. (2021a). Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769

work page

[77] [77]

Learningandplanninginaverage-rewardMarkovdecision processes

Wan,Y.,Naik,A.,andSutton,R.S.(2021b). Learningandplanninginaverage-rewardMarkovdecision processes. InInternational Conference on Machine Learning, pages 10653–10662. PMLR

work page

[78] [78]

Wan, Y., Yu, H., and Sutton, R. S. (2024). On convergence of average-rewardQ-learning in weakly communicating Markov decision processes.Preprint arXiv:2408.16262

work page arXiv 2024

[79] [79]

Wang, S., Blanchet, J., and Glynn, P. (2024). Optimal sample complexity for average reward Markov decision processes. InThe Twelfth International Conference on Learning Representations

work page 2024

[80] [80]

Watkins, C. J. and Dayan, P. (1992).Q-learning.Machine learning, 8(3-4):279–292

work page 1992