pith. sign in

arxiv: 2504.18743 · v2 · submitted 2025-04-25 · 💻 cs.LG · math.PR· stat.ML

From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Pith reviewed 2026-05-22 17:30 UTC · model grok-4.3

classification 💻 cs.LG math.PRstat.ML
keywords Q-learningaverage-rewardfinite-time analysisasynchronous updatesadaptive stepsizesstochastic approximationreinforcement learning
0
0 comments X

The pith

Adaptive stepsizes let asynchronous average-reward Q-learning converge at a 1/k rate to the optimal Q-function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper provides the first finite-time analysis showing last-iterate convergence for average-reward Q-learning when updates occur asynchronously across state-action pairs. It proves that the algorithm's iterates converge in mean square at a rate of roughly 1 over the iteration count, measured against the optimal Q-function in the span seminorm. Adding a centering step produces pointwise convergence to the centered optimal values at the same rate. These guarantees matter because they apply to realistic implementations where different state-action pairs are updated at different times, and they demonstrate that fixed or non-adaptive stepsizes cause the iterates to miss the correct target entirely.

Core claim

Under appropriate assumptions the iterates of this Q-learning algorithm converge at rate Õ(1/k) in the mean-square sense to the optimal Q-function in the span seminorm. Adaptive stepsizes are necessary; without them the algorithm fails to reach the correct fixed point. The stepsizes act as local clocks and can be viewed as implicit importance sampling that offsets the bias from asynchronous sampling. The proof proceeds by recasting the resulting non-Markovian stochastic approximation as a time-inhomogeneous Markovian process and then applying time-varying bounds together with Markov-chain concentration inequalities.

What carries the argument

Adaptive stepsizes viewed as implicit importance sampling that counteracts bias from asynchronous updates

If this is right

  • Last-iterate convergence holds rather than only Cesaro or setwise convergence.
  • Finite-time rates become available for practical asynchronous average-reward reinforcement learning.
  • Each state-action pair effectively maintains its own clock through the adaptive rule.
  • The same reformulation technique can be reused for other stochastic approximation schemes that employ adaptive stepsizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptive-stepsize mechanism could be tested in other asynchronous reinforcement-learning methods such as SARSA or actor-critic variants.
  • Empirical measurement of the importance-sampling effect in high-asynchrony environments would directly test the paper's bias-cancellation argument.
  • Designing new adaptive rules that achieve faster than 1/k rates while preserving the fixed-point property appears feasible with the same analytical tools.

Load-bearing premise

Adaptive stepsizes fully counteract the bias from asynchronous updates so that the time-inhomogeneous Markovian reformulation still has the desired optimal Q-function as its fixed point.

What would settle it

Running the same algorithm with non-adaptive stepsizes and checking whether the mean-square distance to the optimal Q-function in the span seminorm stops decreasing or converges to the wrong limit.

Figures

Figures reproduced from arXiv: 2504.18743 by Phalguni Nanda, Zaiwei Chen.

Figure 1
Figure 1. Figure 1: Convergence to Q∗ in span(·) [PITH_FULL_IMAGE:figures/full_fig_p063_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: Discounted [PITH_FULL_IMAGE:figures/full_fig_p063_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Adaptive Stepsize αk(s, a) = α(k+h) 1−z Nk(s,a)+h quickly because it yields larger steps, but the resulting variance takes longer to decay. Conversely, a larger z requires more iterations to diminish the initial bias, but exhibits a faster decay of the variance. 65 [PITH_FULL_IMAGE:figures/full_fig_p065_6.png] view at source ↗
read the original abstract

This work presents the first finite-time analysis for the last-iterate convergence of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that, under appropriate assumptions, the iterates generated by this $Q$-learning algorithm converge at a rate of $\tilde{\mathcal{O}}(1/k)$ (in the mean-square sense) to the optimal $Q$-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to the centered optimal $Q$-function, also at a rate of $\tilde{\mathcal{O}}(1/k)$. To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each $Q$-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to deliver the first finite-time analysis of last-iterate convergence for asynchronous average-reward Q-learning. It asserts that adaptive stepsizes, interpreted as local clocks and implicit importance sampling, enable Õ(1/k) mean-square convergence to the optimal Q-function in the span seminorm. Adding a centering step yields the same rate for pointwise convergence to the centered optimal Q-function. The proof strategy relies on a time-inhomogeneous Markovian reformulation of the resulting non-Markovian stochastic approximation together with almost-sure time-varying bounds, conditioning, and Markov-chain concentration inequalities. The authors also claim that adaptive stepsizes are necessary, as the algorithm fails to converge to the correct target without them.

Significance. If the stated rates and necessity result are rigorously established, the work would advance finite-time analysis of average-reward RL under asynchronous updates, a setting common in practice. The reformulation technique for handling adaptive stepsizes in stochastic approximation could apply more broadly. The necessity claim provides a useful supporting result that clarifies when such algorithms succeed.

major comments (1)
  1. Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for acknowledging the potential significance of the finite-time last-iterate analysis for asynchronous average-reward Q-learning. We address the major comment below.

read point-by-point responses
  1. Referee: [—] Abstract: the manuscript states the Õ(1/k) mean-square rates and the necessity of adaptive stepsizes but supplies no derivation details, explicit error bounds, or complete assumption list. Without these, the correctness of the time-inhomogeneous Markovian reformulation and the concentration arguments used to break correlations cannot be verified.

    Authors: The abstract is intentionally concise and provides a high-level summary of the contributions, rates, and proof strategy, consistent with standard practice. The full manuscript contains the complete list of assumptions (Section 2), explicit error bounds (Theorems 1 and 2), and the detailed development of the time-inhomogeneous Markovian reformulation together with the almost-sure time-varying bounds, conditioning arguments, and Markov-chain concentration inequalities (Sections 3–5). These sections supply the technical details needed to verify the arguments. We can revise the abstract to include a brief statement of the key assumptions and a high-level outline of the proof approach if the referee finds this helpful. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract presents a finite-time analysis of average-reward Q-learning using adaptive stepsizes, reformulating the non-Markovian process as time-inhomogeneous Markovian SA and applying standard Markov-chain concentration inequalities plus almost-sure bounds. No equation or claim reduces the claimed Õ(1/k) rate in the span seminorm to a fitted parameter, self-defined target, or load-bearing self-citation; the necessity of adaptivity is stated as a supporting result rather than an unverified premise, and the derivation remains self-contained against external tools such as Markov chain theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on unspecified 'appropriate assumptions' that are not detailed in the abstract; these likely include ergodicity conditions on the MDP and regularity of the adaptive stepsizes, but no explicit free parameters, new entities, or ad-hoc axioms are visible.

axioms (1)
  • domain assumption appropriate assumptions on the underlying MDP and adaptive stepsizes
    Invoked in the abstract to guarantee convergence; exact statement not provided.

pith-pipeline@v0.9.0 · 5816 in / 1259 out tokens · 59238 ms · 2026-05-22T17:30:54.478014+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 1 internal anchor

  1. [1]

    Abounadi,J.,Bertsekas,D.,andBorkar,V.S.(2001).LearningalgorithmsforMarkovdecisionprocesses with average cost.SIAM Journal on Control and Optimization, 40(3):681–698

  2. [2]

    Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

    Abounadi,J.,Bertsekas,D.P.,andBorkar,V.(2002). Stochasticapproximationfornonexpansivemaps: Application toQ-learning algorithms.SIAM Journal on Control and Optimization, 41(1):1–22

  3. [3]

    InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1

    Agrawal,P.andAgrawal,S.(2025).OptimisticQ-learningforaveragerewardandepisodicreinforcement learning. InThe Thirty Eighth Annual Conference on Learning Theory, pages 1–1. PMLR

  4. [4]

    G., Gómez, V., and Kappen, H

    Azar, M. G., Gómez, V., and Kappen, H. J. (2012). Dynamic policy programming.The Journal of Machine Learning Research, 13(1):3207–3245

  5. [5]

    Beck, C. L. and Srikant, R. (2012). Error bounds for constant stepsizeQ-learning.Systems & control letters, 61(12):1203–1208

  6. [6]

    Beck, C. L. and Srikant, R. (2013). Improved upper bounds on the expected error in constant stepsize Q-learning. In2013 American Control Conference, pages 1926–1931. IEEE

  7. [7]

    (2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22

    Benveniste, A., Métivier, M., and Priouret, P. (2012).Adaptive Algorithms and Stochastic Approxima- tions, volume 22. Springer Science & Business Media

  8. [8]

    Bertsekas, D. P. and Tsitsiklis, J. N. (1996).Neuro-Dynamic Programming. Athena Scientific

  9. [9]

    Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR

  10. [10]

    and Zhang, S

    Blaser, E. and Zhang, S. (2024). Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise.Preprint arXiv:2409.19546

  11. [11]

    Borkar, V. S. (1998). Asynchronous stochastic approximations.SIAM Journal on Control and Opti- mization, 36(3):840–851

  12. [12]

    Borkar, V. S. (2008).Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univer- sity Press. 22

  13. [13]

    Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469

  14. [14]

    Siam Review, 60(2):223–311

    Bottou,L.,Curtis,F.E.,andNocedal,J.(2018).Optimizationmethodsforlarge-scalemachinelearning. Siam Review, 60(2):223–311

  15. [15]

    and Cominetti, R

    Bravo, M. and Cominetti, R. (2024). Stochastic fixed-point iterations for nonexpansive maps: Conver- gence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219

  16. [16]

    D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901

  17. [17]

    Bucklew, J. A. and Bucklew, J. (2004).Introduction to Rare Event Simulation, volume 5. Springer

  18. [18]

    and Borkar, V

    Chandak, S. and Borkar, V. S. (2023). A concentration bound for TD(0) with function approximation. Preprint arXiv:2312.10424

  19. [19]

    Chandak, S

    Chandak, S., Haque, S. U., and Bambos, N. (2025). Finite-time bounds for two-timescale stochastic approximation with arbitrary norm contractions and Markovian noise.Preprint arXiv:2503.18391

  20. [20]

    Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

    Chen,Z.,Clarke,J.-P.,andMaguluri,S.T.(2023). Targetnetworkandtruncationovercomethedeadly triad inQ-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101

  21. [21]

    T., Shakkottai, S., and Shanmugam, K

    Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes.Advances in Neural Information Processing Systems, 33

  22. [22]

    T., Shakkottai, S., and Shanmugam, K

    Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite- sample guarantees of Markovian stochastic approximation.Operations Research, 72(4):1352–1367

  23. [23]

    T., and Zubeldia, M

    Chen, Z., Maguluri, S. T., and Zubeldia, M. (2025a). Concentration of contractive stochastic approxi- mation: Additive and multiplicative noise.The Annals of Applied Probability, 35(2):1298–1352

  24. [24]

    T., Clarke, J.-P., and Maguluri, S

    Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinearstochasticapproximationwithapplicationsinreinforcementlearning.Automatica,146:110623

  25. [25]

    U., and Maguluri, S

    Chen, Z., Zhang, S., Zhang, Z., Haque, S. U., and Maguluri, S. T. (2025b). A non-asymptotic theory of seminorm Lyapunov stability: From deterministic to stochastic iterative algorithms.Preprint arXiv:2502.14208

  26. [26]

    Dai, J. G. and Gluzman, M. (2022). Queueing network controls via deep reinforcement learning. Stochastic Systems, 12(1):30–67

  27. [27]

    Devraj, A. M. and Meyn, S. (2017). ZapQ-learning. InAdvances in Neural Information Processing Systems, pages 2235–2244

  28. [28]

    Durmus, A., Moulines, E., Naumov, A., Samsonov, S., Scaman, K., and Wai, H.-T. (2021). Tight high probabilityboundsforlinearstochasticapproximationwithfixedstepsize.AdvancesinNeuralInformation Processing Systems, 34:30063–30074

  29. [29]

    Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. InInternational Conference on Machine Learning, pages 1407–1416. 23

  30. [30]

    Even-Dar, E., Mansour, Y., and Bartlett, P. (2003). Learning rates forQ-learning.Journal of machine learning Research, 5(1)

  31. [31]

    Gosavi, A. (2006). Boundedness of iterates inQ-learning.Systems & control letters, 55(4):347–349

  32. [32]

    U., Khodadadian, S., and Maguluri, S

    Haque, S. U., Khodadadian, S., and Maguluri, S. T. (2023). Tight finite time bounds of two-time-scale linear stochastic approximation with Markovian noise.Preprint arXiv:2401.00364

  33. [33]

    Haque, S. U. and Maguluri, S. T. (2024). Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.Preprint arXiv:2410.21704

  34. [34]

    Cambridge University Press

    Harchol-Balter,M.(2013).PerformanceModelingandDesignofComputerSystems: QueueingTheory in Action. Cambridge University Press

  35. [35]

    Huo, D., Chen, Y., and Xie, Q. (2023). Bias and extrapolation in Markovian linear stochastic approx- imation with constant stepsizes. InAbstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 81–82

  36. [36]

    I., and Singh, S

    Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710

  37. [37]

    Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600

  38. [38]

    Jin, Y., Gummadi, R., Zhou, Z., and Blanchet, J. (2024). FeasibleQ-learning for average reward reinforcementlearning. InInternationalConferenceonArtificialIntelligenceandStatistics,pages1630–

  39. [39]

    and Sidford, A

    Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. InInternational Conference on Machine Learning, pages 5055–5064. PMLR

  40. [40]

    Kara, A. D. and Yuksel, S. (2023).Q-learning for continuous state and action MDPs under average cost criteria.Preprint arXiv:2308.07591

  41. [41]

    Kloek,T.andVanDijk,H.K.(1978).Bayesianestimatesofequationsystemparameters: Anapplication of integration by Monte Carlo.Econometrica: Journal of the Econometric Society, pages 1–19

  42. [42]

    Kushner, H. J. and Clark, D. S. (2012).Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media

  43. [43]

    (2020).First-Order and Stochastic Optimization Methods for Machine Learning

    Lan, G. (2020).First-Order and Stochastic Optimization Methods for Machine Learning. Springer

  44. [44]

    Lauand, C. K. and Meyn, S. (2024). Revisiting stepsize assumptions in stochastic approximation. Preprint arXiv:2405.17834

  45. [45]

    Lee, D. (2024). Final iteration convergence bound ofQ-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772

  46. [46]

    Levin, D. A. and Peres, Y. (2017).Markov Chains and Mixing Times, volume 107. American Mathematical Soc

  47. [47]

    Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Tightening the dependence on horizon in the sample complexity ofQ-learning. InInternational Conference on Machine Learning, pages 6296–6306. PMLR. 24

  48. [48]

    IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

    Li,G.,Cai,C.,Chen,Y.,Wei,Y.,andChi,Y.(2024a). IsQ-learningminimaxoptimal? atightsample complexity analysis.Operations Research, 72(1):222–236

  49. [49]

    Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Sample complexity of asynchronousQ-learning: Sharper analysis and variance reduction.Advances in neural information processing systems, 33:7031– 7043

  50. [50]

    Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

    Li,T.,Wu,F.,andLan,G.(2024b). Stochasticfirst-ordermethodsforaverage-rewardMarkovdecision processes.Mathematics of Operations Research

  51. [51]

    Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of Polyak- Ruppert averagedQ-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR

  52. [52]

    Liu, B., Xie, Q., and Modiano, E. (2022). RL-QN: A reinforcement learning framework for optimal controlofqueueingsystems.ACMTransactionsonModelingandPerformanceEvaluationofComputing Systems, 7(1):1–35

  53. [53]

    Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

    Mahadevan,S.(1996). Averagerewardreinforcementlearning: Foundations,algorithms,andempirical results.Machine learning, 22(1):159–195

  54. [54]

    Ananalysisofreinforcementlearningwithfunction approximation

    Melo,F.S.,Meyn,S.P.,andRibeiro,M.I.(2008). Ananalysisofreinforcementlearningwithfunction approximation. InProceedingsofthe25thinternationalconferenceonMachinelearning,pages664–671

  55. [55]

    Meyn, S. (2024). The projected Bellman equation in reinforcement learning.IEEE Transactions on Automatic Control

  56. [56]

    Meyn, S. P. and Tweedie, R. L. (2012).Markov Chains and Stochastic Stability. Springer Science & Business Media

  57. [57]

    K., Ostrovski, G., et al

    Mnih,V.,Kavukcuoglu,K.,Silver,D.,Rusu,A.A.,Veness,J.,Bellemare,M.G.,Graves,A.,Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning.nature, 518(7540):529–533

  58. [58]

    J., Wainwright, M

    Mou, W., Li, C. J., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. (2020). On linear stochas- tic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. InConference on Learning Theory, pages 2947–2997. PMLR

  59. [59]

    Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

    Mou,W.,Pananjady,A.,Wainwright,M.J.,andBartlett,P.L.(2024). Optimalandinstance-dependent guaranteesforMarkovianlinearstochasticapproximation.MathematicalStatisticsandLearning,7(1):41– 153

  60. [60]

    and Chen, Z

    Nanda, P. and Chen, Z. (2025). A minimal-assumption analysis of Q-learning with time-varying policies.Preprint arXiv:1910.02140

  61. [61]

    Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods.Electronic Journal of Probability, 20:1–32

  62. [62]

    Puterman, M. L. (2014).Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons

  63. [63]

    Qian, X., Xie, Z., Liu, X., and Zhang, S. (2024). Almost sure convergence rates and concentration of stochastic approximation and reinforcement learning with Markovian noise.Preprint arXiv:2411.13711. 25

  64. [64]

    and Wierman, A

    Qu, G. and Wierman, A. (2020). Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR

  65. [65]

    and Monro, S

    Robbins, H. and Monro, S. (1951). A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407

  66. [66]

    Shalev-Shwartz, S. et al. (2012). Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194

  67. [67]

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144

  68. [68]

    Singh, B., Kumar, R., and Singh, V. P. (2022). Reinforcement learning in robotic applications: a comprehensive survey.Artificial Intelligence Review, 55(2):945–990

  69. [69]

    and Ying, L

    Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830

  70. [70]

    Reinforcementlearningforcost-aware Markovdecisionprocesses

    Suttle,W.,Zhang,K.,Yang,Z.,Liu,J.,andKraemer,D.(2021). Reinforcementlearningforcost-aware Markovdecisionprocesses. InInternationalConferenceonMachineLearning,pages9989–9999.PMLR

  71. [71]

    Sutton, R. S. and Barto, A. G. (2018).Reinforcement Learning: An Introduction. MIT press

  72. [72]

    Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation andQ-learning.Machine learning, 16(3):185–202

  73. [73]

    Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning.Automatica, 35(11):1799–1808

  74. [74]

    Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharpℓ∞- bounds forQ-learning.Preprint arXiv:1905.06265

  75. [75]

    Wainwright, M. J. (2019b). Variance-reducedQ-learning is minimax optimal.Preprint arXiv:1906.04697

  76. [76]

    Wan, Y., Naik, A., and Sutton, R. (2021a). Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769

  77. [77]

    Learningandplanninginaverage-rewardMarkovdecision processes

    Wan,Y.,Naik,A.,andSutton,R.S.(2021b). Learningandplanninginaverage-rewardMarkovdecision processes. InInternational Conference on Machine Learning, pages 10653–10662. PMLR

  78. [78]

    Wan, Y., Yu, H., and Sutton, R. S. (2024). On convergence of average-rewardQ-learning in weakly communicating Markov decision processes.Preprint arXiv:2408.16262

  79. [79]

    Wang, S., Blanchet, J., and Glynn, P. (2024). Optimal sample complexity for average reward Markov decision processes. InThe Twelfth International Conference on Learning Representations

  80. [80]

    Watkins, C. J. and Dayan, P. (1992).Q-learning.Machine learning, 8(3-4):279–292

Showing first 80 references.