pith. sign in

arxiv: 2510.16132 · v2 · submitted 2025-10-17 · 💻 cs.LG · math.OC· stat.ML

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Pith reviewed 2026-05-18 05:43 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords Q-learningfinite-time analysistime-varying policieslast-iterate convergenceon-policy samplingMarkov decision processesminimal assumptions
0
0 comments X

The pith

Q-learning with time-varying policies converges to the optimal Q-function at rate O(1/ξ²) under only irreducibility assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a last-iterate convergence guarantee for Q-learning when the sampling policy changes at each step, showing that the expected squared maximum error to the optimal Q* shrinks fast enough to reach accuracy ξ after order 1/ξ² samples. This result holds as long as there exists at least one policy that can reach every state from every other state, removing the need for stronger mixing or coverage conditions that earlier analyses required. A sympathetic reader would care because many practical systems must learn and act with the same evolving policy rather than maintaining a separate exploratory one, and the new bound clarifies the exploration-exploitation trade-off that arises in such on-policy settings. The analysis also supplies a rate for the value of the current learning policy itself, showing how exploitation improves as the policy improves.

Core claim

Under the sole assumption that an irreducible policy exists, Q-learning with rapidly changing learning policies satisfies a last-iterate bound E[||Q_k - Q*||_∞²] that yields sample complexity O(1/ξ²) to make E[||Q_k - Q*||_∞] ≤ ξ; the same framework produces a finite-time bound on E[||Q^{π_k} - Q*||_∞²] that makes the exploration-exploitation tension explicit.

What carries the argument

A Poisson-equation decomposition of the time-inhomogeneous Markovian noise under a lazy transition kernel, which splits the noise into a martingale-difference sequence plus controllable residual terms whose size is bounded by sensitivity of the Poisson solution to changes in both the Q-estimate and the policy.

If this is right

  • The sample complexity matches the best known off-policy Q-learning rate in the accuracy parameter ξ but carries a worse dependence on exploration constants.
  • The current policy's value function also converges, so exploitation improves automatically as the policy approaches optimality.
  • The same Poisson-equation technique can be applied to other single-timescale algorithms that use time-varying policies, such as actor-critic methods.
  • Numerical experiments on standard test MDPs confirm that the predicted rates appear in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In environments where maintaining a separate exploratory policy is costly, on-policy Q-learning may become preferable once the exploitation advantage is quantified.
  • The minimal-assumption result suggests that similar last-iterate bounds could be derived for other on-policy temporal-difference methods without invoking uniform ergodicity.
  • If the irreducibility condition fails, one could test whether adding a small amount of forced exploration restores the rate without changing the core algorithm.

Load-bearing premise

There exists at least one policy whose induced Markov chain can reach every state from every other state.

What would settle it

Run the algorithm on a finite MDP whose only irreducible policies have extremely poor mixing times and check whether the observed squared infinity-norm error still decays at the claimed 1/k rate.

Figures

Figures reproduced from arXiv: 2510.16132 by Phalguni Nanda, Zaiwei Chen.

Figure 2
Figure 2. Figure 2: Convergence rates of [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The exploration–exploitation trade￾off. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
read the original abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript provides the first finite-time analysis of Q-learning with time-varying on-policy sampling for discounted MDPs, under the sole assumption that there exists at least one policy inducing an irreducible Markov chain. It establishes a last-iterate rate on E[||Q_k - Q^*||_∞²] that yields an O(1/ξ²) sample complexity to reach E[||Q_k - Q^*||_∞] ≤ ξ, matching off-policy Q-learning up to worse dependence on exploration parameters. A parallel rate is derived for E[||Q^{π_k} - Q^*||_∞²]. The analysis decomposes time-inhomogeneous Markovian noise via a Poisson equation under a lazy transition matrix, separating a martingale-difference term from residuals controlled by sensitivity of the Poisson solution to both the current Q-estimate and the evolving policy π_k. Numerical experiments are included to support the rates.

Significance. If the central rates hold, the result is significant for extending finite-time guarantees to on-policy Q-learning under minimal exploration assumptions, explicitly trading off weaker exploration against the exploitation benefit of policy improvement. The Poisson-equation decomposition and sensitivity analysis constitute a reusable technique for time-varying policies that could apply to single-timescale actor-critic methods and learning-in-games settings. The paper earns credit for the minimal assumption, the explicit last-iterate squared-error rate, and the reproducible numerical validation.

major comments (2)
  1. [§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.
  2. [Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.
minor comments (2)
  1. [Notation section] The definition of the lazy transition matrix is introduced only in §3; moving a brief reminder to the notation section would improve readability for readers focused on the main theorem.
  2. [Figure 2] Figure 2 caption does not indicate which curves correspond to different values of the exploration parameter; adding this would make the empirical support for the theory clearer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The major comments correctly identify that our sensitivity analysis requires a uniform bound on the irreducibility gap for the time-varying policies, which is not guaranteed by the sole existence of one irreducible policy. We will revise the manuscript to include a mild uniform exploration assumption on the sequence of policies to close the analysis rigorously. We respond to each comment below.

read point-by-point responses
  1. Referee: [§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.

    Authors: We agree with this assessment. The current statement of Lemma 4.3 does not explicitly ensure uniformity of the Lipschitz constant as π_k evolves toward a deterministic policy. In the revision we will add the assumption that the sequence of learning policies {π_k} satisfies a uniform lower bound δ > 0 on the smallest positive transition probability (equivalently, a uniform spectral gap for the lazy chains). The laziness parameter λ and this δ together yield a k-independent bound on the Poisson solution and its derivatives with respect to both Q and π. We will rewrite the lemma statement and proof to make the dependence on δ explicit and add a short discussion noting that this assumption remains weaker than the fixed exploration distribution required by off-policy analyses. revision: yes

  2. Referee: [Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.

    Authors: We agree that the absorption step in the proof of Theorem 5.1 relies on k-uniform control of the sensitivity constants. With the revised Lemma 4.3 that incorporates the uniform δ, the residual terms remain O(1/k) with a constant depending on δ but independent of k. We will expand the proof to display this absorption explicitly, state the dependence of the final rate on δ, and update the theorem hypotheses and sample-complexity corollary accordingly. No other changes to the main claims are required. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new Poisson decomposition under stated MDP assumptions

full rationale

The paper derives last-iterate rates for E[||Q_k - Q*||_∞²] via a Poisson-equation decomposition of time-inhomogeneous Markov noise into martingale differences plus controlled residuals, with sensitivity bounds on the Poisson solution w.r.t. Q-estimates and π_k. This is presented as a technical contribution under the single minimal assumption of one irreducible policy. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the rates are expressed directly in terms of discount factor, irreducibility constants, and problem parameters without renaming or smuggling prior results as new predictions. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard MDP assumptions plus the new analytical technique of Poisson-equation decomposition for time-inhomogeneous noise; no free parameters or invented entities are introduced beyond the problem setup.

pith-pipeline@v0.9.0 · 5841 in / 1270 out tokens · 72365 ms · 2026-05-18T05:43:33.640344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

    cs.LG 2026-05 unverdicted novelty 8.0

    Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    MIT press, 2018

    Richard S Sutton and Andrew G Barto.Reinforcement Learning: An Introduction. MIT press, 2018

  2. [2]

    Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

  3. [3]

    End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

    SergeyLevine, ChelseaFinn, TrevorDarrell, andPieterAbbeel. End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

  4. [4]

    Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput

    M.MehdiAfsar,TraffordCrump,andBehrouzFar. Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput. Surv., 55(7), December 2022. ISSN 0360-0300. doi: 10.1145/3543846. URL https://doi.org/10.1145/3543846

  5. [5]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  6. [6]

    Q-learning.Machine learning, 8(3-4):279–292, 1992

    Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3-4):279–292, 1992

  7. [7]

    A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

  8. [8]

    Rusu, Joel Veness, Marc G

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and DemisHassabis. Human-levelcontrolthroughdeepreinforcementlearnin...

  9. [9]

    Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

    John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

  10. [10]

    The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

    Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

  11. [11]

    Springer, 2009

    Vivek S Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009

  12. [12]

    The asymptotic convergence-rate of q-learning

    Csaba Szepesvári. The asymptotic convergence-rate of q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

  13. [13]

    Beck and R

    C.L. Beck and R. Srikant. Error bounds for constant step-size Q-learning.Systems & Control Letters, 61 (12):1203–1208, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.08.014. URL https://www.sciencedirect.com/science/article/pii/S016769111200179X

  14. [14]

    Beck and R

    Carolyn L. Beck and R. Srikant. Improved upper bounds on the expected error in constant step-size Q-learning. In2013 American Control Conference, pages 1926–1931, 2013. doi: 10.1109/ACC.2013. 6580117

  15. [15]

    A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

    Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

  16. [16]

    Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

    Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

  17. [17]

    Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

    Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞-bounds for 𝑄-learning.Preprint arXiv:1905.06265, 2019

  18. [18]

    Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

    Martin J Wainwright. Variance-reduced q-learning is minimax optimal.Preprint arXiv:1906.04697, 2019

  19. [19]

    Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

    Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

  20. [20]

    Finite-time analysis of asynchronous stochastic approximation and Q-learning

    Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR, 2020. 20

  21. [21]

    Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction

    Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc., 2020

  22. [22]

    A statistical analysis of polyak-ruppert averaged q-learning

    Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

  23. [23]

    Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.J. Mach. Learn. Res., 13(1):3207–3245, November 2012. ISSN 1532-4435

  24. [24]

    Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

    Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

  25. [25]

    Linear Q-le arning does not diverge: Convergence rates to a bounded set

    Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear q-learning does not diverge inL2: Convergence rates to a bounded set.Preprint arXiv:2501.19254, 2025

  26. [26]

    Deep reinforcement learning with double q-learning

    Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. InProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, page 2094–2100. AAAI Press, 2016

  27. [27]

    Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax

    MichelTokicandGüntherPalm. Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax. InAnnual conference on artificial intelligence, pages 335–346. Springer, 2011

  28. [28]

    Dueling network architectures for deep reinforcement learning

    Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. InInternational conference on machine learning, pages 1995–2003. PMLR, 2016

  29. [29]

    Finite-time error bounds for linear stochastic approximation and TD-learning

    R Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830, 2019

  30. [30]

    A finite-time analysis of temporal difference learning with linear function approximation

    Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite-time analysis of temporal difference learning with linear function approximation. InConference On Learning Theory, pages 1691–1692, 2018

  31. [31]

    Finite-sample analysis for SARSA with linear function approximation

    Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. InAdvances in Neural Information Processing Systems, pages 8668–8678, 2019

  32. [32]

    Stochastic approximation with unbounded Markovian noise: A general-purpose theorem

    Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem. InInternational Conference on Artificial Intelligence and Statistics, pages 3718–3726. PMLR, 2025

  33. [33]

    Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

    Siddharth Chandak, Vivek S Borkar, and Parth Dodhia. Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

  34. [34]

    Convergence of stochastic iterative dynamic programming algorithms

    Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710, 1994

  35. [35]

    A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

    Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

  36. [36]

    Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017

    AdithyaMDevrajandSeanMeyn. Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017. 21

  37. [37]

    Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

    Eric Xia, Koulik Khamaru, Martin J Wainwright, and Michael I Jordan. Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

  38. [38]

    arXiv preprint arXiv:2401.13884 , year=

    Yixuan Zhang and Qiaomin Xie. Constant stepsize q-learning: Distributional convergence, bias and extrapolation.Preprint arXiv:2401.13884, 2024

  39. [39]

    An analysis of reinforcement learning with function approximation

    Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

  40. [40]

    Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

    Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

  41. [41]

    TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

    SeanMeyn. TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

  42. [42]

    The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

    Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

  43. [43]

    Federated reinforcement learning: Linearspeedupundermarkoviansampling

    Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linearspeedupundermarkoviansampling. InInternationalConferenceonMachineLearning, pages 10997–11057. PMLR, 2022

  44. [44]

    Q-learning with logarithmic regret

    Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021

  45. [45]

    Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

    Gavin A Rummery and Mahesan Niranjan. Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

  46. [46]

    Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

    Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

  47. [47]

    On the convergence of sarsa with linear function approximation

    Shangtong Zhang, Remi Tachet Des Combes, and Romain Laroche. On the convergence of sarsa with linear function approximation. InInternational Conference on Machine Learning, pages 41613–41646. PMLR, 2023

  48. [48]

    A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

    Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

  49. [49]

    Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

    Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Siva Theja Maguluri. Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

  50. [50]

    A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

    Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

  51. [51]

    Two-timescale Q-learning with function approximation in zero-sum stochastic games

    Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Two-timescale Q-learning with function approximation in zero-sum stochastic games. InProceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 378, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673...

  52. [52]

    John Wiley & Sons, 2014

    Martin L Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

  53. [53]

    Athena Scientific, 1996

    Dimitri P Bertsekas and John N Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

  54. [54]

    Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales

    StefanBanach. Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales. Fund. math, 3(1):133–181, 1922

  55. [55]

    On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

    Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.Preprint arXiv:1704.00805, 2017

  56. [56]

    AmericanMathematical Soc., 2017

    DavidALevinandYuvalPeres.MarkovChainsandMixingTimes,volume107. AmericanMathematical Soc., 2017

  57. [57]

    Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis

    Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. InInternational Conference on Machine Learning, pages 3794–3834. PMLR, 2022

  58. [58]

    Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game

    Ziyi Chen, Shaocong Ma, and Yi Zhou. Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game. InInternational Conference on Learning Representations, 2021

  59. [59]

    Sample complexity bounds for two timescale value-based reinforcement learningalgorithms

    Tengyu Xu and Yingbin Liang. Sample complexity bounds for two timescale value-based reinforcement learningalgorithms. InInternationalConferenceonArtificialIntelligenceandStatistics, pages811–819. PMLR, 2021

  60. [60]

    On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

    Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

  61. [61]

    Cambridge University Press, 2020

    Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

  62. [62]

    Sharper model-free reinforcement learning for average-reward markov decision processes

    Zihan Zhang and Qiaomin Xie. Sharper model-free reinforcement learning for average-reward markov decision processes. InThe Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023

  63. [63]

    Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

    Jinghan Wang, Mengdi Wang, and Lin F Yang. Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

  64. [64]

    R. Douc, E. Moulines, P. Priouret, and P. Soulier.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2018. ISBN 9783319977041. URLhttps://books.google.com/books?id=QaZ-DwAAQBAJ

  65. [65]

    Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

    Peter W Glynn and Alex Infanger. Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

  66. [66]

    Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

    Jeffrey J Hunter. Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

  67. [67]

    A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

    Peter W Glynn and Sean P Meyn. A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

  68. [68]

    An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025

    Zaiwei Chen and Siva Theja Maguluri. An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025. ISSN 0005-1098. doi: https://doi.org/10.1016/ j.automatica.2025.112395. URL https://www.sciencedirect.com/science/article/pii/ S0005109825002894. 23

  69. [69]

    Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

    Abhijit Gosavi. Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

  70. [70]

    Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

    Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

  71. [71]

    Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

    Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

  72. [72]

    Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017

    Amir Beck.First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997. URLhttps://epubs.siam.org/doi/ abs/10.1137/1.9781611974997

  73. [73]

    YizhouZhang,GuannanQu,PanXu,YihengLin,ZaiweiChen,andAdamWierman. Globalconvergence of localized policy iteration in networked multi-agent reinforcement learning.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023. 24 Appendices A Proofs of All Technical Results in Section 3 A.1 Assuming𝜋 𝑏 (𝑎|𝑠)>0for all(𝑠, 𝑎)is withou...

  74. [74]

    To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0

    We first show that𝑥= Í∞ 𝑘=0 P 𝑘 𝑦/2 is well-defined; that is, the limitlim𝑘→∞ Í𝑘 𝑛=0 P 𝑛𝑦 exists and is finite. To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0 . We will show that the sequence{𝑧 𝑘 } is Cauchy. For any𝑘 1, 𝑘2 ≥0(assume without loss of generality that𝑘1 ≤𝑘 2), we have ∥𝑧 𝑘2 −𝑧 𝑘1 ∥∞ = 𝑘2∑︁ 𝑛=𝑘1+1 P 𝑛𝑦 ∞ ≤ 𝑘2∑︁ 𝑛=𝑘1+1 ∥P 𝑛𝑦∥ ∞ = 𝑘2∑︁ 𝑛=𝑘1+1 ...

  75. [75]

    𝑟𝑏∑︁ 𝑖=0 𝑟𝑏 +1 𝑖+1 𝑃𝑖 𝜋𝑏 (𝑠 ′′, 𝑠′) # 𝜋𝑏 (𝑎 ′|𝑠 ′) (Change of variable:𝑖=𝑗−1) = 1 2𝑟𝑏+1 ∑︁ 𝑠′′ ∈ S 𝑝(𝑠 ′′ |𝑠, 𝑎)

    For any𝑛≥0, we have ∥𝑥1 −𝑥 2∥∞ = 1 2 ∞∑︁ 𝑘=0 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 P 𝑘 1 𝑦1 − 𝑛−1∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 ∥∞∥𝑦 1 −𝑦 2∥∞ + 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 − P 𝑘 2 ∥∞∥𝑦 2∥∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ . We now bound each term on the right-hand side. Since eachP𝑘 1...