A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Phalguni Nanda; Zaiwei Chen

arxiv: 2510.16132 · v2 · submitted 2025-10-17 · 💻 cs.LG · math.OC· stat.ML

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Phalguni Nanda , Zaiwei Chen This is my paper

Pith reviewed 2026-05-18 05:43 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML

keywords Q-learningfinite-time analysistime-varying policieslast-iterate convergenceon-policy samplingMarkov decision processesminimal assumptions

0 comments

The pith

Q-learning with time-varying policies converges to the optimal Q-function at rate O(1/ξ²) under only irreducibility assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a last-iterate convergence guarantee for Q-learning when the sampling policy changes at each step, showing that the expected squared maximum error to the optimal Q* shrinks fast enough to reach accuracy ξ after order 1/ξ² samples. This result holds as long as there exists at least one policy that can reach every state from every other state, removing the need for stronger mixing or coverage conditions that earlier analyses required. A sympathetic reader would care because many practical systems must learn and act with the same evolving policy rather than maintaining a separate exploratory one, and the new bound clarifies the exploration-exploitation trade-off that arises in such on-policy settings. The analysis also supplies a rate for the value of the current learning policy itself, showing how exploitation improves as the policy improves.

Core claim

Under the sole assumption that an irreducible policy exists, Q-learning with rapidly changing learning policies satisfies a last-iterate bound E[||Q_k - Q*||_∞²] that yields sample complexity O(1/ξ²) to make E[||Q_k - Q*||_∞] ≤ ξ; the same framework produces a finite-time bound on E[||Q^{π_k} - Q*||_∞²] that makes the exploration-exploitation tension explicit.

What carries the argument

A Poisson-equation decomposition of the time-inhomogeneous Markovian noise under a lazy transition kernel, which splits the noise into a martingale-difference sequence plus controllable residual terms whose size is bounded by sensitivity of the Poisson solution to changes in both the Q-estimate and the policy.

If this is right

The sample complexity matches the best known off-policy Q-learning rate in the accuracy parameter ξ but carries a worse dependence on exploration constants.
The current policy's value function also converges, so exploitation improves automatically as the policy approaches optimality.
The same Poisson-equation technique can be applied to other single-timescale algorithms that use time-varying policies, such as actor-critic methods.
Numerical experiments on standard test MDPs confirm that the predicted rates appear in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In environments where maintaining a separate exploratory policy is costly, on-policy Q-learning may become preferable once the exploitation advantage is quantified.
The minimal-assumption result suggests that similar last-iterate bounds could be derived for other on-policy temporal-difference methods without invoking uniform ergodicity.
If the irreducibility condition fails, one could test whether adding a small amount of forced exploration restores the rate without changing the core algorithm.

Load-bearing premise

There exists at least one policy whose induced Markov chain can reach every state from every other state.

What would settle it

Run the algorithm on a finite MDP whose only irreducible policies have extremely poor mixing times and check whether the observed squared infinity-norm error still decays at the claimed 1/k rate.

Figures

Figures reproduced from arXiv: 2510.16132 by Phalguni Nanda, Zaiwei Chen.

**Figure 4.** Figure 4: The exploration–exploitation tradeoff. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

read the original abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript provides the first finite-time analysis of Q-learning with time-varying on-policy sampling for discounted MDPs, under the sole assumption that there exists at least one policy inducing an irreducible Markov chain. It establishes a last-iterate rate on E[||Q_k - Q^*||_∞²] that yields an O(1/ξ²) sample complexity to reach E[||Q_k - Q^*||_∞] ≤ ξ, matching off-policy Q-learning up to worse dependence on exploration parameters. A parallel rate is derived for E[||Q^{π_k} - Q^*||_∞²]. The analysis decomposes time-inhomogeneous Markovian noise via a Poisson equation under a lazy transition matrix, separating a martingale-difference term from residuals controlled by sensitivity of the Poisson solution to both the current Q-estimate and the evolving policy π_k. Numerical experiments are included to support the rates.

Significance. If the central rates hold, the result is significant for extending finite-time guarantees to on-policy Q-learning under minimal exploration assumptions, explicitly trading off weaker exploration against the exploitation benefit of policy improvement. The Poisson-equation decomposition and sensitivity analysis constitute a reusable technique for time-varying policies that could apply to single-timescale actor-critic methods and learning-in-games settings. The paper earns credit for the minimal assumption, the explicit last-iterate squared-error rate, and the reproducible numerical validation.

major comments (2)

[§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.
[Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.

minor comments (2)

[Notation section] The definition of the lazy transition matrix is introduced only in §3; moving a brief reminder to the notation section would improve readability for readers focused on the main theorem.
[Figure 2] Figure 2 caption does not indicate which curves correspond to different values of the exploration parameter; adding this would make the empirical support for the theory clearer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The major comments correctly identify that our sensitivity analysis requires a uniform bound on the irreducibility gap for the time-varying policies, which is not guaranteed by the sole existence of one irreducible policy. We will revise the manuscript to include a mild uniform exploration assumption on the sequence of policies to close the analysis rigorously. We respond to each comment below.

read point-by-point responses

Referee: [§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.

Authors: We agree with this assessment. The current statement of Lemma 4.3 does not explicitly ensure uniformity of the Lipschitz constant as π_k evolves toward a deterministic policy. In the revision we will add the assumption that the sequence of learning policies {π_k} satisfies a uniform lower bound δ > 0 on the smallest positive transition probability (equivalently, a uniform spectral gap for the lazy chains). The laziness parameter λ and this δ together yield a k-independent bound on the Poisson solution and its derivatives with respect to both Q and π. We will rewrite the lemma statement and proof to make the dependence on δ explicit and add a short discussion noting that this assumption remains weaker than the fixed exploration distribution required by off-policy analyses. revision: yes
Referee: [Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.

Authors: We agree that the absorption step in the proof of Theorem 5.1 relies on k-uniform control of the sensitivity constants. With the revised Lemma 4.3 that incorporates the uniform δ, the residual terms remain O(1/k) with a constant depending on δ but independent of k. We will expand the proof to display this absorption explicitly, state the dependence of the final rate on δ, and update the theorem hypotheses and sample-complexity corollary accordingly. No other changes to the main claims are required. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new Poisson decomposition under stated MDP assumptions

full rationale

The paper derives last-iterate rates for E[||Q_k - Q*||_∞²] via a Poisson-equation decomposition of time-inhomogeneous Markov noise into martingale differences plus controlled residuals, with sensitivity bounds on the Poisson solution w.r.t. Q-estimates and π_k. This is presented as a technical contribution under the single minimal assumption of one irreducible policy. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the rates are expressed directly in terms of discount factor, irreducibility constants, and problem parameters without renaming or smuggling prior results as new predictions. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard MDP assumptions plus the new analytical technique of Poisson-equation decomposition for time-inhomogeneous noise; no free parameters or invented entities are introduced beyond the problem setup.

pith-pipeline@v0.9.0 · 5841 in / 1270 out tokens · 72365 ms · 2026-05-18T05:43:33.640344+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms... sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Assumption 3.1. There exists a policy π_b such that the Markov chain {S_k} induced by π_b is irreducible.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions
cs.LG 2026-05 unverdicted novelty 8.0

Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement Learning: An Introduction. MIT press, 2018

work page 2018
[2]

Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

work page 2017
[3]

End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

SergeyLevine, ChelseaFinn, TrevorDarrell, andPieterAbbeel. End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

work page 2016
[4]

Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput

M.MehdiAfsar,TraffordCrump,andBehrouzFar. Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput. Surv., 55(7), December 2022. ISSN 0360-0300. doi: 10.1145/3543846. URL https://doi.org/10.1145/3543846

work page doi:10.1145/3543846 2022
[5]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022
[6]

Q-learning.Machine learning, 8(3-4):279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3-4):279–292, 1992

work page 1992
[7]

A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

work page 1951
[8]

Rusu, Joel Veness, Marc G

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and DemisHassabis. Human-levelcontrolthroughdeepreinforcementlearnin...

work page 2015
[9]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

work page 1994
[10]

The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000
[11]

Springer, 2009

Vivek S Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009

work page 2009
[12]

The asymptotic convergence-rate of q-learning

Csaba Szepesvári. The asymptotic convergence-rate of q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

work page 1998
[13]

Beck and R

C.L. Beck and R. Srikant. Error bounds for constant step-size Q-learning.Systems & Control Letters, 61 (12):1203–1208, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.08.014. URL https://www.sciencedirect.com/science/article/pii/S016769111200179X

work page doi:10.1016/j.sysconle.2012.08.014 2012
[14]

Beck and R

Carolyn L. Beck and R. Srikant. Improved upper bounds on the expected error in constant step-size Q-learning. In2013 American Control Conference, pages 1926–1931, 2013. doi: 10.1109/ACC.2013. 6580117

work page doi:10.1109/acc.2013 1926
[15]

A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

work page 2024
[16]

Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

work page 2024
[17]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞-bounds for 𝑄-learning.Preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[18]

Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

Martin J Wainwright. Variance-reduced q-learning is minimax optimal.Preprint arXiv:1906.04697, 2019

work page arXiv 1906
[19]

Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

work page 2003
[20]

Finite-time analysis of asynchronous stochastic approximation and Q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR, 2020. 20

work page 2020
[21]

Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc., 2020

work page 2020
[22]

A statistical analysis of polyak-ruppert averaged q-learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

work page 2023
[23]

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.J. Mach. Learn. Res., 13(1):3207–3245, November 2012. ISSN 1532-4435

work page 2012
[24]

Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018
[25]

Linear Q-le arning does not diverge: Convergence rates to a bounded set

Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear q-learning does not diverge inL2: Convergence rates to a bounded set.Preprint arXiv:2501.19254, 2025

work page arXiv 2025
[26]

Deep reinforcement learning with double q-learning

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. InProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, page 2094–2100. AAAI Press, 2016

work page 2094
[27]

Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax

MichelTokicandGüntherPalm. Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax. InAnnual conference on artificial intelligence, pages 335–346. Springer, 2011

work page 2011
[28]

Dueling network architectures for deep reinforcement learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. InInternational conference on machine learning, pages 1995–2003. PMLR, 2016

work page 1995
[29]

Finite-time error bounds for linear stochastic approximation and TD-learning

R Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830, 2019

work page 2019
[30]

A finite-time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite-time analysis of temporal difference learning with linear function approximation. InConference On Learning Theory, pages 1691–1692, 2018

work page 2018
[31]

Finite-sample analysis for SARSA with linear function approximation

Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. InAdvances in Neural Information Processing Systems, pages 8668–8678, 2019

work page 2019
[32]

Stochastic approximation with unbounded Markovian noise: A general-purpose theorem

Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem. InInternational Conference on Artificial Intelligence and Statistics, pages 3718–3726. PMLR, 2025

work page 2025
[33]

Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

Siddharth Chandak, Vivek S Borkar, and Parth Dodhia. Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

work page 2022
[34]

Convergence of stochastic iterative dynamic programming algorithms

Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710, 1994

work page 1994
[35]

A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

work page 2020
[36]

Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017

AdithyaMDevrajandSeanMeyn. Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017. 21

work page 2017
[37]

Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

Eric Xia, Koulik Khamaru, Martin J Wainwright, and Michael I Jordan. Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

work page 2024
[38]

arXiv preprint arXiv:2401.13884 , year=

Yixuan Zhang and Qiaomin Xie. Constant stepsize q-learning: Distributional convergence, bias and extrapolation.Preprint arXiv:2401.13884, 2024

work page arXiv 2024
[39]

An analysis of reinforcement learning with function approximation

Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

work page 2008
[40]

Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

work page 2023
[41]

TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

SeanMeyn. TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

work page 2024
[42]

The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

work page 2025
[43]

Federated reinforcement learning: Linearspeedupundermarkoviansampling

Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linearspeedupundermarkoviansampling. InInternationalConferenceonMachineLearning, pages 10997–11057. PMLR, 2022

work page 2022
[44]

Q-learning with logarithmic regret

Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021

work page 2021
[45]

Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

Gavin A Rummery and Mahesan Niranjan. Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

work page 1994
[46]

Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

work page 2000
[47]

On the convergence of sarsa with linear function approximation

Shangtong Zhang, Remi Tachet Des Combes, and Romain Laroche. On the convergence of sarsa with linear function approximation. InInternational Conference on Machine Learning, pages 41613–41646. PMLR, 2023

work page 2023
[48]

A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

work page 2020
[49]

Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Siva Theja Maguluri. Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

work page 2022
[50]

A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

work page 2023
[51]

Two-timescale Q-learning with function approximation in zero-sum stochastic games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Two-timescale Q-learning with function approximation in zero-sum stochastic games. InProceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 378, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673...

work page doi:10.1145/3670865.3673491 2024
[52]

John Wiley & Sons, 2014

Martin L Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

work page 2014
[53]

Athena Scientific, 1996

Dimitri P Bertsekas and John N Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

work page 1996
[54]

Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales

StefanBanach. Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales. Fund. math, 3(1):133–181, 1922

work page 1922
[55]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.Preprint arXiv:1704.00805, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[56]

AmericanMathematical Soc., 2017

DavidALevinandYuvalPeres.MarkovChainsandMixingTimes,volume107. AmericanMathematical Soc., 2017

work page 2017
[57]

Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis

Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. InInternational Conference on Machine Learning, pages 3794–3834. PMLR, 2022

work page 2022
[58]

Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game

Ziyi Chen, Shaocong Ma, and Yi Zhou. Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game. InInternational Conference on Learning Representations, 2021

work page 2021
[59]

Sample complexity bounds for two timescale value-based reinforcement learningalgorithms

Tengyu Xu and Yingbin Liang. Sample complexity bounds for two timescale value-based reinforcement learningalgorithms. InInternationalConferenceonArtificialIntelligenceandStatistics, pages811–819. PMLR, 2021

work page 2021
[60]

On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

work page 2021
[61]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

work page 2020
[62]

Sharper model-free reinforcement learning for average-reward markov decision processes

Zihan Zhang and Qiaomin Xie. Sharper model-free reinforcement learning for average-reward markov decision processes. InThe Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023

work page 2023
[63]

Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

Jinghan Wang, Mengdi Wang, and Lin F Yang. Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

work page arXiv 2022
[64]

R. Douc, E. Moulines, P. Priouret, and P. Soulier.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2018. ISBN 9783319977041. URLhttps://books.google.com/books?id=QaZ-DwAAQBAJ

work page 2018
[65]

Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

Peter W Glynn and Alex Infanger. Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

work page 2024
[66]

Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

Jeffrey J Hunter. Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

work page 1982
[67]

A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

Peter W Glynn and Sean P Meyn. A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

work page 1996
[68]

An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025

Zaiwei Chen and Siva Theja Maguluri. An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025. ISSN 0005-1098. doi: https://doi.org/10.1016/ j.automatica.2025.112395. URL https://www.sciencedirect.com/science/article/pii/ S0005109825002894. 23

work page arXiv 2025
[69]

Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

Abhijit Gosavi. Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

work page 2006
[70]

Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

work page 2013
[71]

Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

work page 2012
[72]

Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017

Amir Beck.First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997. URLhttps://epubs.siam.org/doi/ abs/10.1137/1.9781611974997

work page doi:10.1137/1.9781611974997 2017
[73]

YizhouZhang,GuannanQu,PanXu,YihengLin,ZaiweiChen,andAdamWierman. Globalconvergence of localized policy iteration in networked multi-agent reinforcement learning.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023. 24 Appendices A Proofs of All Technical Results in Section 3 A.1 Assuming𝜋 𝑏 (𝑎|𝑠)>0for all(𝑠, 𝑎)is withou...

work page 2023
[74]

To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0

We first show that𝑥= Í∞ 𝑘=0 P 𝑘 𝑦/2 is well-defined; that is, the limitlim𝑘→∞ Í𝑘 𝑛=0 P 𝑛𝑦 exists and is finite. To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0 . We will show that the sequence{𝑧 𝑘 } is Cauchy. For any𝑘 1, 𝑘2 ≥0(assume without loss of generality that𝑘1 ≤𝑘 2), we have ∥𝑧 𝑘2 −𝑧 𝑘1 ∥∞ = 𝑘2∑︁ 𝑛=𝑘1+1 P 𝑛𝑦 ∞ ≤ 𝑘2∑︁ 𝑛=𝑘1+1 ∥P 𝑛𝑦∥ ∞ = 𝑘2∑︁ 𝑛=𝑘1+1 ...

work page
[75]

𝑟𝑏∑︁ 𝑖=0 𝑟𝑏 +1 𝑖+1 𝑃𝑖 𝜋𝑏 (𝑠 ′′, 𝑠′) # 𝜋𝑏 (𝑎 ′|𝑠 ′) (Change of variable:𝑖=𝑗−1) = 1 2𝑟𝑏+1 ∑︁ 𝑠′′ ∈ S 𝑝(𝑠 ′′ |𝑠, 𝑎)

For any𝑛≥0, we have ∥𝑥1 −𝑥 2∥∞ = 1 2 ∞∑︁ 𝑘=0 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 P 𝑘 1 𝑦1 − 𝑛−1∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 ∥∞∥𝑦 1 −𝑦 2∥∞ + 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 − P 𝑘 2 ∥∞∥𝑦 2∥∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ . We now bound each term on the right-hand side. Since eachP𝑘 1...

work page

[1] [1]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement Learning: An Introduction. MIT press, 2018

work page 2018

[2] [2]

Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017

work page 2017

[3] [3]

End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

SergeyLevine, ChelseaFinn, TrevorDarrell, andPieterAbbeel. End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016

work page 2016

[4] [4]

Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput

M.MehdiAfsar,TraffordCrump,andBehrouzFar. Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput. Surv., 55(7), December 2022. ISSN 0360-0300. doi: 10.1145/3543846. URL https://doi.org/10.1145/3543846

work page doi:10.1145/3543846 2022

[5] [5]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page 2022

[6] [6]

Q-learning.Machine learning, 8(3-4):279–292, 1992

Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3-4):279–292, 1992

work page 1992

[7] [7]

A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951

work page 1951

[8] [8]

Rusu, Joel Veness, Marc G

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and DemisHassabis. Human-levelcontrolthroughdeepreinforcementlearnin...

work page 2015

[9] [9]

Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994

work page 1994

[10] [10]

The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000

work page 2000

[11] [11]

Springer, 2009

Vivek S Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009

work page 2009

[12] [12]

The asymptotic convergence-rate of q-learning

Csaba Szepesvári. The asymptotic convergence-rate of q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998

work page 1998

[13] [13]

Beck and R

C.L. Beck and R. Srikant. Error bounds for constant step-size Q-learning.Systems & Control Letters, 61 (12):1203–1208, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.08.014. URL https://www.sciencedirect.com/science/article/pii/S016769111200179X

work page doi:10.1016/j.sysconle.2012.08.014 2012

[14] [14]

Beck and R

Carolyn L. Beck and R. Srikant. Improved upper bounds on the expected error in constant step-size Q-learning. In2013 American Control Conference, pages 1926–1931, 2013. doi: 10.1109/ACC.2013. 6580117

work page doi:10.1109/acc.2013 1926

[15] [15]

A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024

work page 2024

[16] [16]

Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024

work page 2024

[17] [17]

Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞-bounds for 𝑄-learning.Preprint arXiv:1905.06265, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[18] [18]

Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019

Martin J Wainwright. Variance-reduced q-learning is minimax optimal.Preprint arXiv:1906.04697, 2019

work page arXiv 1906

[19] [19]

Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003

work page 2003

[20] [20]

Finite-time analysis of asynchronous stochastic approximation and Q-learning

Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR, 2020. 20

work page 2020

[21] [21]

Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction

Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc., 2020

work page 2020

[22] [22]

A statistical analysis of polyak-ruppert averaged q-learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023

work page 2023

[23] [23]

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.J. Mach. Learn. Res., 13(1):3207–3245, November 2012. ISSN 1532-4435

work page 2012

[24] [24]

Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018

work page 2018

[25] [25]

Linear Q-le arning does not diverge: Convergence rates to a bounded set

Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear q-learning does not diverge inL2: Convergence rates to a bounded set.Preprint arXiv:2501.19254, 2025

work page arXiv 2025

[26] [26]

Deep reinforcement learning with double q-learning

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. InProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, page 2094–2100. AAAI Press, 2016

work page 2094

[27] [27]

Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax

MichelTokicandGüntherPalm. Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax. InAnnual conference on artificial intelligence, pages 335–346. Springer, 2011

work page 2011

[28] [28]

Dueling network architectures for deep reinforcement learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. InInternational conference on machine learning, pages 1995–2003. PMLR, 2016

work page 1995

[29] [29]

Finite-time error bounds for linear stochastic approximation and TD-learning

R Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830, 2019

work page 2019

[30] [30]

A finite-time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite-time analysis of temporal difference learning with linear function approximation. InConference On Learning Theory, pages 1691–1692, 2018

work page 2018

[31] [31]

Finite-sample analysis for SARSA with linear function approximation

Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. InAdvances in Neural Information Processing Systems, pages 8668–8678, 2019

work page 2019

[32] [32]

Stochastic approximation with unbounded Markovian noise: A general-purpose theorem

Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem. InInternational Conference on Artificial Intelligence and Statistics, pages 3718–3726. PMLR, 2025

work page 2025

[33] [33]

Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

Siddharth Chandak, Vivek S Borkar, and Parth Dodhia. Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022

work page 2022

[34] [34]

Convergence of stochastic iterative dynamic programming algorithms

Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710, 1994

work page 1994

[35] [35]

A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020

work page 2020

[36] [36]

Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017

AdithyaMDevrajandSeanMeyn. Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017. 21

work page 2017

[37] [37]

Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

Eric Xia, Koulik Khamaru, Martin J Wainwright, and Michael I Jordan. Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024

work page 2024

[38] [38]

arXiv preprint arXiv:2401.13884 , year=

Yixuan Zhang and Qiaomin Xie. Constant stepsize q-learning: Distributional convergence, bias and extrapolation.Preprint arXiv:2401.13884, 2024

work page arXiv 2024

[39] [39]

An analysis of reinforcement learning with function approximation

Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008

work page 2008

[40] [40]

Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023

work page 2023

[41] [41]

TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

SeanMeyn. TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024

work page 2024

[42] [42]

The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025

work page 2025

[43] [43]

Federated reinforcement learning: Linearspeedupundermarkoviansampling

Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linearspeedupundermarkoviansampling. InInternationalConferenceonMachineLearning, pages 10997–11057. PMLR, 2022

work page 2022

[44] [44]

Q-learning with logarithmic regret

Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021

work page 2021

[45] [45]

Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

Gavin A Rummery and Mahesan Niranjan. Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994

work page 1994

[46] [46]

Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000

work page 2000

[47] [47]

On the convergence of sarsa with linear function approximation

Shangtong Zhang, Remi Tachet Des Combes, and Romain Laroche. On the convergence of sarsa with linear function approximation. InInternational Conference on Machine Learning, pages 41613–41646. PMLR, 2023

work page 2023

[48] [48]

A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020

work page 2020

[49] [49]

Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Siva Theja Maguluri. Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022

work page 2022

[50] [50]

A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023

work page 2023

[51] [51]

Two-timescale Q-learning with function approximation in zero-sum stochastic games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Two-timescale Q-learning with function approximation in zero-sum stochastic games. InProceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 378, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673...

work page doi:10.1145/3670865.3673491 2024

[52] [52]

John Wiley & Sons, 2014

Martin L Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

work page 2014

[53] [53]

Athena Scientific, 1996

Dimitri P Bertsekas and John N Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

work page 1996

[54] [54]

Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales

StefanBanach. Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales. Fund. math, 3(1):133–181, 1922

work page 1922

[55] [55]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.Preprint arXiv:1704.00805, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[56] [56]

AmericanMathematical Soc., 2017

DavidALevinandYuvalPeres.MarkovChainsandMixingTimes,volume107. AmericanMathematical Soc., 2017

work page 2017

[57] [57]

Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis

Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. InInternational Conference on Machine Learning, pages 3794–3834. PMLR, 2022

work page 2022

[58] [58]

Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game

Ziyi Chen, Shaocong Ma, and Yi Zhou. Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game. InInternational Conference on Learning Representations, 2021

work page 2021

[59] [59]

Sample complexity bounds for two timescale value-based reinforcement learningalgorithms

Tengyu Xu and Yingbin Liang. Sample complexity bounds for two timescale value-based reinforcement learningalgorithms. InInternationalConferenceonArtificialIntelligenceandStatistics, pages811–819. PMLR, 2021

work page 2021

[60] [60]

On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021

work page 2021

[61] [61]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020

work page 2020

[62] [62]

Sharper model-free reinforcement learning for average-reward markov decision processes

Zihan Zhang and Qiaomin Xie. Sharper model-free reinforcement learning for average-reward markov decision processes. InThe Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023

work page 2023

[63] [63]

Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

Jinghan Wang, Mengdi Wang, and Lin F Yang. Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022

work page arXiv 2022

[64] [64]

R. Douc, E. Moulines, P. Priouret, and P. Soulier.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2018. ISBN 9783319977041. URLhttps://books.google.com/books?id=QaZ-DwAAQBAJ

work page 2018

[65] [65]

Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

Peter W Glynn and Alex Infanger. Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024

work page 2024

[66] [66]

Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

Jeffrey J Hunter. Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982

work page 1982

[67] [67]

A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

Peter W Glynn and Sean P Meyn. A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996

work page 1996

[68] [68]

An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025

Zaiwei Chen and Siva Theja Maguluri. An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025. ISSN 0005-1098. doi: https://doi.org/10.1016/ j.automatica.2025.112395. URL https://www.sciencedirect.com/science/article/pii/ S0005109825002894. 23

work page arXiv 2025

[69] [69]

Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

Abhijit Gosavi. Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006

work page 2006

[70] [70]

Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013

work page 2013

[71] [71]

Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012

work page 2012

[72] [72]

Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017

Amir Beck.First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997. URLhttps://epubs.siam.org/doi/ abs/10.1137/1.9781611974997

work page doi:10.1137/1.9781611974997 2017

[73] [73]

YizhouZhang,GuannanQu,PanXu,YihengLin,ZaiweiChen,andAdamWierman. Globalconvergence of localized policy iteration in networked multi-agent reinforcement learning.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023. 24 Appendices A Proofs of All Technical Results in Section 3 A.1 Assuming𝜋 𝑏 (𝑎|𝑠)>0for all(𝑠, 𝑎)is withou...

work page 2023

[74] [74]

To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0

We first show that𝑥= Í∞ 𝑘=0 P 𝑘 𝑦/2 is well-defined; that is, the limitlim𝑘→∞ Í𝑘 𝑛=0 P 𝑛𝑦 exists and is finite. To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0 . We will show that the sequence{𝑧 𝑘 } is Cauchy. For any𝑘 1, 𝑘2 ≥0(assume without loss of generality that𝑘1 ≤𝑘 2), we have ∥𝑧 𝑘2 −𝑧 𝑘1 ∥∞ = 𝑘2∑︁ 𝑛=𝑘1+1 P 𝑛𝑦 ∞ ≤ 𝑘2∑︁ 𝑛=𝑘1+1 ∥P 𝑛𝑦∥ ∞ = 𝑘2∑︁ 𝑛=𝑘1+1 ...

work page

[75] [75]

𝑟𝑏∑︁ 𝑖=0 𝑟𝑏 +1 𝑖+1 𝑃𝑖 𝜋𝑏 (𝑠 ′′, 𝑠′) # 𝜋𝑏 (𝑎 ′|𝑠 ′) (Change of variable:𝑖=𝑗−1) = 1 2𝑟𝑏+1 ∑︁ 𝑠′′ ∈ S 𝑝(𝑠 ′′ |𝑠, 𝑎)

For any𝑛≥0, we have ∥𝑥1 −𝑥 2∥∞ = 1 2 ∞∑︁ 𝑘=0 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 P 𝑘 1 𝑦1 − 𝑛−1∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 ∥∞∥𝑦 1 −𝑦 2∥∞ + 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 − P 𝑘 2 ∥∞∥𝑦 2∥∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ . We now bound each term on the right-hand side. Since eachP𝑘 1...

work page