A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
Pith reviewed 2026-05-18 05:43 UTC · model grok-4.3
The pith
Q-learning with time-varying policies converges to the optimal Q-function at rate O(1/ξ²) under only irreducibility assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the sole assumption that an irreducible policy exists, Q-learning with rapidly changing learning policies satisfies a last-iterate bound E[||Q_k - Q*||_∞²] that yields sample complexity O(1/ξ²) to make E[||Q_k - Q*||_∞] ≤ ξ; the same framework produces a finite-time bound on E[||Q^{π_k} - Q*||_∞²] that makes the exploration-exploitation tension explicit.
What carries the argument
A Poisson-equation decomposition of the time-inhomogeneous Markovian noise under a lazy transition kernel, which splits the noise into a martingale-difference sequence plus controllable residual terms whose size is bounded by sensitivity of the Poisson solution to changes in both the Q-estimate and the policy.
If this is right
- The sample complexity matches the best known off-policy Q-learning rate in the accuracy parameter ξ but carries a worse dependence on exploration constants.
- The current policy's value function also converges, so exploitation improves automatically as the policy approaches optimality.
- The same Poisson-equation technique can be applied to other single-timescale algorithms that use time-varying policies, such as actor-critic methods.
- Numerical experiments on standard test MDPs confirm that the predicted rates appear in practice.
Where Pith is reading between the lines
- In environments where maintaining a separate exploratory policy is costly, on-policy Q-learning may become preferable once the exploitation advantage is quantified.
- The minimal-assumption result suggests that similar last-iterate bounds could be derived for other on-policy temporal-difference methods without invoking uniform ergodicity.
- If the irreducibility condition fails, one could test whether adding a small amount of forced exploration restores the rate without changing the core algorithm.
Load-bearing premise
There exists at least one policy whose induced Markov chain can reach every state from every other state.
What would settle it
Run the algorithm on a finite MDP whose only irreducible policies have extremely poor mixing times and check whether the observed squared infinity-norm error still decays at the claimed 1/k rate.
Figures
read the original abstract
In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript provides the first finite-time analysis of Q-learning with time-varying on-policy sampling for discounted MDPs, under the sole assumption that there exists at least one policy inducing an irreducible Markov chain. It establishes a last-iterate rate on E[||Q_k - Q^*||_∞²] that yields an O(1/ξ²) sample complexity to reach E[||Q_k - Q^*||_∞] ≤ ξ, matching off-policy Q-learning up to worse dependence on exploration parameters. A parallel rate is derived for E[||Q^{π_k} - Q^*||_∞²]. The analysis decomposes time-inhomogeneous Markovian noise via a Poisson equation under a lazy transition matrix, separating a martingale-difference term from residuals controlled by sensitivity of the Poisson solution to both the current Q-estimate and the evolving policy π_k. Numerical experiments are included to support the rates.
Significance. If the central rates hold, the result is significant for extending finite-time guarantees to on-policy Q-learning under minimal exploration assumptions, explicitly trading off weaker exploration against the exploitation benefit of policy improvement. The Poisson-equation decomposition and sensitivity analysis constitute a reusable technique for time-varying policies that could apply to single-timescale actor-critic methods and learning-in-games settings. The paper earns credit for the minimal assumption, the explicit last-iterate squared-error rate, and the reproducible numerical validation.
major comments (2)
- [§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.
- [Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.
minor comments (2)
- [Notation section] The definition of the lazy transition matrix is introduced only in §3; moving a brief reminder to the notation section would improve readability for readers focused on the main theorem.
- [Figure 2] Figure 2 caption does not indicate which curves correspond to different values of the exploration parameter; adding this would make the empirical support for the theory clearer.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The major comments correctly identify that our sensitivity analysis requires a uniform bound on the irreducibility gap for the time-varying policies, which is not guaranteed by the sole existence of one irreducible policy. We will revise the manuscript to include a mild uniform exploration assumption on the sequence of policies to close the analysis rigorously. We respond to each comment below.
read point-by-point responses
-
Referee: [§4.2, Lemma 4.3] §4.2, the sensitivity bound on the Poisson solution (Lemma 4.3): the claimed uniform Lipschitz constant with respect to π_k relies on a fixed irreducibility gap, but the minimal assumption only guarantees existence of one irreducible policy; as π_k converges to a nearly deterministic optimum the induced chain can have arbitrarily small transition probabilities, so the Poisson-solution norm and its policy derivative may grow with k. This directly affects the size of the residual terms in the noise decomposition and is load-bearing for obtaining the stated O(1/k) decay.
Authors: We agree with this assessment. The current statement of Lemma 4.3 does not explicitly ensure uniformity of the Lipschitz constant as π_k evolves toward a deterministic policy. In the revision we will add the assumption that the sequence of learning policies {π_k} satisfies a uniform lower bound δ > 0 on the smallest positive transition probability (equivalently, a uniform spectral gap for the lazy chains). The laziness parameter λ and this δ together yield a k-independent bound on the Poisson solution and its derivatives with respect to both Q and π. We will rewrite the lemma statement and proof to make the dependence on δ explicit and add a short discussion noting that this assumption remains weaker than the fixed exploration distribution required by off-policy analyses. revision: yes
-
Referee: [Theorem 5.1] Theorem 5.1, the recursion for E[||Q_k - Q^*||_∞²]: the proof closes the recursion by absorbing the residual into a 1/k term, but without a k-uniform bound on the sensitivity constants the absorption step appears to require an additional assumption on the minimal exploration probability that is not stated in the theorem hypotheses.
Authors: We agree that the absorption step in the proof of Theorem 5.1 relies on k-uniform control of the sensitivity constants. With the revised Lemma 4.3 that incorporates the uniform δ, the residual terms remain O(1/k) with a constant depending on δ but independent of k. We will expand the proof to display this absorption explicitly, state the dependence of the final rate on δ, and update the theorem hypotheses and sample-complexity corollary accordingly. No other changes to the main claims are required. revision: yes
Circularity Check
No significant circularity; derivation relies on new Poisson decomposition under stated MDP assumptions
full rationale
The paper derives last-iterate rates for E[||Q_k - Q*||_∞²] via a Poisson-equation decomposition of time-inhomogeneous Markov noise into martingale differences plus controlled residuals, with sensitivity bounds on the Poisson solution w.r.t. Q-estimates and π_k. This is presented as a technical contribution under the single minimal assumption of one irreducible policy. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the rates are expressed directly in terms of discount factor, irreducibility constants, and problem parameters without renaming or smuggling prior results as new predictions. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms... sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Assumption 3.1. There exists a policy π_b such that the Markov chain {S_k} induced by π_b is irreducible.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions
Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.
Reference graph
Works this paper leans on
-
[1]
Richard S Sutton and Andrew G Barto.Reinforcement Learning: An Introduction. MIT press, 2018
work page 2018
-
[2]
Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354, 2017
work page 2017
-
[3]
End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016
SergeyLevine, ChelseaFinn, TrevorDarrell, andPieterAbbeel. End-to-endtrainingofdeepvisuomotor policies.Journal of Machine Learning Research, 17(39):1–40, 2016
work page 2016
-
[4]
Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput
M.MehdiAfsar,TraffordCrump,andBehrouzFar. Reinforcementlearningbasedrecommendersystems: A survey.ACM Comput. Surv., 55(7), December 2022. ISSN 0360-0300. doi: 10.1145/3543846. URL https://doi.org/10.1145/3543846
-
[5]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
work page 2022
-
[6]
Q-learning.Machine learning, 8(3-4):279–292, 1992
Christopher JCH Watkins and Peter Dayan. Q-learning.Machine learning, 8(3-4):279–292, 1992
work page 1992
-
[7]
A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951
Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, pages 400–407, 1951
work page 1951
-
[8]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and DemisHassabis. Human-levelcontrolthroughdeepreinforcementlearnin...
work page 2015
-
[9]
Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994
John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning.Machine learning, 16(3): 185–202, 1994
work page 1994
-
[10]
Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning.SIAM Journal on Control and Optimization, 38(2):447–469, 2000
work page 2000
-
[11]
Vivek S Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009
work page 2009
-
[12]
The asymptotic convergence-rate of q-learning
Csaba Szepesvári. The asymptotic convergence-rate of q-learning. InAdvances in Neural Information Processing Systems, pages 1064–1070, 1998
work page 1998
-
[13]
C.L. Beck and R. Srikant. Error bounds for constant step-size Q-learning.Systems & Control Letters, 61 (12):1203–1208, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.08.014. URL https://www.sciencedirect.com/science/article/pii/S016769111200179X
-
[14]
Carolyn L. Beck and R. Srikant. Improved upper bounds on the expected error in constant step-size Q-learning. In2013 American Control Conference, pages 1926–1931, 2013. doi: 10.1109/ACC.2013. 6580117
-
[15]
Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation.Operations Research, 72(4): 1352–1367, 2024
work page 2024
-
[16]
Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach.IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024
work page 2024
-
[17]
Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ∞-bounds for 𝑄-learning.Preprint arXiv:1905.06265, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[18]
Variance-reduced q-learning is minimax optimal.arXiv preprint arXiv:1906.04697, 2019
Martin J Wainwright. Variance-reduced q-learning is minimax optimal.Preprint arXiv:1906.04697, 2019
-
[19]
Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003
Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning.Journal of Machine Learning Research, 5(Dec):1–25, 2003
work page 2003
-
[20]
Finite-time analysis of asynchronous stochastic approximation and Q-learning
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. InConference on Learning Theory, pages 3185–3205. PMLR, 2020. 20
work page 2020
-
[21]
Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction
Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction. InAdvances in Neural Information Processing Systems, volume 33, pages 7031–7043. Curran Associates, Inc., 2020
work page 2020
-
[22]
A statistical analysis of polyak-ruppert averaged q-learning
Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged q-learning. InInternational Conference on Artificial Intelligence and Statistics, pages 2207–2261. PMLR, 2023
work page 2023
-
[23]
Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J. Kappen. Dynamic policy programming.J. Mach. Learn. Res., 13(1):3207–3245, November 2012. ISSN 1532-4435
work page 2012
-
[24]
Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018
work page 2018
-
[25]
Linear Q-le arning does not diverge: Convergence rates to a bounded set
Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear q-learning does not diverge inL2: Convergence rates to a bounded set.Preprint arXiv:2501.19254, 2025
-
[26]
Deep reinforcement learning with double q-learning
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. InProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, page 2094–2100. AAAI Press, 2016
work page 2094
-
[27]
Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax
MichelTokicandGüntherPalm. Value-differencebasedexploration: Adaptivecontrolbetween 𝜖-greedy and softmax. InAnnual conference on artificial intelligence, pages 335–346. Springer, 2011
work page 2011
-
[28]
Dueling network architectures for deep reinforcement learning
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. InInternational conference on machine learning, pages 1995–2003. PMLR, 2016
work page 1995
-
[29]
Finite-time error bounds for linear stochastic approximation and TD-learning
R Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD-learning. InConference on Learning Theory, pages 2803–2830, 2019
work page 2019
-
[30]
A finite-time analysis of temporal difference learning with linear function approximation
Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite-time analysis of temporal difference learning with linear function approximation. InConference On Learning Theory, pages 1691–1692, 2018
work page 2018
-
[31]
Finite-sample analysis for SARSA with linear function approximation
Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA with linear function approximation. InAdvances in Neural Information Processing Systems, pages 8668–8678, 2019
work page 2019
-
[32]
Stochastic approximation with unbounded Markovian noise: A general-purpose theorem
Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem. InInternational Conference on Artificial Intelligence and Statistics, pages 3718–3726. PMLR, 2025
work page 2025
-
[33]
Siddharth Chandak, Vivek S Borkar, and Parth Dodhia. Concentration of contractive stochastic approximation and reinforcement learning.Stochastic Systems, 12(4):411–430, 2022
work page 2022
-
[34]
Convergence of stochastic iterative dynamic programming algorithms
Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. InAdvances in neural information processing systems, pages 703–710, 1994
work page 1994
-
[35]
Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms.Advances in Neural Information Processing Systems, 33:15556–15567, 2020
work page 2020
-
[36]
Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017
AdithyaMDevrajandSeanMeyn. Zapq-learning.AdvancesinNeuralInformationProcessingSystems, 30, 2017. 21
work page 2017
-
[37]
Eric Xia, Koulik Khamaru, Martin J Wainwright, and Michael I Jordan. Instance-optimality in optimal valueestimation: Adaptivityviavariance-reducedq-learning.IEEETransactionsonInformationTheory, 2024
work page 2024
-
[38]
arXiv preprint arXiv:2401.13884 , year=
Yixuan Zhang and Qiaomin Xie. Constant stepsize q-learning: Distributional convergence, bias and extrapolation.Preprint arXiv:2401.13884, 2024
-
[39]
An analysis of reinforcement learning with function approximation
Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. InProceedings of the 25th international conference on Machine learning, pages 664–671, 2008
work page 2008
-
[40]
Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in q-learning.SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023
work page 2023
-
[41]
TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024
SeanMeyn. TheprojectedBellmanequationinreinforcementlearning.IEEETransactionsonAutomatic Control, 2024
work page 2024
-
[42]
Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond.Journal of Machine Learning Research, 26(26):1–85, 2025
work page 2025
-
[43]
Federated reinforcement learning: Linearspeedupundermarkoviansampling
Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linearspeedupundermarkoviansampling. InInternationalConferenceonMachineLearning, pages 10997–11057. PMLR, 2022
work page 2022
-
[44]
Q-learning with logarithmic regret
Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. InInternational Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021
work page 2021
-
[45]
Gavin A Rummery and Mahesan Niranjan. Online Q-learning using connectionist systems.University of Cambridge, Department of Engineering, Cambridge, UK, 37, 1994
work page 1994
-
[46]
Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms.Machine learning, 38:287–308, 2000
work page 2000
-
[47]
On the convergence of sarsa with linear function approximation
Shangtong Zhang, Remi Tachet Des Combes, and Romain Laroche. On the convergence of sarsa with linear function approximation. InInternational Conference on Machine Learning, pages 41613–41646. PMLR, 2023
work page 2023
-
[48]
Yue Frank Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite-time analysis of two time-scale actor-critic methods.Advances in Neural Information Processing Systems, 33:17617–17628, 2020
work page 2020
-
[49]
Sajad Khodadadian, Thinh T Doan, Justin Romberg, and Siva Theja Maguluri. Finite sample analysis of two-time-scale natural actor-critic algorithm.IEEE Transactions on Automatic Control, 2022
work page 2022
-
[50]
Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games.Advances in Neural Information Processing Systems, 36:75826–75883, 2023
work page 2023
-
[51]
Two-timescale Q-learning with function approximation in zero-sum stochastic games
Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, and Adam Wierman. Two-timescale Q-learning with function approximation in zero-sum stochastic games. InProceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 378, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673...
-
[52]
Martin L Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014
work page 2014
-
[53]
Dimitri P Bertsekas and John N Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996
work page 1996
-
[54]
Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales
StefanBanach. Surlesopérationsdanslesensemblesabstraitsetleurapplicationauxéquationsintégrales. Fund. math, 3(1):133–181, 1922
work page 1922
-
[55]
On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning
Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning.Preprint arXiv:1704.00805, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[56]
AmericanMathematical Soc., 2017
DavidALevinandYuvalPeres.MarkovChainsandMixingTimes,volume107. AmericanMathematical Soc., 2017
work page 2017
-
[57]
Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis
Ziyi Chen, Yi Zhou, Rong-Rong Chen, and Shaofeng Zou. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. InInternational Conference on Machine Learning, pages 3794–3834. PMLR, 2022
work page 2022
-
[58]
Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game
Ziyi Chen, Shaocong Ma, and Yi Zhou. Sample efficient stochastic policy extra-gradient algorithm for zero-sum markov game. InInternational Conference on Learning Representations, 2021
work page 2021
-
[59]
Sample complexity bounds for two timescale value-based reinforcement learningalgorithms
Tengyu Xu and Yingbin Liang. Sample complexity bounds for two timescale value-based reinforcement learningalgorithms. InInternationalConferenceonArtificialIntelligenceandStatistics, pages811–819. PMLR, 2021
work page 2021
-
[60]
Shuang Qiu, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. On finite-time convergence of actor-critic algorithm.IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021
work page 2021
-
[61]
Cambridge University Press, 2020
Tor Lattimore and Csaba Szepesvári.Bandit Algorithms. Cambridge University Press, 2020
work page 2020
-
[62]
Sharper model-free reinforcement learning for average-reward markov decision processes
Zihan Zhang and Qiaomin Xie. Sharper model-free reinforcement learning for average-reward markov decision processes. InThe Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023
work page 2023
-
[63]
Jinghan Wang, Mengdi Wang, and Lin F Yang. Near sample-optimal reduction-based policy learning for average reward mdp.Preprint arXiv:2212.00603, 2022
-
[64]
R. Douc, E. Moulines, P. Priouret, and P. Soulier.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2018. ISBN 9783319977041. URLhttps://books.google.com/books?id=QaZ-DwAAQBAJ
work page 2018
-
[65]
Peter W Glynn and Alex Infanger. Solution representations for poisson’s equation, martingale structure, and the markov chain central limit theorem.Stochastic Systems, 14(1):47–68, 2024
work page 2024
-
[66]
Jeffrey J Hunter. Generalized inverses and their application to applied probability problems.Linear Algebra and its Applications, 45:157–198, 1982
work page 1982
-
[67]
Peter W Glynn and Sean P Meyn. A liapounov bound for solutions of the poisson equation.The Annals of Probability, pages 916–931, 1996
work page 1996
-
[68]
An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025
Zaiwei Chen and Siva Theja Maguluri. An approximate policy iteration viewpoint of actor–critic algorithms.Automatica, 179:112395, 2025. ISSN 0005-1098. doi: https://doi.org/10.1016/ j.automatica.2025.112395. URL https://www.sciencedirect.com/science/article/pii/ S0005109825002894. 23
-
[69]
Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006
Abhijit Gosavi. Boundedness of iterates in Q-learning.Systems & control letters, 55(4):347–349, 2006
work page 2006
-
[70]
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model.Machine learning, 91(3):325–349, 2013
work page 2013
-
[71]
Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends®in Machine Learning, 4(2):107–194, 2012
work page 2012
-
[72]
Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017
Amir Beck.First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997. URLhttps://epubs.siam.org/doi/ abs/10.1137/1.9781611974997
-
[73]
YizhouZhang,GuannanQu,PanXu,YihengLin,ZaiweiChen,andAdamWierman. Globalconvergence of localized policy iteration in networked multi-agent reinforcement learning.Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023. 24 Appendices A Proofs of All Technical Results in Section 3 A.1 Assuming𝜋 𝑏 (𝑎|𝑠)>0for all(𝑠, 𝑎)is withou...
work page 2023
-
[74]
To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0
We first show that𝑥= Í∞ 𝑘=0 P 𝑘 𝑦/2 is well-defined; that is, the limitlim𝑘→∞ Í𝑘 𝑛=0 P 𝑛𝑦 exists and is finite. To this end, define𝑧𝑘 := Í𝑘 𝑛=0 P 𝑛𝑦 for any𝑘≥0 . We will show that the sequence{𝑧 𝑘 } is Cauchy. For any𝑘 1, 𝑘2 ≥0(assume without loss of generality that𝑘1 ≤𝑘 2), we have ∥𝑧 𝑘2 −𝑧 𝑘1 ∥∞ = 𝑘2∑︁ 𝑛=𝑘1+1 P 𝑛𝑦 ∞ ≤ 𝑘2∑︁ 𝑛=𝑘1+1 ∥P 𝑛𝑦∥ ∞ = 𝑘2∑︁ 𝑛=𝑘1+1 ...
-
[75]
For any𝑛≥0, we have ∥𝑥1 −𝑥 2∥∞ = 1 2 ∞∑︁ 𝑘=0 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 P 𝑘 1 𝑦1 − 𝑛−1∑︁ 𝑘=0 P 𝑘 2 𝑦2 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 − ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ ≤ 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 ∥∞∥𝑦 1 −𝑦 2∥∞ + 1 2 𝑛−1∑︁ 𝑘=0 ∥P 𝑘 1 − P 𝑘 2 ∥∞∥𝑦 2∥∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 1 𝑦1 ∞ + 1 2 ∞∑︁ 𝑘=𝑛 P 𝑘 2 𝑦2 ∞ . We now bound each term on the right-hand side. Since eachP𝑘 1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.