A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants
read the original abstract
This paper develops an unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as \textit{Markovian Stochastic Approximation} (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in (Sutton, 1999).
This paper has not been read by Pith yet.
Forward citations
Cited by 10 Pith papers
-
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...
-
Sign-Separated Finite-Time Error Analysis of Q-Learning
Sign-separated analysis decomposes Q-learning errors into negative parts dominated by an optimal-policy LTI system and positive parts controlled by a switching system, yielding finite-time bounds for deterministic and...
-
A Switching System Theory of Q-Learning with Linear Function Approximation
Q-learning with linear function approximation is recast as a switched linear system whose mean dynamics converge precisely when the joint spectral radius of the switching matrices is less than one.
-
A Switching System Theory of Q-Learning with Linear Function Approximation
Derives an exact linear switched model for the mean dynamics of Q-learning with linear function approximation and relates convergence to joint spectral radius stability of the switched system, extending the view to st...
-
Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Derives contraction-based Q-value extensions for exponential utility and proves almost-sure convergence of two-timescale and one-timescale model-free algorithms in discounted MDPs.
-
Lyapunov-Certified Direct Switching Theory for Q-Learning
Q-learning error is recast as a switched linear recursion whose exponential rate is exactly the joint spectral radius of a direct switching family, yielding finite-time bounds via a product-defined Lyapunov function.
-
Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games
Provides the first finite-time convergence guarantees for Q-value iteration in general-sum Stackelberg Markov games.
-
Lyapunov-Certified Direct Switching Theory for Q-Learning
Q-learning convergence rates can be characterized exactly through the joint spectral radius of a stochastic switching linear system representation of the error dynamics.
-
Central Limit Theorems for Asynchronous Averaged Q-Learning
Establishes non-asymptotic and functional central limit theorems for asynchronous averaged Q-learning with explicit rates depending on iterations, state-action space, discount factor, and exploration quality.
-
Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms
Unified ODE convergence analysis for smooth Q-learning variants via p-norm Lyapunov functions, valid even when the Bellman operator is not a contraction.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.