A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Karthikeyan Shanmugam; Sanjay Shakkottai; Siva Theja Maguluri; Zaiwei Chen

arxiv: 2102.01567 · v4 · pith:QKDDNVFSnew · submitted 2021-02-02 · 💻 cs.LG · math.OC · stat.ML

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam

classification 💻 cs.LG · math.OC · stat.ML

keywords algorithms · convergence · asynchronous · bounds · finite-sample · first · guarantees · lambda

This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian Stochastic Approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., the efficiency of bootstrapping in RL. This was first posed as an open problem in (Sutton, 1999).
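To make the "asynchronous Q-learning as Markovian SA" viewpoint concrete, here is a minimal sketch, not the paper's own code. It assumes a hypothetical 2-state, 2-action MDP (`P`, `R`, `GAMMA`), a uniform random behavior policy, and a constant step size, all chosen purely for illustration. A single trajectory drives the updates, so only one entry of the Q-table changes per step and the samples are Markovian rather than i.i.d. The fixed point of the Bellman optimality operator is computed separately so the error of the iterate can be measured.

```python
import random

# Hypothetical toy MDP (illustrative values, not from the paper).
# P[s][a] is the next-state distribution, R[s][a] a deterministic reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
GAMMA = 0.5


def q_value_iteration(P, R, gamma, iters=200):
    """Reference fixed point: iterate the Bellman optimality operator."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * max(Q[s2])
                                    for s2 in range(n_s))
              for a in range(n_a)]
             for s in range(n_s)]
    return Q


def async_q_learning(P, R, gamma, steps, alpha, seed=0):
    """Asynchronous Q-learning along one trajectory (constant step size)."""
    rng = random.Random(seed)
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    s = 0
    for _ in range(steps):
        a = rng.randrange(n_a)  # uniform behavior policy (assumption)
        s_next = rng.choices(range(n_s), weights=P[s][a])[0]
        # SA step: move only the visited entry toward a noisy Bellman target.
        target = R[s][a] + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next  # the next sample depends on the current one
    return Q


def sup_norm_error(Q, Q_ref):
    """Max-norm distance between two Q-tables."""
    return max(abs(q - r) for row, ref in zip(Q, Q_ref)
               for q, r in zip(row, ref))


if __name__ == "__main__":
    Q_star = q_value_iteration(P, R, GAMMA)
    Q_hat = async_q_learning(P, R, GAMMA, steps=50_000, alpha=0.01)
    print("sup-norm error:", sup_norm_error(Q_hat, Q_star))
```

With a constant step size the iterate does not converge exactly. It settles into a neighborhood of the fixed point whose size shrinks with `alpha`, which is the kind of behavior that finite-sample mean-square bounds quantify.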

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR 2026-05 unverdicted novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...
Sign-Separated Finite-Time Error Analysis of Q-Learning
cs.AI 2026-05 unverdicted novelty 7.0

Sign-separated analysis decomposes Q-learning errors into negative parts dominated by an optimal-policy LTI system and positive parts controlled by a switching system, yielding finite-time bounds for deterministic and...
A Switching System Theory of Q-Learning with Linear Function Approximation
cs.LG 2026-05 unverdicted novelty 7.0

Q-learning with linear function approximation is recast as a switched linear system whose mean dynamics converge precisely when the joint spectral radius of the switching matrices is less than one.
A Switching System Theory of Q-Learning with Linear Function Approximation
cs.LG 2026-05 unverdicted novelty 7.0

Derives an exact linear switched model for the mean dynamics of Q-learning with linear function approximation and relates convergence to joint spectral radius stability of the switched system, extending the view to st...
Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
cs.LG 2026-05 unverdicted novelty 7.0

Derives contraction-based Q-value extensions for exponential utility and proves almost-sure convergence of two-timescale and one-timescale model-free algorithms in discounted MDPs.
Lyapunov-Certified Direct Switching Theory for Q-Learning
cs.LG 2026-04 unverdicted novelty 7.0

Q-learning error is recast as a switched linear recursion whose exponential rate is exactly the joint spectral radius of a direct switching family, yielding finite-time bounds via a product-defined Lyapunov function.
Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games
cs.LG 2026-04 unverdicted novelty 7.0

Provides the first finite-time convergence guarantees for Q-value iteration in general-sum Stackelberg Markov games.
Lyapunov-Certified Direct Switching Theory for Q-Learning
cs.LG 2026-04 unverdicted novelty 6.0

Q-learning convergence rates can be characterized exactly through the joint spectral radius of a stochastic switching linear system representation of the error dynamics.
Central Limit Theorems for Asynchronous Averaged Q-Learning
cs.LG 2025-09 unverdicted novelty 6.0

Establishes non-asymptotic and functional central limit theorems for asynchronous averaged Q-learning with explicit rates depending on iterations, state-action space, discount factor, and exploration quality.
Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms
cs.LG 2024-04 unverdicted novelty 5.0

Unified ODE convergence analysis for smooth Q-learning variants via p-norm Lyapunov functions, valid even when the Bellman operator is not a contraction.