Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Martin J. Wainwright

Stochastic approximation with cone-contractive operators: Sharp ell_infty-bounds for Q-learning

Not yet reviewed by Pith; the record is open.

Re-run · record.json Download PDF Read on arXiv ↗

This paper has not been read by Pith yet. Machine review is queued; the pith claim, tier, and objections will appear here once it completes.

SPECIMEN: schema-true, not a live event

T0 review · schema-true

One-sentence machine reading of the paper's core claim.

pith:XXXXXXXX · record.json · timestamp

arxiv 1905.06265 v2 pith:BRACNNQI submitted 2019-05-15 cs.LG math.OCstat.ML

Stochastic approximation with cone-contractive operators: Sharp ell_infty-bounds for Q-learning

Martin J. Wainwright This is my paper

classification cs.LG math.OCstat.ML

keywords boundslearninginftyresultsapproximationerrorgeneralnon-asymptotic

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

0 comments

read the original abstract

Motivated by the study of $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds in application to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular by deriving non-asymptotic bounds on the $\ell_\infty$-norm for a range of stepsizes. These results are the sharpest known to date, and we show via simulation that the dependence of our bounds cannot be improved in a worst-case sense. These results show that relative to a model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.

discussion (0)

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sign-Separated Asymmetric Finite-Time Error Analysis of Q-Learning
cs.AI 2026-05 unverdicted novelty 7.0

Sign-separated analysis decomposes Q-learning errors into negative parts dominated by an optimal-policy LTI system and positive parts controlled by a switching system, yielding finite-time bounds for deterministic and...
Lyapunov-Certified Direct Switching Theory for Q-Learning
cs.LG 2026-04 unverdicted novelty 7.0

Q-learning error is recast as a switched linear recursion whose exponential rate is exactly the joint spectral radius of a direct switching family, yielding finite-time bounds via a product-defined Lyapunov function.
Gaussian Approximation for Asynchronous Q-learning
stat.ML 2026-04 unverdicted novelty 7.0

Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.
A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
cs.LG 2025-10 unverdicted novelty 7.0

Establishes last-iterate convergence rates for on-policy Q-learning under minimal irreducibility assumptions, with sample complexity O(1/ξ²) matching off-policy up to exploration factors.
From Set Convergence to Pointwise Convergence: Finite-Time Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes
cs.LG 2025-04 unverdicted novelty 7.0

Establishes Õ(1/k) mean-square last-iterate convergence for asynchronous average-reward Q-learning with adaptive stepsizes and proves adaptivity is necessary.
Sign-Separated Asymmetric Finite-Time Error Analysis of Q-Learning
cs.AI 2026-05 accept novelty 6.0

Sign-separated finite-time bounds for constant-step-size Q-learning certify a slower exponential envelope for positive errors than for negative errors, reflecting max-induced overestimation.
Lyapunov-Certified Direct Switching Theory for Q-Learning
cs.LG 2026-04 unverdicted novelty 6.0

Q-learning convergence rates can be characterized exactly through the joint spectral radius of a stochastic switching linear system representation of the error dynamics.
Central Limit Theorems for Asynchronous Averaged Q-Learning
cs.LG 2025-09 unverdicted novelty 6.0

Establishes non-asymptotic and functional central limit theorems for asynchronous averaged Q-learning with explicit rates depending on iterations, state-action space, discount factor, and exploration quality.
Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
cs.LG 2025-09 unverdicted novelty 6.0

A novel robust asynchronous Q-learning algorithm achieves finite-time convergence rates that match clean-data bounds up to an additive term proportional to the corruption fraction, with a matching information-theoreti...
Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms
cs.LG 2024-04 unverdicted novelty 5.0

Unified ODE convergence analysis for smooth Q-learning variants via p-norm Lyapunov functions, valid even when the Bellman operator is not a contraction.
Non-Asymptotic Convergence of Stochastic Iterative Algorithms: A Lyapunov Framework
cs.LG 2026-05 unverdicted novelty 2.0

A survey of Lyapunov techniques using generalized Moreau envelopes as universal functions for non-asymptotic mean-square convergence analysis of stochastic iterative algorithms under contractive operators.
Mathematical methods of reinforcement learning
math.OC 2026-07 accept

A survey unifying the operator-theoretic, probabilistic, and optimization-based mathematical structures underlying modern reinforcement learning algorithms.