Stochastic approximation with cone-contractive operators: Sharp ell_infty-bounds for Q-learning

Martin J. Wainwright

Authors on Pith no claims yet

classification 💻 cs.LG math.OCstat.ML

keywords boundslearninginftyresultsapproximationerrorgeneralnon-asymptotic

read the original abstract

Motivated by the study of $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds in application to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular by deriving non-asymptotic bounds on the $\ell_\infty$-norm for a range of stepsizes. These results are the sharpest known to date, and we show via simulation that the dependence of our bounds cannot be improved in a worst-case sense. These results show that relative to a model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lyapunov-Certified Direct Switching Theory for Q-Learning
cs.LG 2026-04 unverdicted novelty 7.0

Q-learning error is recast as a switched linear recursion whose exponential rate is exactly the joint spectral radius of a direct switching family, yielding finite-time bounds via a product-defined Lyapunov function.
Gaussian Approximation for Asynchronous Q-learning
stat.ML 2026-04 unverdicted novelty 7.0

Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.