A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

· 2025 · cs.LG · arXiv 2510.16132

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty]\le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.

citing papers explorer

Showing 1 of 1 citing paper.

Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions cs.LG · 2026-05-13 · unverdicted · none · ref 41 · internal anchor
Single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer