A Robust $\widetilde{\mathcal{O}}(1/\sqrt{T})$ Rate for Unprojected TD Learning with Linear Function Approximation

Francesco Orabona; Wei-Cheng Lee

arxiv: 2506.01052 · v3 · pith:XCEBT2SSnew · submitted 2025-06-01 · 💻 cs.LG · math.OC· stat.ML

A Robust widetilde{mathcal{O}}(1/sqrt{T}) Rate for Unprojected TD Learning with Linear Function Approximation

Wei-Cheng Lee , Francesco Orabona This is my paper

classification 💻 cs.LG math.OCstat.ML

keywords learningconvergencefunctionrateadditionalapproximationboundedcondition

0 comments

read the original abstract

We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone of reinforcement learning. We are interested in the so-called ``robust'' setting, where the convergence guarantee does not depend on the potential function's minimal curvature. While prior work has established convergence guarantees in this setting, these results typically rely on the artificial assumption that each iterate is projected onto a bounded set. Removing such a condition was left as an open problem by Bhandari et al. (COLT'18), hypothesizing the need for additional ``regularity conditions''. In this paper, we show that the simple unprojected TD(0) converges with a rate of $\widetilde{\mathcal{O}}\left(\frac{\|\theta^*\|^2_2}{\sqrt{T}}\right)$ in expectation, even in the presence of Markovian noise. We do not require an additional regularity condition, but only a minor polylog correction to the learning rate. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
cs.LG 2025-09 unverdicted novelty 6.0

A novel robust asynchronous Q-learning algorithm achieves finite-time convergence rates that match clean-data bounds up to an additive term proportional to the corruption fraction, with a matching information-theoreti...