A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
A finite time analysis of temporal difference learning with linear function approximation
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2verdicts
UNVERDICTED 2representative citing papers
A novel robust asynchronous Q-learning algorithm achieves finite-time convergence rates that match clean-data bounds up to an additive term proportional to the corruption fraction, with a matching information-theoretic lower bound.
citing papers explorer
-
Bridging the Gap Between Average and Discounted TD Learning
A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
-
Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
A novel robust asynchronous Q-learning algorithm achieves finite-time convergence rates that match clean-data bounds up to an additive term proportional to the corruption fraction, with a matching information-theoretic lower bound.