A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
Stochastic first-order methods for average-reward Markov decision processes.Mathematics of Operations Research, 2024
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Bridging the Gap Between Average and Discounted TD Learning
A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.