pith. sign in

arxiv: 2106.12729 · v1 · pith:WV7OYFZZnew · submitted 2021-06-24 · 💻 cs.LG · math.OC· stat.ML

Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

classification 💻 cs.LG math.OCstat.ML
keywords lambdaoff-policybellmanfinite-samplegeneralizedknownsamplingalgorithms
0
0 comments X
read the original abstract

In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed-point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g. $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds of these algorithms. In particular, we provide first-known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best known bounds of $Q$-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.