An Off-policy Policy Gradient Theorem Using Emphatic Weightings

arxiv: 1811.09013 · v2 · pith:IHDA6NLV · submitted 2018-11-22 · 💻 cs.LG · stat.ML

Ehsan Imani, Eric Graves, Martha White

classification 💻 cs.LG stat.ML

keywords policy gradient theorem, off-policy, emphatic weightings

abstract

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem, which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution whereas ACE finds the optimal solution.
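The role of the emphatic weighting can be illustrated numerically. The sketch below is not the paper's ACE algorithm. It assumes a tabular softmax policy, a constant discount, uniform interest, and an off-policy objective of the form J(θ) = Σ_s d_μ(s) v_π(s). Under those assumptions the exact gradient is Σ_s m(s) Σ_a ∇π(a|s) q_π(s,a), with m = d_μ + γ P_πᵀ m. The MDP numbers, `D_MU`, and all function names are hypothetical and chosen only for illustration.

```python
import math

GAMMA = 0.9
N_S, N_A = 3, 2
# Hypothetical MDP: P[s][a][s2] transition probabilities, R[s][a] rewards.
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.1, 0.4]],
    [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, -1.0]]
D_MU = [0.5, 0.3, 0.2]  # assumed state weighting induced by the behaviour policy


def softmax_policy(theta):
    """Return pi[s][a] for tabular softmax parameters theta[s][a]."""
    pi = []
    for row in theta:
        z = [math.exp(x - max(row)) for x in row]
        pi.append([x / sum(z) for x in z])
    return pi


def state_transitions(pi):
    """State-to-state transition matrix under the target policy pi."""
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(N_A))
             for s2 in range(N_S)] for s in range(N_S)]


def values(pi, iters=2000):
    """State values v_pi by fixed-point iteration of the Bellman equation."""
    p_pi = state_transitions(pi)
    r_pi = [sum(pi[s][a] * R[s][a] for a in range(N_A)) for s in range(N_S)]
    v = [0.0] * N_S
    for _ in range(iters):
        v = [r_pi[s] + GAMMA * sum(p_pi[s][s2] * v[s2] for s2 in range(N_S))
             for s in range(N_S)]
    return v


def objective(theta):
    """Assumed off-policy objective: sum_s d_mu(s) * v_pi(s)."""
    v = values(softmax_policy(theta))
    return sum(D_MU[s] * v[s] for s in range(N_S))


def emphatic_weighting(pi, iters=2000):
    """Solve m = d_mu + gamma * P_pi^T m by fixed-point iteration."""
    p_pi = state_transitions(pi)
    m = list(D_MU)
    for _ in range(iters):
        m = [D_MU[s2] + GAMMA * sum(m[s] * p_pi[s][s2] for s in range(N_S))
             for s2 in range(N_S)]
    return m


def policy_gradient(theta, weighting):
    """sum_s weighting(s) * sum_b dpi(b|s)/dtheta * q(s,b) for a softmax policy."""
    pi = softmax_policy(theta)
    v = values(pi)
    q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
          for a in range(N_A)] for s in range(N_S)]
    # For softmax, the inner sum reduces to pi(a|s) * (q(s,a) - v(s)).
    return [[weighting[s] * pi[s][a] * (q[s][a] - v[s]) for a in range(N_A)]
            for s in range(N_S)]


def finite_difference(theta, eps=1e-6):
    """Central-difference gradient of the objective, used as a reference."""
    grad = []
    for s in range(N_S):
        row = []
        for a in range(N_A):
            up = [list(r) for r in theta]
            dn = [list(r) for r in theta]
            up[s][a] += eps
            dn[s][a] -= eps
            row.append((objective(up) - objective(dn)) / (2 * eps))
        grad.append(row)
    return grad


THETA = [[0.2, -0.1], [0.0, 0.5], [-0.3, 0.4]]
grad_emphatic = policy_gradient(THETA, emphatic_weighting(softmax_policy(THETA)))
grad_dmu_only = policy_gradient(THETA, D_MU)  # weighting by d_mu alone
grad_reference = finite_difference(THETA)
```

In this toy setting, `grad_emphatic` should agree with `grad_reference` up to numerical error, while `grad_dmu_only` should not. Weighting states by `D_MU` alone yields a different update direction, which is consistent with the abstract's claim that earlier off-policy methods can converge to the wrong solution.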
