pith.

arxiv: 1101.0428 · v1 · pith:EUKYQP6R · submitted 2011-01-02 · 💻 cs.LG · cs.AI

The Local Optimality of Reinforcement Learning by Value Gradients, and its Relationship to Policy Gradient Learning

classification 💻 cs.LG cs.AI
keywords learning, value-function, function, policy, value, approximator, control, general
abstract

In this theoretical paper we are concerned with the problem of learning a value function by a smooth general function approximator, to solve a deterministic episodic control problem in a large continuous state space. It is shown that learning the gradient of the value function at every point along a trajectory generated by a greedy policy is a sufficient condition for the trajectory to be locally extremal, and often locally optimal, and we argue that this brings greater efficiency to value-function learning. This contrasts with traditional value-function learning, in which the value function must be learnt over the whole of state space. It is also proven that policy-gradient learning applied to a greedy policy on a value function produces a weight update equivalent to a value-gradient weight update. This provides a surprising connection between these two alternative paradigms of reinforcement learning, as well as a convergence proof for control problems with a value function represented by a general smooth function approximator.
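The value-gradient idea can be illustrated with a minimal sketch. This is not the paper's code. It is a hypothetical one-dimensional toy problem built on the following assumptions:

- dynamics x_{t+1} = x_t + a_t;
- a reward of -a_t² per step and a terminal reward of -x_T²;
- horizon T = 4;
- a time-indexed quadratic approximator V_t(x) = -w_t x²/2.

The greedy policy maximises r + V_{t+1}. The target gradient G'_t is the total derivative of the greedy trajectory's return with respect to x_t, accumulated backwards along that single trajectory (a λ = 1 style target). Each w_t is then nudged so that dV_t/dx matches G'_t only at the visited point x_t. The function names, learning rate and squared-error update rule are illustrative assumptions.

```python
import numpy as np

T = 4        # episode length of the hypothetical toy problem
ALPHA = 0.5  # learning rate (assumed)


def greedy_action(x, w_next):
    """argmax_a [-a**2 - 0.5 * w_next * (x + a)**2]; a maximum when 2 + w_next > 0."""
    return -w_next * x / (2.0 + w_next)


def rollout(w, x0):
    """Follow the greedy policy from x0; w[t] parameterises V_t(x) = -0.5 * w[t] * x**2."""
    xs, acts = [x0], []
    for t in range(T):
        a = greedy_action(xs[-1], w[t + 1])
        acts.append(a)
        xs.append(xs[-1] + a)  # deterministic dynamics f(x, a) = x + a
    return xs, acts


def vgl_update(w, x0, alpha=ALPHA):
    """One value-gradient step along a single greedy trajectory (w[T] stays fixed)."""
    xs, acts = rollout(w, x0)
    g_target = -2.0 * xs[T]  # gradient of the terminal reward -x_T**2
    new_w = w.copy()
    for t in reversed(range(T)):
        k = -w[t + 1] / (2.0 + w[t + 1])     # d(greedy action)/dx at step t
        dr_dx = k * (-2.0 * acts[t])         # total derivative of r = -a**2
        df_dx = 1.0 + k                      # total derivative of f = x + a
        g_target = dr_dx + df_dx * g_target  # target gradient G'_t
        g_model = -w[t] * xs[t]              # approximator gradient dV_t/dx
        # gradient descent on 0.5 * (G'_t - dV_t/dx)**2 with G'_t held fixed
        new_w[t] += alpha * (-xs[t]) * (g_target - g_model)
    return new_w


w = np.zeros(T + 1)
w[T] = 2.0  # terminal value V_T(x) = -x**2 is known, so it is not learned
for _ in range(2000):
    w = vgl_update(w, x0=1.0)
print(w)  # should approach [0.4, 0.5, 0.6667, 1.0, 2.0]
```

Because this toy problem is linear-quadratic, learning the gradient only at the four visited states is enough for the weights to approach the exact optimum w_t = 2/(T - t + 1). Under these weights the greedy trajectory moves in equal steps of -0.2, from x_0 = 1 to x_T = 0.2. For a general nonlinear approximator one would expect only local extremality, as the abstract states.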

