The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

Lex Weaver; Nigel Tao

arxiv: 1301.2315 · v1 · pith:UKOXY56Qnew · submitted 2013-01-10 · 💻 cs.LG · cs.AI · stat.ML

keywords: reward, baseline, gradient, variance, algorithms, learning, optimal, reinforcement

Abstract

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
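
The claim that a constant reward baseline changes the variance of the gradient estimate without adding bias can be checked on a toy problem. The sketch below is not the paper's algorithm: it assumes a hypothetical one-step, two-action problem with a sigmoid policy and fixed rewards, and it computes the mean and variance of the score-function estimator `(r - b) * d/dtheta log pi(a)` exactly by enumerating both actions. The names `estimator_moments`, `theta`, and `rewards` are illustrative. The average reward is used as one candidate baseline; in this one-step setting it is not necessarily the exact variance minimizer, since the abstract's result concerns the limit approaching the zero-bias parameterization.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimator_moments(theta, rewards, baseline):
    """Exact mean and variance of (r - b) * d/dtheta log pi(a).

    Assumed toy setting: pi(a=1) = sigmoid(theta), pi(a=0) = 1 - sigmoid(theta),
    and each action yields a fixed reward rewards[a].
    """
    p1 = sigmoid(theta)
    probs = (1.0 - p1, p1)
    # Score function d/dtheta log pi(a) for a = 0 and a = 1.
    scores = (-p1, 1.0 - p1)
    samples = [(rewards[a] - baseline) * scores[a] for a in (0, 1)]
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    second_moment = sum(probs[a] * samples[a] ** 2 for a in (0, 1))
    return mean, second_moment - mean ** 2


theta = 0.3            # hypothetical policy parameter
rewards = (1.0, 3.0)   # hypothetical rewards for actions 0 and 1

p1 = sigmoid(theta)
avg_reward = (1.0 - p1) * rewards[0] + p1 * rewards[1]

mean_no_baseline, var_no_baseline = estimator_moments(theta, rewards, 0.0)
mean_baseline, var_baseline = estimator_moments(theta, rewards, avg_reward)

print("mean without baseline:", mean_no_baseline)
print("mean with baseline:   ", mean_baseline)
print("variance without baseline:", var_no_baseline)
print("variance with baseline:   ", var_baseline)
```

Under these assumptions the two means agree, because the baseline term is multiplied by a score function whose expectation is zero, while the variance with the average-reward baseline is smaller than with no baseline.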

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On Divergence Measures for Training GFlowNets
cs.LG 2024-10 unverdicted novelty 6.0

Introduces statistically efficient estimators for Renyi-α, Tsallis-α, reverse and forward KL divergences with REINFORCE and score-matching control variates for faster GFlowNet training.