
arXiv: 1301.2315 · v1 · pith:UKOXY56Qnew · submitted 2013-01-10 · 💻 cs.LG · cs.AI · stat.ML

The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

classification 💻 cs.LG cs.AI stat.ML
keywords reward baseline gradient variance algorithms learning optimal reinforcement

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
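The claim that a constant reward baseline changes variance without adding bias can be checked by hand on a very small problem. The sketch below is a toy illustration, not the paper's algorithm or setting: it assumes a one-step, two-action problem with a sigmoid policy, and the names `gradient_moments`, `expected_reward`, `REWARDS`, and `THETA` are hypothetical. It enumerates both actions to compute the exact mean and variance of the single-sample estimator g = (r(a) - b) · d/dθ log π(a | θ).

```python
import math


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def gradient_moments(theta, rewards, baseline):
    """Exact mean and variance of g = (r(a) - baseline) * dlog pi(a)/dtheta
    for a two-action policy with pi(a=1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    probs = (1.0 - p1, p1)
    mean = 0.0
    second_moment = 0.0
    for action, prob in enumerate(probs):
        # Score function: d/dtheta log pi(action | theta) = action - p1.
        score = action - p1
        g = (rewards[action] - baseline) * score
        mean += prob * g
        second_moment += prob * g * g
    return mean, second_moment - mean * mean


def expected_reward(theta, rewards):
    """Expected one-step reward under the policy (toy stand-in for average reward)."""
    p1 = sigmoid(theta)
    return (1.0 - p1) * rewards[0] + p1 * rewards[1]


# Hypothetical toy values chosen for illustration only.
REWARDS = (1.0, 3.0)
THETA = 0.0

no_baseline = gradient_moments(THETA, REWARDS, 0.0)
with_baseline = gradient_moments(THETA, REWARDS, expected_reward(THETA, REWARDS))

print("no baseline      (mean, variance):", no_baseline)
print("average baseline (mean, variance):", with_baseline)
```

With these toy values the expected gradient is 0.5 in both cases, while the variance falls from 1.0 to 0 once the expected reward is subtracted. That the expected reward is exactly variance-minimizing here is a feature of this symmetric one-step example (both actions equally likely); the paper's result concerns the long-term average reward in the limit of the zero-bias parameterization, which this sketch does not reproduce.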



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On Divergence Measures for Training GFlowNets

    cs.LG 2024-10 unverdicted novelty 6.0

    Introduces statistically efficient estimators for Renyi-α, Tsallis-α, reverse and forward KL divergences with REINFORCE and score-matching control variates for faster GFlowNet training.