The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

2013 · cs.LG · arXiv 1301.2315

2 Pith papers cite this work.


abstract

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
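
The claim that a constant baseline changes the variance of the gradient estimate without adding bias can be checked by hand on a toy problem. The sketch below is not the paper's algorithm or setting: it assumes a hypothetical one-step, two-armed bandit with deterministic rewards and a logistic policy, and the function names `grad_estimator_stats` and `expected_reward` are invented for illustration. It computes the exact mean and variance of the one-sample score-function estimate by enumerating both actions.

```python
import math


def grad_estimator_stats(theta, rewards, baseline):
    """Exact mean and variance of the one-sample gradient estimate
    (r_a - baseline) * d/dtheta log pi(a) for a two-armed bandit.

    Assumed toy setup (not from the paper): pi(a=1) = sigmoid(theta),
    and rewards[a] is the deterministic reward for action a.
    """
    p = 1.0 / (1.0 + math.exp(-theta))  # pi(a=1); pi(a=0) = 1 - p
    probs = [1.0 - p, p]
    # For this logistic policy, d/dtheta log pi(a) = a - p.
    samples = [(rewards[a] - baseline) * (a - p) for a in (0, 1)]
    mean = sum(q * g for q, g in zip(probs, samples))
    second_moment = sum(q * g * g for q, g in zip(probs, samples))
    return mean, second_moment - mean * mean


def expected_reward(theta, rewards):
    """Average reward under the same assumed logistic policy."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p) * rewards[0] + p * rewards[1]


if __name__ == "__main__":
    theta, rewards = 0.0, (1.0, 3.0)
    # Compare no baseline against the average-reward baseline.
    for b in (0.0, expected_reward(theta, rewards)):
        mean, var = grad_estimator_stats(theta, rewards, b)
        print(f"baseline={b:.2f}  mean={mean:.4f}  variance={var:.4f}")
```

In this toy case the mean of the estimate is the same for every baseline, while the variance depends on it. The result about the average expected reward being the optimal constant baseline belongs to the paper's own setting (the limit approaching the zero-bias parameterization); this one-step example only illustrates the unbiasedness property and should not be read as a reproduction of that result.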

representative citing papers

On Advantage Estimates for Max@K Policy Gradients

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.

On Divergence Measures for Training GFlowNets

cs.LG · 2024-10-12 · unverdicted · novelty 6.0

Introduces statistically efficient estimators for Renyi-α, Tsallis-α, reverse and forward KL divergences with REINFORCE and score-matching control variates for faster GFlowNet training.
