The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
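
The baseline claim in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a one-step, three-armed bandit with a softmax policy and deterministic rewards, and it uses the plain score-function estimator g = (r - b) * grad log pi(a). All names (`softmax`, `grad_log_pi`, `estimator_moments`) and the reward values are hypothetical. The code computes the exact mean and total variance of the estimator by enumerating actions, once with b = 0 and once with b set to the average expected reward.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(probs, action):
    """Gradient of log softmax w.r.t. theta: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def estimator_moments(theta, rewards, baseline):
    """Exact mean vector and total variance of g = (r_a - b) * grad log pi(a).

    Assumes one deterministic reward per action (a toy setting, not the
    partially observable environments treated in the paper).
    """
    probs = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p in enumerate(probs):
        g = [(rewards[a] - baseline) * d for d in grad_log_pi(probs, a)]
        for i in range(n):
            mean[i] += p * g[i]
        second_moment += p * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


theta = [0.0, 0.0, 0.0]    # hypothetical policy parameters (uniform policy)
rewards = [1.0, 2.0, 3.0]  # hypothetical per-action rewards
probs = softmax(theta)
avg_reward = sum(p * r for p, r in zip(probs, rewards))

mean_0, var_0 = estimator_moments(theta, rewards, 0.0)
mean_b, var_b = estimator_moments(theta, rewards, avg_reward)

print("mean, b=0:         ", mean_0)
print("mean, b=avg reward:", mean_b)
print("variance, b=0:         ", var_0)
print("variance, b=avg reward:", var_b)
```

In this toy setting both baselines give the same expected gradient, because the expectation of grad log pi(a) under the policy is zero, while the variance is lower with the average-reward baseline. The sketch says nothing about the discounted, long-run setting in which the abstract states its optimality result.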

fields

cs.LG 2

years

2026 1
2024 1

verdicts

UNVERDICTED 2

representative citing papers

On Advantage Estimates for Max@K Policy Gradients

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.

On Divergence Measures for Training GFlowNets

cs.LG · 2024-10-12 · unverdicted · novelty 6.0

Introduces statistically efficient estimators for Renyi-α, Tsallis-α, reverse and forward KL divergences with REINFORCE and score-matching control variates for faster GFlowNet training.

citing papers explorer

  • On Advantage Estimates for Max@K Policy Gradients cs.LG · 2026-06-04 · unverdicted · none · ref 56 · internal anchor

  • On Divergence Measures for Training GFlowNets cs.LG · 2024-10-12 · unverdicted · none · ref 94 · internal anchor