Delightful Distributed Policy Gradient

· 2026 · cs.LG · arXiv 2603.20521

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate finite-batch updates through large perpendicular components, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and preserving rare successes without behavior probabilities. In a tabular analysis, DG suppresses the perpendicular second moment of high-surprisal failures by a policy-overlap factor that vanishes as the learner improves. The advantage sign is essential for surprisal-based filtering: any learner-probability-only gate that suppresses rare failures also suppresses rare successes. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG often achieves nearly order-of-magnitude lower error. When all four frictions act simultaneously, its sample-efficiency advantage is order-of-magnitude and grows with task complexity.
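
The gating idea in the abstract can be illustrated with a small sketch. Only the definitions of surprisal (negative log-probability under the learner's policy) and delight (advantage times surprisal) come from the abstract. The sigmoid gate shape, the `temperature` parameter, the function names, and the sample values are assumptions made for illustration and may differ from the paper's actual gate. The hypothetical `prob_only_gate` is included to show why a gate that ignores the advantage sign cannot tell a rare failure from a rare success.

```python
import math

def surprisal(prob):
    """Negative log-probability of the action under the learner's policy."""
    return -math.log(prob)

def delight(advantage, prob):
    """Delight as defined in the abstract: advantage times surprisal."""
    return advantage * surprisal(prob)

def gate(advantage, prob, temperature=1.0):
    """Hypothetical sigmoid gate on delight (gate shape is an assumption)."""
    return 1.0 / (1.0 + math.exp(-delight(advantage, prob) / temperature))

def prob_only_gate(prob):
    """Hypothetical gate that sees only the learner probability, not the advantage."""
    return prob

# Illustrative (advantage, learner probability) pairs; not data from the paper.
samples = [
    ("rare failure", -1.0, 0.001),
    ("rare success", +1.0, 0.001),
    ("common failure", -1.0, 0.9),
    ("common success", +1.0, 0.9),
]

for name, adv, p in samples:
    print(f"{name:15s} delight={delight(adv, p):+.3f} "
          f"delight_gate={gate(adv, p):.3f} prob_only_gate={prob_only_gate(p):.3f}")
```

Under these assumptions the delight gate is close to zero for the rare failure and close to one for the rare success, while the probability-only gate assigns both the same weight, which matches the abstract's point that the advantage sign is essential.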

representative citing papers

Delightful Exploration

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Delight-gated exploration spends actions only when expected improvement times surprisal exceeds a gate price, recovers Pandora's reservation rule, and shows weaker regret growth than Thompson sampling or epsilon-greedy across bandits and MDPs with transferable hyperparameters.

citing papers explorer

Showing 1 of 1 citing paper.

Delightful Exploration cs.LG · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
