A.2 PROOF OF THEOREM 2 AND PROPOSITION 1. Proof of Theorem 2. Step 1: RGPO variance upper bound. Let $Z = \nabla_\theta \log \pi_\theta(a \mid s) \cdot A^{\mathrm{old}}$

A FULL PROOFS A · 2002

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Beyond Importance Sampling: Rejection-Gated Policy Optimization

cs.LG · 2026-04-16 · unverdicted · novelty 6.0 · ref 12

RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
