
A.2 Proof of Theorem 2 and Proposition 1. Proof of Theorem 2. Step 1: RGPO variance upper bound. Let Z = ∇_θ log π_θ(a|s) · A_old

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.LG 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

Beyond Importance Sampling: Rejection-Gated Policy Optimization

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

citing papers explorer

Showing 1 of 1 citing paper.

  • Beyond Importance Sampling: Rejection-Gated Policy Optimization cs.LG · 2026-04-16 · unverdicted · none · ref 12
