A stochastic trust-region framework for policy optimization. arXiv preprint arXiv:1911.11640
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Ratio-Variance Regularized Policy Optimization
R²VPO uses ratio-variance regularization as a distributional soft brake on policy updates, claiming better performance than PPO on math reasoning and robotic control without hard clipping.
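To make the "soft brake" idea concrete, here is a minimal sketch of what a ratio-variance penalty could look like. It is an assumption for illustration, not the objective from the R²VPO paper: the function name `ratio_variance_surrogate`, the `penalty` weight, and the exact form (mean of ratio times advantage, minus a multiple of the variance of the ratios) are all hypothetical.

```python
import math
from statistics import fmean, pvariance


def ratio_variance_surrogate(logp_new, logp_old, advantages, penalty=1.0):
    """Hypothetical soft-brake surrogate: mean(r * A) - penalty * Var(r).

    r is the per-sample probability ratio pi_new / pi_old, computed from
    log-probabilities. No clipping is applied; large spread in the ratios
    is discouraged by the variance term instead.
    """
    # Per-sample importance ratios between the new and old policy.
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    # Unclipped policy-gradient surrogate.
    gain = fmean([r * a for r, a in zip(ratios, advantages)])
    # Population variance of the ratios acts as the (assumed) regularizer.
    return gain - penalty * pvariance(ratios)


# When the new policy equals the old one, every ratio is 1, the variance
# is 0, and the surrogate reduces to the mean advantage.
same = ratio_variance_surrogate([-1.0, -2.0], [-1.0, -2.0], [1.0, 3.0])

# When the ratios spread out (here 1 and 2), the penalty lowers the value.
spread = ratio_variance_surrogate([0.0, math.log(2.0)], [0.0, 0.0], [1.0, 1.0])
```

Unlike PPO's clipped objective, which zeroes the gradient for samples whose ratio leaves a fixed interval, a penalty of this kind acts on the whole batch of ratios at once, which is one way to read "distributional" in the summary above.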