Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.
Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine learning, 8:229–256
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 2roles
method 1polarities
use method 1representative citing papers
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
citing papers explorer
-
On the Sample Complexity of Differentially Private Policy Optimization
Differential privacy in policy optimization adds sample complexity costs that often appear as lower-order terms rather than dominating the bounds.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.