Regret analysis of stochastic and nonstochastic multi-armed bandit problems
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2019 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Optimistic Proximal Policy Optimization
OPPO augments PPO with optimistic policy evaluation driven by return uncertainty estimates and shows improved results over prior methods on a tabular sparse-reward task.