Optimistic Proximal Policy Optimization

Takahisa Imagawa; Takuya Hiraoka; Yoshimasa Tsuruoka

arXiv: 1906.11075 · v1 · pith:5RFBIAQVnew · submitted 2019-06-25 · cs.LG · cs.AI

Pith reviewed 2026-05-25 16:27 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: reinforcement learning, policy optimization, optimism, uncertainty, sparse rewards, proximal policy optimization, tabular environments


The pith

OPPO improves reinforcement learning with rare rewards by optimistically evaluating policies based on return uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Optimistic Proximal Policy Optimization to make it easier to learn good policies when rewards occur infrequently. OPPO accounts for uncertainty in the estimated total return and uses that to evaluate the policy more optimistically than standard methods would. This is shown to lead to better results than existing approaches in a tabular reinforcement learning task. A reader would care because many practical RL problems involve sparse or delayed rewards that make learning difficult. The method aims to leverage uncertainty estimates to guide the agent toward better policies without changing the core optimization framework.

Core claim

By incorporating an optimistic adjustment derived from the uncertainty of the estimated total return into the proximal policy optimization framework, OPPO achieves better performance than standard methods in a tabular task where rewards are rare.

What carries the argument

The optimistic policy evaluation that increases the value estimate in proportion to the uncertainty of the return.
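A minimal sketch of what "increasing the value estimate in proportion to the uncertainty" could look like in a tabular setting. The abstract does not specify the estimator, so the count-based 1/sqrt(n) uncertainty proxy, the coefficient `beta`, and the function name `optimistic_return` are illustrative assumptions, not the authors' method.

```python
import math


def optimistic_return(mean_return: float, visit_count: int, beta: float = 1.0) -> float:
    """Hypothetical UCB-style optimistic estimate: mean + beta * uncertainty."""
    # Assumed uncertainty proxy: shrinks as the state (or state-action pair)
    # is visited more often. Counts below 1 are treated as 1 to avoid
    # division by zero.
    uncertainty = 1.0 / math.sqrt(max(visit_count, 1))
    return mean_return + beta * uncertainty


# A rarely visited state gets a larger upward adjustment than a familiar one.
print(optimistic_return(0.0, visit_count=4))    # bonus of 1/sqrt(4)
print(optimistic_return(0.0, visit_count=100))  # bonus of 1/sqrt(100)
```

Under this assumed form the adjustment vanishes as counts grow, so the optimistic estimate approaches the plain estimate in well-explored regions.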

If this is right

OPPO outperforms existing methods in tabular tasks with rare rewards.
The approach alleviates difficulty in learning policies when rewards are infrequent.
Considering uncertainty allows for more effective policy evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If uncertainty estimates are reliable, this optimism could encourage beneficial exploration in other RL settings.
The method might be extended to deep RL by using modern uncertainty quantification techniques.
Misestimation of uncertainty could lead to over-optimism and unstable learning.
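The over-optimism risk in the last point can be illustrated with a toy action ranking. This is a sketch with made-up numbers, assuming the optimistic score has the additive form mean + beta * uncertainty; `pick_action` and all values are hypothetical and not taken from the paper.

```python
def pick_action(means, uncertainties, beta=1.0):
    """Return the index of the action with the largest optimistic score."""
    scores = [m + beta * u for m, u in zip(means, uncertainties)]
    return max(range(len(scores)), key=scores.__getitem__)


means = [0.5, 0.2]       # action 0 has the higher estimated return (illustrative)
calibrated = [0.1, 0.1]  # equal, modest uncertainty for both actions
inflated = [0.1, 0.9]    # uncertainty of action 1 is badly overestimated

print(pick_action(means, calibrated))  # 0
print(pick_action(means, inflated))    # 1
```

With calibrated uncertainty the better action wins; with an inflated estimate the worse action is preferred for as long as the inflation persists.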

Load-bearing premise

The uncertainty estimate of the total return must be accurate and unbiased so that the optimistic adjustment aids rather than disrupts policy learning.

What would settle it

If experiments with inaccurate or biased uncertainty estimates show that OPPO performs worse than or equal to standard PPO, the claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.11075 by Takahisa Imagawa, Takuya Hiraoka, Yoshimasa Tsuruoka.

**Figure 2.** Moving average ± standard deviation of episode rewards in bandit tile domain with 10 seeds until 1M time-steps. … executed with the probability 1 − ζ while the most previous action is repeated with the probability ζ. We set ζ = 1/4. We chose six games (Frostbite, Freeway, Solaris, Venture, Montezuma’s Revenge, and Private Eye) to evaluate the proposed method and run algorithms until 100 million timesteps in F…

**Figure 3.** Moving average of the average of RND bonus 1/ns0 in batch data. Note that we use a frame skipping technique, and the number of the frame skips is four; so one time-step is equal to or less than four frames (it is less than four if the episode ends at a skipped frame).

**Figure 4.** Moving average ± standard deviation of episode rewards with 5 seeds until 50M time-steps (100M time-steps in Frostbite).

Original abstract

Reinforcement Learning, a machine learning framework for training an autonomous agent based on rewards, has shown outstanding results in various domains. However, it is known that learning a good policy is difficult in a domain where rewards are rare. We propose a method, optimistic proximal policy optimization (OPPO), to alleviate this difficulty. OPPO considers the uncertainty of the estimated total return and optimistically evaluates the policy based on that amount. We show that OPPO outperforms the existing methods in a tabular task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

OPPO adds an optimistic adjustment based on return uncertainty to the PPO surrogate and reports better results on one tabular task, but the abstract supplies no task details, numbers, or ablations to back the claim.

OPPO modifies PPO by factoring in uncertainty around the estimated total return and then evaluating the policy more optimistically. The goal is to make learning easier when rewards are sparse. The authors say this beats prior methods on a tabular task. That is the whole contribution as presented.

The idea itself is a direct transplant of UCB-style optimism into the PPO objective, which is a routine move rather than a new framework. It is at least a clean way to target a known pain point in policy optimization. The paper keeps the claim narrow and does not over-reach on theory.

The main problem is that the abstract gives no task definition, no baseline list, no quantitative results, and no mention of statistical tests or ablations. Without those pieces the central empirical statement has no visible support, so it is impossible to judge whether the improvement is real or meaningful. The assumption that the uncertainty estimate is accurate enough to help rather than hurt is left unexamined in the given text.

This work is aimed at RL researchers who already use PPO and want a lightweight tweak for sparse-reward problems. Someone running experiments in that niche might try the idea, but the current write-up does not give enough to evaluate it. The paper deserves peer review so the experiments can be checked properly; the idea is straightforward and the limited claim is internally consistent even if the evidence is still missing.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Optimistic Proximal Policy Optimization (OPPO), an extension of PPO that incorporates uncertainty in the estimated total return to produce an optimistic policy evaluation. The central claim is that this adjustment alleviates difficulties in sparse-reward domains and yields outperformance over existing methods on a single tabular task.

Significance. If the tabular result is reproducible and the uncertainty estimate is shown to be well-behaved, the method offers a lightweight way to inject optimism into policy-gradient updates without additional parameters. The contribution is narrowly scoped and would primarily be of interest to researchers working on sparse-reward tabular or low-dimensional RL problems.

major comments (2)

[Abstract] The assertion that OPPO 'outperforms the existing methods in a tabular task' is unsupported; no task definition, state-action space size, reward sparsity level, baseline algorithms, number of runs, or statistical test is supplied. This absence makes the central empirical claim impossible to evaluate.
[Abstract and throughout] No equations, pseudocode, or derivation is provided for how the uncertainty of the total return is estimated or how the optimistic adjustment is folded into the PPO surrogate objective. Without this, it is impossible to verify that the method is well-defined or distinct from existing optimistic RL variants.
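
For orientation on the second comment, the standard PPO clipped surrogate (Schulman et al., 2017) is shown below, followed by one hypothetical way an optimistic adjustment could enter it. The first two lines are the published PPO objective; the third line is an assumption made here for illustration, with $\hat{\sigma}_t$ standing for an unspecified return-uncertainty estimate and $\beta \ge 0$ an assumed coefficient. Neither symbol comes from the material available on this page.

```latex
% Probability ratio and clipped surrogate (standard PPO)
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right]

% Hypothetical optimistic variant (assumption): replace \hat{A}_t by
\tilde{A}_t = \hat{A}_t + \beta\,\hat{\sigma}_t, \qquad \beta \ge 0
```

Whether OPPO uses an additive form like this, and how $\hat{\sigma}_t$ would be computed, is exactly what the referee asks the authors to specify.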

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments below and will revise the manuscript to incorporate the requested details.

Referee: [Abstract] The assertion that OPPO 'outperforms the existing methods in a tabular task' is unsupported; no task definition, state-action space size, reward sparsity level, baseline algorithms, number of runs, or statistical test is supplied. This absence makes the central empirical claim impossible to evaluate.

Authors: We agree that the abstract, as currently written, does not supply sufficient experimental context to support the performance claim. The revised version will expand the abstract to include a concise description of the tabular task (including state-action space size and reward sparsity), the baselines, the number of runs, and the statistical test used. revision: yes

Referee: [Abstract and throughout] No equations, pseudocode, or derivation is provided for how the uncertainty of the total return is estimated or how the optimistic adjustment is folded into the PPO surrogate objective. Without this, it is impossible to verify that the method is well-defined or distinct from existing optimistic RL variants.

Authors: We acknowledge that the manuscript does not contain the equations, derivation, or pseudocode describing the return-uncertainty estimator or its incorporation into the PPO surrogate. This is a material omission. The revision will add the missing formalization, a brief derivation of the optimistic adjustment, and pseudocode so that the method is fully specified and its relation to prior optimistic RL work can be assessed. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes the OPPO algorithm that incorporates uncertainty estimates of total return for optimistic policy evaluation and reports empirical outperformance versus baselines on one tabular task. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claim is a scoped empirical result rather than a mathematical reduction that collapses to its own inputs by construction; the derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

[3] Kamyar Azizzadenesheli, Manish Kumar Bera, and Animashree Anandkumar. Trust region policy optimization of POMDPs. arXiv preprint arXiv:1810.07900, 2018.
[4] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
[5] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[6] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. Seventh International Conference on Learning Representations, 2019.
[7] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. Seventh International Conference on Learning Representations, 2019.
[8] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
[9] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of international conference on Machine learning, volume 2, pages 267–274, 2002.
[10] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[12] Brendan O'Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty Bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.
[13] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, pages 4026–4034, 2016.
[14] Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2721–2730. JMLR.org, 2017.
[15] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, volume 2017, 2017.
[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of international conference on Machine learning, volume 37, pages 1889–1897, 2015.
[17] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[19] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
[20] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
[21] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pages 2753–2762, 2017.
[22] Hado P van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pages 4287–4295, 2016.
