Generalizing from a few environments in safety-critical reinforcement learning

Angelos Filos; Owain Evans; Yarin Gal; Zachary Kenton

arxiv: 1907.01475 · v1 · pith:LDOSFEOGnew · submitted 2019-07-02 · 💻 cs.LG · cs.AI · stat.ML


Pith reviewed 2026-05-25 10:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: reinforcement learning · generalization · safety · catastrophes · ensemble methods · uncertainty estimation · gridworld · CoinRun


The pith

Reinforcement learning agents can perform perfectly on training environments yet produce catastrophes in unseen test environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that deep RL policies trained on a small number of environments can still generate dangerous failures when faced with novel situations, even if they achieve perfect scores during training. This matters because real-world safety requires reliable behavior outside the exact scenarios seen in training, where exhaustive coverage is often impossible. In a gridworld, the authors show that ensemble averaging of models and a blocking classifier can substantially cut the rate of catastrophes. In the harder CoinRun environment the same modifications do not reliably prevent failures, but the uncertainty signal from the ensemble successfully predicts imminent catastrophes and can therefore trigger a request for human intervention.

Core claim

Deep RL agents that achieve optimal performance across a limited set of training environments can still produce catastrophic outcomes in new test environments. In the gridworld domain, ensemble model averaging combined with a blocking classifier reduces these failures; in CoinRun the same interventions do not produce statistically significant reductions, yet the epistemic uncertainty derived from the ensemble remains predictive of near-term catastrophes and supports a human-in-the-loop safeguard.

What carries the argument

Ensemble of deep RL policies whose averaged action values and disagreement (uncertainty) are used both to improve robustness and to forecast impending catastrophes.
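As a rough illustration of this mechanism, the sketch below averages per-action values over ensemble members and reports how much the members disagree about the chosen action. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy numbers, and the use of the population standard deviation as the disagreement measure are all hypothetical, since this page does not give the paper's exact uncertainty definition.

```python
from statistics import mean, pstdev

def ensemble_q(q_values_per_member):
    """Combine per-member Q-values (one list of action values per member).

    Returns the per-action mean and the per-action disagreement.
    Assumption: disagreement is the population standard deviation.
    """
    n_actions = len(q_values_per_member[0])
    # Regroup so that each inner list holds every member's value for one action.
    per_action = [[member[a] for member in q_values_per_member]
                  for a in range(n_actions)]
    mean_q = [mean(col) for col in per_action]
    disagreement = [pstdev(col) for col in per_action]
    return mean_q, disagreement

def act(q_values_per_member):
    """Greedy action on the averaged values, plus the disagreement for that action."""
    mean_q, disagreement = ensemble_q(q_values_per_member)
    action = max(range(len(mean_q)), key=mean_q.__getitem__)
    return action, disagreement[action]

# Three hypothetical ensemble members scoring three actions in one state.
members = [[1.0, 0.5, 0.0], [0.8, 0.9, 0.1], [1.2, 0.4, -0.1]]
action, uncertainty = act(members)
print(action, round(uncertainty, 3))
```

The same two outputs serve the two roles described above: the averaged values drive action selection, and the disagreement is the signal that can later be thresholded or scored as a catastrophe predictor.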

If this is right

Ensemble averaging and a simple blocking rule can lower catastrophe rates in low-complexity domains without requiring more training environments.
In richer visual environments the same ensemble and blocking techniques may leave catastrophe rates essentially unchanged.
Uncertainty estimates from the ensemble supply an early-warning signal that allows timely human intervention before a catastrophe occurs.
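A blocking rule of the kind mentioned in the first point can be sketched as an action mask. This is an illustrative sketch only: it assumes the blocker outputs a per-action probability that the action leads to a catastrophe, and the threshold and the "least dangerous action" fallback are hypothetical choices rather than details taken from the paper.

```python
def blocked_greedy_action(q_values, danger_probs, threshold=0.5):
    """Pick the highest-valued action that the blocker does not veto.

    q_values:     one value estimate per action.
    danger_probs: hypothetical blocker output, one catastrophe probability per action.
    """
    allowed = [a for a, p in enumerate(danger_probs) if p < threshold]
    # If every action is vetoed, fall back to the one judged least dangerous.
    if not allowed:
        return min(range(len(danger_probs)), key=danger_probs.__getitem__)
    return max(allowed, key=q_values.__getitem__)

# The greedy action (index 0) is vetoed, so the next-best allowed action is taken.
print(blocked_greedy_action([1.0, 0.7, 0.2], [0.9, 0.1, 0.05]))
```

Feeding ensemble-averaged values into `q_values` would combine the two gridworld modifications, which appears to be what the Block&Ens-DQN label in Figure 3 refers to.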

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Safety guarantees in RL may need explicit out-of-distribution detection rather than relying solely on average performance across training environments.
Uncertainty-based intervention could be combined with other generalization methods to create layered safety systems.
The predictive value of ensemble disagreement suggests that similar uncertainty signals might be useful in non-gridworld domains where direct catastrophe reduction is harder to achieve.

Load-bearing premise

The gridworld and CoinRun test environments are representative of the novel situations an agent would meet in safety-critical real-world use.

What would settle it

A controlled experiment showing that agents trained on the same few environments produce no increase in catastrophe rate when evaluated across a much larger and more diverse set of held-out environments.

Figures

Figures reproduced from arXiv: 1907.01475 by Angelos Filos, Owain Evans, Yarin Gal, Zachary Kenton.

**Figure 1.** Example trajectory from a Reveal environment. Agent: blue. Goal: green. Lava: red. Walls: grey. Mask: black.

3.2 Methods. Deep Q-Networks (DQN). Deep Q-networks [24] perform Q-learning [37] using a deep neural network as a function approximator to estimate the optimal value function $Q(s, a; \theta)$, where $\theta$ is a parameter vector. DQN is optimized by minimizing $L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[(y_i - Q(s, a; \theta_i))^2\right]$ at each iterat…

**Figure 2.** Results on the Reveal setting, evaluated on unseen test environments for a range of methods. Nine random seeds are used for each algorithm and mean performance is shown here. Figure (a) shows that modified algorithms outperform the baseline DQN in terms of danger avoidance. The effect on return performance is observed in (b). The complete version is provided in …

**Figure 3.** Example transition by the Block&Ens-DQN in one unseen environment, in the …

**Figure 4.** Two sample environments from our modified CoinRun setting.

**Figure 5.** Results in the CoinRun setting, evaluated on unseen …

**Figure 6.** ROC Curves for binary classifier based on discrimination function …

**Figure 7.** Example trajectory from a Full environment. Agent: blue. Goal: green. Lava: red. Walls: grey.

**Figure 8.** Complete quantitative experimental results on the …

**Figure 9.** Complete quantitative experimental results on the …

**Figure 10.** Complete quantitative experimental results on the CoinRun setting.

**Figure 11.** ROC Curves for binary classifier based on discrimination function …

**Figure 12.** ROC Curves evaluated on training levels.
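The DQN objective quoted alongside Figure 1 can be made concrete with a small sketch. The excerpt is truncated before it defines the target $y_i$, so the code assumes the standard DQN target from Mnih et al., $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$, with $y = r$ at episode end. The function names and the tabular toy Q-function are hypothetical and only stand in for a neural network.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Assumed DQN target: r at episode end, else r + gamma * max_a' Q(s', a')."""
    return reward if done else reward + gamma * max(next_q_values)

def dqn_loss(batch, q_fn, target_q_fn, gamma=0.99):
    """Mean squared TD error over (state, action, reward, next_state, done) tuples.

    q_fn and target_q_fn map a state to a list of per-action values.
    """
    errors = []
    for s, a, r, s_next, done in batch:
        y = td_target(r, target_q_fn(s_next), done, gamma)
        errors.append((y - q_fn(s)[a]) ** 2)
    return sum(errors) / len(errors)

# Toy tabular Q-function over two states and two actions (hypothetical values).
q_table = {0: [0.0, 1.0], 1: [2.0, 0.0]}
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
print(dqn_loss(batch, q_table.__getitem__, q_table.__getitem__, gamma=0.5))
```

In a real DQN the squared error would be minimized by gradient descent on the network parameters; the sketch only evaluates the quantity being minimized.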

Original abstract

Before deploying autonomous agents in the real world, we need to be confident they will perform safely in novel situations. Ideally, we would expose agents to a very wide range of situations during training, allowing them to learn about every possible danger, but this is often impractical. This paper investigates safety and generalization from a limited number of training environments in deep reinforcement learning (RL). We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments. Firstly, in a gridworld setting, we show that catastrophes can be significantly reduced with simple modifications, including ensemble model averaging and the use of a blocking classifier. In the more challenging CoinRun environment we find similar methods do not significantly reduce catastrophes. However, we do find that the uncertainty information from the ensemble is useful for predicting whether a catastrophe will occur within a few steps and hence whether human intervention should be requested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that RL agents can fail catastrophically on held-out environments despite perfect training performance. Ensembles help in the gridworld but not in CoinRun, where their uncertainty still aids short-term catastrophe prediction. The toy domains weaken any claims about real safety-critical settings.


The main thing to know is that this paper documents RL agents producing catastrophes in unseen test environments even after perfect training performance, then tests ensemble averaging, a blocking classifier, and uncertainty-based intervention prediction as responses. The gridworld results show clear reductions from those modifications, while CoinRun shows the main methods fail to cut catastrophes but uncertainty still flags imminent ones for human takeover. That differential outcome and the prediction use case are the concrete new pieces here.

The work does a solid job of setting up the safety-generalization problem with held-out environments and running the interventions end-to-end on two distinct domains. The intervention-prediction angle follows directly from the ensemble outputs and gives a usable downstream application without extra machinery.

The environments are the clear limitation. Gridworld hazards are hand-crafted and discrete, and CoinRun is a procedural platformer; neither includes continuous state spaces, sensor noise, actuator dynamics, or long-horizon physical interactions that define most safety-critical deployments. The stress-test concern lands because the observed failures and the limited success of the fixes could be artifacts of these stylized shifts rather than evidence that applies more broadly. Without quantitative effect sizes or statistical details in the abstract, it is also hard to judge how large the gridworld gains actually are.

Readers working on safe RL or uncertainty estimation would find the experiments worth looking at. The paper engages honestly with the literature on generalization failures and ships reproducible empirical claims, so it deserves a serious referee even with the domain caveats. I would send it to review.

Referee Report

3 major / 1 minor

Summary. The manuscript investigates generalization and safety in deep RL when agents are trained on a small number of environments. It claims that standard RL algorithms can produce dangerous failures on held-out test environments even when they achieve optimal performance on the training set. In a gridworld domain the authors report that ensemble averaging and a blocking classifier reduce the rate of catastrophes; in the CoinRun procedural platformer the same modifications do not yield significant reductions, but ensemble uncertainty is shown to be predictive of imminent catastrophes and therefore potentially useful for requesting human intervention.

Significance. If the reported patterns hold under more rigorous quantification, the work supplies concrete empirical evidence that perfect training performance does not guarantee safe behavior on novel environments, a point of direct relevance to safety-critical RL. The observation that ensemble uncertainty can flag impending failures offers a practical, if limited, mechanism for human oversight. The study is entirely empirical and contains no parameter-free derivations or machine-checked proofs.

major comments (3)

[Abstract] The directional claims that catastrophes 'can be significantly reduced' in the gridworld and that uncertainty is 'useful for predicting' catastrophes in CoinRun are presented without effect sizes, confidence intervals, or statistical tests. Because these statements constitute the central empirical results, the absence of quantitative reporting is load-bearing for any assessment of practical significance.
[Abstract] The manuscript provides no details on training procedures, hyper-parameter selection, number of random seeds, or exact definitions of 'catastrophe' and 'perfect training performance.' These omissions prevent independent verification of the reported failure modes and mitigation effects.
[Abstract] The post-hoc nature of the intervention-prediction task is not addressed: it is unclear whether the ensemble uncertainty signal is evaluated on a rolling basis during an episode or only after the fact, and no baseline predictors (e.g., random or state-norm baselines) are reported.

minor comments (1)

[Abstract] The term 'blocking classifier' is introduced in the abstract without a one-sentence definition or pointer to its formal description, which hinders readability for readers outside the immediate sub-area.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We provide point-by-point responses to the major comments below. We agree that additional quantitative details and clarifications are needed and will revise the manuscript accordingly.


Referee: [Abstract] The directional claims that catastrophes 'can be significantly reduced' in the gridworld and that uncertainty is 'useful for predicting' catastrophes in CoinRun are presented without effect sizes, confidence intervals, or statistical tests. Because these statements constitute the central empirical results, the absence of quantitative reporting is load-bearing for any assessment of practical significance.

Authors: We agree with this assessment. The abstract currently makes directional claims without supporting quantitative information. In the revised version, we will include specific effect sizes, such as the percentage reduction in catastrophe rates for the gridworld experiments, along with references to the number of trials and any statistical tests conducted. This will allow readers to better evaluate the practical significance of our findings.

Revision: yes

Referee: [Abstract] The manuscript provides no details on training procedures, hyper-parameter selection, number of random seeds, or exact definitions of 'catastrophe' and 'perfect training performance.' These omissions prevent independent verification of the reported failure modes and mitigation effects.

Authors: We acknowledge the lack of these details in the abstract and potentially insufficient detail in the main text. We will expand the experimental setup section to explicitly describe the training procedures, hyper-parameter choices, the use of multiple random seeds (nine per algorithm in the gridworld experiments, as the Figure 2 caption states), and precise definitions of a catastrophe (e.g., agent death or failure to complete the level) and perfect training performance (optimal reward on all training environments). We will also add a short reference in the abstract to these details being available in the methods.

Revision: yes

Referee: [Abstract] The post-hoc nature of the intervention-prediction task is not addressed: it is unclear whether the ensemble uncertainty signal is evaluated on a rolling basis during an episode or only after the fact, and no baseline predictors (e.g., random or state-norm baselines) are reported.

Authors: This is a valid point regarding the clarity of our evaluation protocol. Our experiments evaluated the ensemble uncertainty on a rolling basis at each timestep during the episode to predict imminent catastrophes. We will revise the relevant section to clearly state this methodology and include comparisons against baseline predictors, such as random prediction and simple state-norm thresholds, to demonstrate the added value of the uncertainty signal.

Revision: yes
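The rolling evaluation and baseline comparison discussed in this exchange can be sketched as follows. This is a hedged illustration, not the paper's protocol: each timestep is labelled by whether a catastrophe occurs within a hypothetical horizon of `horizon` steps, and the per-timestep uncertainty is scored with a rank-based ROC AUC. The function names, the horizon, and the toy episode are assumptions.

```python
def label_imminent(catastrophe_step, episode_length, horizon):
    """Label each timestep 1 if the catastrophe is at most `horizon` steps ahead."""
    if catastrophe_step is None:
        return [0] * episode_length
    return [1 if 0 <= catastrophe_step - t <= horizon else 0
            for t in range(episode_length)]

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-timestep ensemble uncertainty for one 6-step episode
# that ends in a catastrophe at step 5.
uncertainty = [0.1, 0.2, 0.1, 0.6, 0.7, 0.9]
labels = label_imminent(catastrophe_step=5, episode_length=6, horizon=2)
auc = roc_auc(uncertainty, labels)
print(labels, auc)
```

A predictor that carries no information, such as a constant score, yields an AUC of 0.5 under this definition, which gives the kind of reference point the referee asks for when judging whether the uncertainty signal adds value.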

Circularity Check

0 steps flagged

Empirical study with no derivation chain or fitted predictions


The paper reports experimental results on gridworld and CoinRun environments, demonstrating that RL agents can fail on held-out test environments despite perfect training performance, and testing modifications such as ensemble averaging and a blocking classifier. No mathematical derivations, first-principles results, or parameter-fitting steps are described that reduce to their own inputs by construction. All central observations rely on explicit held-out test environments rather than on any self-referential prediction or load-bearing self-citation. The work is therefore self-contained against external benchmarks, with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The implicit modeling choice that ensemble disagreement is a reliable proxy for future catastrophe risk is unexamined.

pith-pipeline@v0.9.0 · 5685 in / 968 out tokens · 17179 ms · 2026-05-25T10:55:33.578209+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "ensemble model averaging and the use of a blocking classifier"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chr...
[2] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944941
[4] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
[5] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
[6] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[7] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[8] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, 2018.
[9] Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017.
[11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pages 1050–1059, 2016.
[12] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90
[14] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[16] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
[19] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
[20] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[21] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017.
[22] Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning's Sisyphean curse with intrinsic fear. arXiv preprint cs.LG/1611.01211, 2016.
[23] Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control, 2018.
[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[27] Supratik Paul, Michael A Osborne, and Shimon Whiteson. Fingerprint policy optimisation for robust reinforcement learning. arXiv preprint arXiv:1805.10662, 2018.
[28] William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
[29] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
[30] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[31] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[32] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[33] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[34] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
[35] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[36] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
[37] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
[38] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.

[1] [1]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chr...

work page 2015

[2] [2]

Learning Dexterous In-Hand Manipulation

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Using conﬁdence bounds for exploitation-exploration trade-offs.J

Peter Auer. Using conﬁdence bounds for exploitation-exploration trade-offs.J. Mach. Learn. Res., 3:397–422, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm? id=944919.944941

work page arXiv 2003

[4] [4]

On the optimization of a synaptic learning rule

Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artiﬁcial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992

work page 1992

[5] [5]

Quantifying Generalization in Reinforcement Learning

Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Ensemble methods in machine learning

Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classiﬁer systems, pages 1–15. Springer, 2000

work page 2000

[7] [7]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[8] [8]

Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, V olodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures, 2018

work page 2018

[9] Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018

[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017

[11] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pages 1050–1059, 2016

[12] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/CVPR.2016.90

[15] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001

[16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

[17] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

[19] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017

[20] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017

[21] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016

[22] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017

[23] Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning's Sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211, 2016

[24] Rhiannon Michelmore, Marta Kwiatkowska, and Yarin Gal. Evaluating uncertainty quantification in end-to-end autonomous driving control, 2018

[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

[26] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016

[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017

[28] Supratik Paul, Michael A Osborne, and Shimon Whiteson. Fingerprint policy optimisation for robust reinforcement learning. arXiv preprint arXiv:1805.10662, 2018

[29] William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069. International Foundation for Autonomous Agents and Multiagent Systems, 2018

[30] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987

[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

[32] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016

[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014

[34] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018

[35] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012

[36] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012

[37] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

[38] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992

[39] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018

A Supplementary Material

Full Setting

See Fig. 7 for some frames from the Full gridworld setting.

Figure 7: Example trajectory from a Full environment. Agent: blue. Goal: green. Lava: r...