Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

Alex Trott; Caiming Xiong; Richard Socher; Stephan Zheng; Wenling Shang

arxiv: 1907.00664 · v1 · pith:2CI4RNTTnew · submitted 2019-07-01 · 💻 cs.LG · stat.ML

Wenling Shang, Alex Trott, Stephan Zheng, Caiming Xiong, Richard Socher

Pith reviewed 2026-05-25 12:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords hierarchical reinforcement learning · world graph · pivotal states · goal-conditioned policy · curiosity-driven exploration · maze navigation · task transfer

The pith

A learned world graph of pivotal states lets hierarchical agents solve new tasks by planning over the graph and traversing long paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to build a graph abstraction over an environment, with nodes as important pivotal states and edges as feasible traversals between them. The approach has two stages. First, a latent pivotal state model and a curiosity-driven goal-conditioned policy are trained jointly, without any task-specific information, to learn the graph. Second, for new tasks, a high-level Manager uses the graph to quickly find solutions and set subgoals at pivotal states for a low-level Worker, which then traverses long distances and explores non-locally using the graph. Thorough ablation studies on a suite of challenging maze tasks show significant advantages in performance and efficiency over baselines without the world graph. A sympathetic reader would care because the approach reuses learned environment structure to handle multiple tasks without starting from scratch each time.

Core claim

The paper claims that a latent pivotal state model jointly trained with a curiosity-driven goal-conditioned policy in a task-agnostic manner produces a world graph abstraction. Provided with this graph, a high-level Manager quickly finds solutions to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker, which leverages the graph to traverse to those states even across long distances and to explore non-locally, yielding better performance and efficiency than graph-free baselines on maze tasks.
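To make the idea of planning over pivotal states concrete, here is a minimal sketch, not the paper's implementation. It assumes the world graph is stored as a dictionary of weighted edges and that a Manager-like planner runs Dijkstra's algorithm to pick a sequence of pivotal states. The node coordinates, edge costs, and the name `shortest_path` are all hypothetical.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a graph given as {node: {neighbor: cost}}.

    Returns (total_cost, [nodes on the path]) or None if goal is unreachable.
    """
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, step in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(frontier, (cost + step, nxt, path + [nxt]))
    return None

# Hypothetical pivotal states (grid cells) and traversal costs in steps.
world_graph = {
    (1, 1): {(1, 5): 4, (4, 1): 3},
    (1, 5): {(1, 1): 4, (6, 5): 5},
    (4, 1): {(1, 1): 3, (6, 5): 9},
    (6, 5): {(1, 5): 5, (4, 1): 9},
}

cost, path = shortest_path(world_graph, (1, 1), (6, 5))
print(cost, path)
```

In this toy graph the route through `(1, 5)` is cheaper than the route through `(4, 1)`, so the planner returns the former. In the paper's setting, each edge would correspond to a traversal the Worker can actually execute.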

What carries the argument

The world graph with nodes as pivotal states and edges as feasible traversals, produced by joint training of a latent pivotal state model and curiosity-driven goal-conditioned policy.

If this is right

A high-level Manager can quickly find solutions to new tasks by planning with reference to pivotal states.
A low-level Worker can traverse long distances to pivotal states and explore non-locally using the graph.
The framework produces significant advantages in performance and efficiency on maze tasks over methods lacking the graph.
The graph abstraction supports solving multiple tasks within one complex environment by reusing structure learned without task labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might scale to continuous state spaces if the latent model can identify useful pivotal states without discrete maze structure.
Combining the graph with other hierarchical methods could further reduce the cost of adapting to task variations.
Dynamically refining the graph during task solving might allow adaptation to changes in the environment.
Testing the method in environments with partial observability could reveal whether the graph still enables non-local exploration.

Load-bearing premise

The latent pivotal state model jointly trained with the curiosity-driven goal-conditioned policy in a task-agnostic manner produces nodes that form a useful graph abstraction for solving previously unseen tasks.

What would settle it

Training the model on maze environments and observing no improvement in success rate or sample efficiency on held-out tasks when the learned graph is provided to the Manager and Worker compared to baselines without it.

Figures

Figures reproduced from arXiv: 1907.00664 by Alex Trott, Caiming Xiong, Richard Socher, Stephan Zheng, Wenling Shang.

**Figure 1.** Top Left: Overall pipeline of our proposed 2-stage framework. Top Right (world graph discovery): a subgraph exemplifies how to forge edges and traverse between pivotal states (in blue). Bottom (Hierarchical RL): an example rollout from our proposed HRL policy with Wide-then-Narrow Manager instructions and world graph traversals, solving a challenging Door-Key task. At first glimpse, the world graph seems su…

**Figure 2.** Our recurrent latent model with differentiable binary latent units to discover pivotal states. A prior network (left) learns the state-conditioned prior in Beta distribution, $p_\psi(z_t \mid s_t) = \mathrm{Beta}(\alpha_t, \beta_t)$. An inference encoder learns an approximate posterior in HardKuma distribution [8] inferred from $(s_t, a_t)$'s, $q_\phi(z_t \mid a_t, s_t) = \mathrm{HardKuma}(\tilde{\alpha}_t, 1)$. A generation decoder reconstructs the action sequence from {st|zt…

**Figure 3.** Left: a general configuration of Feudal Network; Manager and Worker are both A2C-LSTMs operating at different temporal resolutions. Right: proposed Wide-then-Narrow Manager instruction, where Manager first outputs a wide goal $g_w$ from a pre-defined set of candidate states $V$, e.g. $V_p$, and then zooms its attention to a closer-up area around $g_w$ to narrow down the final subgoal $g_n$. the shortest such actionable …

**Figure 4.** Validation curves during training (mean and standard deviation of reward, 3 seeds) for MultiGoal. Left: comparison between $V_p$ and $V_{\mathrm{rand}}$, with or without traversal; all models here use WN and $\pi_g$ initialization. Observe that (1) traversal evidently speeds up convergence and (2) $V_{\mathrm{rand}}$ carries higher variance and slightly inferior performance than $V_p$. Right: comparison with or without $\pi_g$ initialization on $V_p$, all models…
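The Figure 2 caption mentions a HardKuma posterior with its second shape parameter fixed to 1. As a rough illustration of why such a variable can act as a near-binary pivotal-state indicator, the sketch below draws samples by inverting the Kumaraswamy CDF, stretching the result to an interval slightly wider than [0, 1], and clamping it back. The stretch bounds `lo=-0.1` and `hi=1.1` and the function name `sample_hardkuma` are assumptions for illustration. This is not the paper's code.

```python
import random

def sample_hardkuma(a, b=1.0, lo=-0.1, hi=1.1, rng=random):
    """Draw one stretched-and-rectified Kumaraswamy sample in [0, 1].

    a, b   : Kumaraswamy shape parameters (b = 1 matches the caption).
    lo, hi : assumed stretch interval; mass outside [0, 1] collapses to 0 or 1.
    """
    u = rng.random()
    # Inverse CDF of Kumaraswamy(a, b): F(x) = 1 - (1 - x**a)**b
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    # Stretch to (lo, hi), then clamp so exact 0s and 1s have nonzero probability.
    t = lo + (hi - lo) * k
    return min(1.0, max(0.0, t))

rng = random.Random(0)
samples = [sample_hardkuma(a=1.0, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Because of the clamping step, a share of the samples land exactly on 0 or 1, which is what allows the latent to behave like a binary selection while the unclamped region stays differentiable.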

read the original abstract

In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solution to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distance, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages from the proposed framework over baselines that lack world graph knowledge in terms of performance and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The two-stage world graph method for HRL has solid ablations on mazes but the transferability of curiosity-discovered states to new tasks is the main open question.

The paper's core claim is that a world graph built from task-agnostic training of a latent pivotal-state model and a curiosity-driven policy can speed up hierarchical RL on new tasks. The manager uses the graph to find solutions quickly and set subgoals, while the worker uses it for non-local traversal. What is new here is the two-stage setup that explicitly constructs the graph after the joint training and then integrates it into the manager-worker hierarchy. The paper does well by running a thorough ablation study on a suite of maze tasks and showing advantages over baselines that lack the graph.

The soft spots are around the transfer assumption. The stress-test concern is valid on the surface: curiosity objectives often reward local novelty rather than global structure, so the discovered nodes might not form a reliable abstraction for unseen tasks. If the experiments do not include strong tests on distribution-shifted mazes or fail to show that the graph enables the claimed non-local exploration, the advantage could be overstated. The abstract does not include equations or specific training details, which makes it harder to assess the implementation.

This paper is for researchers focused on hierarchical reinforcement learning and ways to incorporate environment structure. A reader looking for empirical results on maze navigation with graph abstractions would find it relevant. It deserves a serious referee because it has concrete experiments and addresses a practical problem in multi-task settings, even though the key assumption about pivotal states would benefit from closer examination in review. Recommendation: send it out for peer review.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage framework for accelerating hierarchical RL across multiple tasks in a shared environment by learning a 'world graph' whose nodes are pivotal states discovered via task-agnostic training. Stage 1 jointly optimizes a latent pivotal-state model together with a curiosity-driven goal-conditioned policy. Stage 2 supplies the resulting graph to a high-level Manager that plans over pivotal states and issues subgoals to a low-level Worker; the Worker in turn uses graph edges for long-range, non-local traversal. The authors report a thorough ablation study on a suite of maze navigation tasks demonstrating performance and sample-efficiency gains relative to baselines that lack the world-graph abstraction.

Significance. If the discovered pivotal states reliably form transferable abstractions rather than task-specific exploration artifacts, the approach would supply a concrete mechanism for reusable hierarchical structure in RL, directly addressing the sample-efficiency bottleneck in long-horizon, multi-task settings. The empirical claims on maze domains, if substantiated by the ablations, would constitute a practical demonstration that curiosity-driven discovery can yield planning-friendly graphs.

major comments (2)

[Abstract] Abstract (two-stage description): the central claim that the jointly trained latent pivotal-state model produces nodes usable for solving previously unseen tasks rests on an unverified assumption that curiosity-driven discovery aligns with task-relevant bottlenecks. No mechanism is stated that would prevent the nodes from being transient visitation artifacts, which would render the Manager/Worker transfer advantage void on held-out mazes.
[Ablation study] Ablation study paragraph: the manuscript asserts 'significant advantages … in terms of performance and efficiency' yet supplies neither quantitative metrics (e.g., success rate deltas, sample-complexity ratios) nor a direct comparison isolating the contribution of the world-graph transfer versus the curiosity policy alone. Without these numbers the load-bearing claim that the graph accelerates new-task learning cannot be evaluated.
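The two quantities the referee asks for can be computed from ordinary learning curves. The sketch below shows one possible definition of each: a success-rate delta at a fixed training budget, and a sample-complexity ratio measured as steps needed to reach a threshold. All numbers are made up for illustration and are not results from the paper. The function name `steps_to_threshold` and the 0.8 threshold are also assumptions.

```python
def steps_to_threshold(curve, threshold):
    """Return the first env-step count at which success rate >= threshold.

    curve: list of (env_steps, success_rate) pairs ordered by env_steps.
    Returns None if the threshold is never reached.
    """
    for steps, rate in curve:
        if rate >= threshold:
            return steps
    return None

# Hypothetical learning curves (NOT the paper's data).
with_graph = [(10_000, 0.20), (20_000, 0.60), (30_000, 0.85),
              (40_000, 0.90), (50_000, 0.92), (60_000, 0.93)]
without_graph = [(10_000, 0.10), (20_000, 0.30), (30_000, 0.50),
                 (40_000, 0.70), (50_000, 0.75), (60_000, 0.82)]

# Success-rate delta at the same final budget (60k steps).
delta = with_graph[-1][1] - without_graph[-1][1]

# Sample-complexity ratio: how many times more steps the baseline needs.
ratio = steps_to_threshold(without_graph, 0.8) / steps_to_threshold(with_graph, 0.8)

print(f"success-rate delta: {delta:.2f}, sample-complexity ratio: {ratio:.1f}x")
```

Isolating the graph's contribution, as the referee requests, would require running the same computation against a baseline that keeps the curiosity-trained policy but withholds the graph.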

minor comments (1)

[Abstract] Abstract: 'finds solution to new tasks' should read 'finds a solution to new tasks'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the mechanisms in our approach and indicating revisions where the presentation can be strengthened.

Referee: [Abstract] Abstract (two-stage description): the central claim that the jointly trained latent pivotal-state model produces nodes usable for solving previously unseen tasks rests on an unverified assumption that curiosity-driven discovery aligns with task-relevant bottlenecks. No mechanism is stated that would prevent the nodes from being transient visitation artifacts, which would render the Manager/Worker transfer advantage void on held-out mazes.

Authors: The joint optimization of the latent pivotal-state model with the curiosity-driven goal-conditioned policy provides the mechanism: the model is trained to encode states that enable the policy to achieve diverse goals via intrinsic rewards, favoring states that serve as reliable exploration hubs rather than transient visitations. This task-agnostic process produces nodes that transfer to held-out tasks, as shown by the empirical results on unseen mazes. We will revise the abstract to explicitly articulate this joint-training mechanism. revision: yes

Referee: [Ablation study] Ablation study paragraph: the manuscript asserts 'significant advantages … in terms of performance and efficiency' yet supplies neither quantitative metrics (e.g., success rate deltas, sample-complexity ratios) nor a direct comparison isolating the contribution of the world-graph transfer versus the curiosity policy alone. Without these numbers the load-bearing claim that the graph accelerates new-task learning cannot be evaluated.

Authors: The full paper's ablation study reports quantitative results via success rates, learning curves, and efficiency comparisons against baselines. However, an explicit ablation isolating the world-graph transfer benefit from the curiosity policy alone is not present. We will add this comparison and include specific numerical deltas (e.g., success-rate improvements and sample-complexity ratios) in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical procedure is self-contained

The paper presents a two-stage empirical method: joint task-agnostic training of a latent pivotal state model with a curiosity-driven goal-conditioned policy, followed by using the resulting graph for Manager/Worker hierarchical control on new tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that reduce the central claims to inputs by construction. The approach is validated via ablation on maze tasks rather than a closed derivation, so the derivation chain does not collapse and remains externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that environments contain learnable pivotal states whose graph abstraction transfers to new tasks; no free parameters or invented entities with independent evidence are specified in the abstract.

axioms (1)

domain assumption: Environments contain identifiable pivotal states that can be discovered task-agnostically and assembled into a useful graph for subgoal planning.
Invoked as the foundation for both training stages and the subsequent use of the world graph.

invented entities (1)

world graph (no independent evidence)
purpose: Graph abstraction whose nodes are pivotal states and edges are feasible traversals, used to accelerate hierarchical task solving.
Newly introduced abstraction whose utility is asserted but not independently verified in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Graph World Models: Concepts, Taxonomy, and Future Directions
cs.AI 2026-04 unverdicted novelty 7.0

The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 1 Pith paper · 25 internal anchors

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
[2] J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732, 2017.
[3] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, pages 1027–1037, 2008.
[4] M. G. Azar, B. Piot, B. A. Pires, J.-B. Gril, F. Altche, and R. Munos. World discovery model. arXiv, 2019.
[5] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[6] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[7] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017.
[8] J. Bastings, W. Aziz, and I. Titov. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 2019 Conference of the Association for Computational Linguistics, Volume 1 (Long Papers). Association for Computational Linguistics, 2019.
[9] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. 1995.
[10] D. P. Bertsekas. Nonlinear Programming. 1999.
[11] N. Biggs. Algebraic Graph Theory. 1993.
[12] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[13] D. M. Blei and P. J. Moreno. Topic segmentation with an aspect hidden markov model. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 343–348. ACM, 2001.
[14] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
[15] L. Buşoniu, R. Babuška, and B. De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1, pages 183–221. Springer, 2010.
[16] W. Chan, Y. Zhang, Q. Le, and N. Jaitly. Latent sequence decompositions. arXiv preprint arXiv:1610.03035, 2016.
[17] M. Chatzigiorgaki and A. N. Skodras. Real-time keyframe extraction towards video content identification. In 2009 16th International conference on digital signal processing, pages 1–6. IEEE, 2009.
[18] M. Chevalier-Boisvert and L. Willems. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018.
[19] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
[20] J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.
[21] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018.
[23] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
[24] Z. Dwiel, M. Candadi, M. Phielipp, and A. Bansal. Hierarchical policy learning is sensitive to goal space design. arXiv preprint, (2), 2019.
[25] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
[27] Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous markov decision problems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 154–161. AUAI Press, 2004.
[28] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135, 2017.
[29] D. Fox, W. Burgard, and S. Thrun. Active markov localization for mobile robots. Robotics and Autonomous Systems, 25(3-4):195–207, 1998.
[30] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
[31] G. Fritz, C. Seifert, L. Paletta, and H. Bischof. Attentive object detection using an information theoretic saliency measure. In International workshop on attention and performance in computational vision, pages 29–41. Springer, 2004.
[32] D. Ghosh, A. Gupta, and S. Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.
[33] K. Gregor and F. Besse. Temporal difference variational auto-encoder. arXiv preprint arXiv:1806.03107, 2018.
[34] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
[35] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
[36] Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407, 2018.
[37] D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
[38] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
[39] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
[40] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
[41] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[42] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. Multi-task deep reinforcement learning with popart. arXiv preprint arXiv:1809.04474, 2018.
[43] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[44] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess, M. Bosnjak, M. Shanahan, M. Botvinick, D. Hassabis, and A. Lerchner. Scan: Learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017.
[45] J. Hu, M. P. Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. Citeseer, 1998.
[46] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
[47] D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine. Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
[48] Y. Jinnai, J. W. Park, D. Abel, and G. Konidaris. Discovering options for exploration by minimizing cover time. arXiv preprint arXiv:1903.00606, 2019.
[49] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
[50] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.
[51] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2013.
[52] T. Kipf, Y. Li, H. Dai, V. Zambaldi, E. Grefenstette, P. Kohli, and P. Battaglia. Compositional imitation learning: Explaining and executing one task at a time. arXiv preprint arXiv:1812.01483, 2018.
[53] O. Kroemer, C. Daniel, G. Neumann, H. Van Hoof, and J. Peters. Towards learning hierarchical skills for multi-phase manipulation tasks. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1503–1510. IEEE, 2015.
[54] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
[55] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[56] A. Léon and L. Denoyer. Options discovery with budgeted reinforcement learning. arXiv preprint arXiv:1611.06824, 2016.
[57] A. Levy, R. Platt, and K. Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.
[58] A. Q. Li, M. Xanthidis, J. M. O’Kane, and I. Rekleitis. Active localization with dynamic obstacles. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1902–1909. IEEE, 2016.
[59] M. L. Littman. Algorithms for sequential decision making. 1996.
[60] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through l_0 regularization. arXiv preprint arXiv:1712.01312, 2017.
[61]

Lowry, N

S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2015

work page 2015
[62]

C. J. Maddison, A. Mnih, and Y . W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[63]

Marthi and C

B. Marthi and C. Guestrin. Concurrent hierarchical reinforcement learning. 2005

work page 2005
[64]

All you need is a good init

D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[65]

V . Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, K. Kavukcuoglu, et al. Strategic attentive writer for learning macro-actions. arXiv preprint arXiv:1606.04695, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[66]

V . Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016

work page 1928
[67]

Near-Optimal Representation Learning for Hierarchical Reinforcement Learning

O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[68]

Nachum, S

O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efﬁcient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303–3313, 2018

work page 2018
[69]

A. V . Nair, V . Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pages 9191–9200, 2018

work page 2018
[70]

Stick-Breaking Variational Autoencoders

E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[71]

The scientiﬁc objectives of the mars exploration rover

NASA. The scientiﬁc objectives of the mars exploration rover. 2015

work page 2015
[72]

Niekum and S

S. Niekum and S. Chitta. Incremental semantically grounded learning from demonstration. 2013

work page 2013
[73]

Ostrovski, M

G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos. Count-based exploration with neural density models. ICML, 2017

work page 2017
[74]

Y . P. Pane, S. P. Nageshrao, and R. Babuška. Actor-critic reinforcement learning for tracking control in robotics. In Decision and Control (CDC), 2016 IEEE 55th Conference on , pages 5819–5826. IEEE, 2016

work page 2016
[75]

Pathak, P

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self- supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017

work page 2017
[76]

Pertsch, O

K. Pertsch, O. Rybkin, J. Yang, K. Derpanis, J. Lim, K. Daniilidis, and A. Jeable. Keyin: Discovering subgoal structure with keyframe-based video prediction. arXiv, 2019

[77]

S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5690–5701, 2017

[78]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019

[79]

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014

[80]

S. M. Ross. Introduction to stochastic dynamic programming. Academic Press, 2014

[81]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015
