Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Anirudh Goyal; Jonathan Binas; Sergey Levine; Shagun Sodhani; Xue Bin Peng; Yoshua Bengio

arXiv: 1906.10667 · v1 · pith:7VDVBWG5new · submitted 2019-06-25 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 16:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords reinforcement learning · hierarchical policies · policy decomposition · information theory · generalization · primitives · ensembles · options

The pith

A reinforcement learning policy decomposes into primitives that compete by requesting different amounts of state information, with no meta-policy required.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a policy architecture made of multiple primitives where each one independently decides how much information about the current state it needs. The primitive that requests the most information is allowed to act, and all primitives are trained to request as little information as possible. This setup creates decentralized competition that leads to specialization among the primitives. Experiments show the resulting policies generalize better than both standard flat policies and hierarchical policies that rely on an explicit meta-controller. A sympathetic reader would care because the approach removes the need for a separate high-level decision maker while still achieving structured behavior in complex environments.

Core claim

The central claim is that an ensemble of information-constrained primitives can decompose behavior without a meta-policy: each primitive chooses its own information usage about the state, the one requesting the largest amount acts, and regularization to minimize information use induces natural competition and specialization, yielding improved generalization over flat and hierarchical baselines.

What carries the argument

The decentralized selection rule in which the primitive requesting the most state information is chosen to act, paired with per-primitive regularization that penalizes large information requests.
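
A minimal sketch of this selection rule in Python, under assumptions that this page does not confirm: each primitive's information request is measured as the KL divergence between a diagonal Gaussian encoding of the state and a standard normal prior. The function names and the example encoder outputs are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )


def select_primitive(encodings):
    """Return (winner, costs): the primitive with the largest information cost acts."""
    costs = [gaussian_kl_to_standard_normal(mu, sigma) for mu, sigma in encodings]
    winner = max(range(len(costs)), key=costs.__getitem__)
    return winner, costs


# Hypothetical encoder outputs (mu, sigma) of three primitives for one state.
encodings = [
    ([0.0, 0.0], [1.0, 1.0]),   # matches the prior, so it requests no information
    ([1.5, -0.5], [0.6, 0.9]),  # deviates strongly from the prior
    ([0.2, 0.1], [1.0, 0.8]),   # deviates slightly from the prior
]
winner, costs = select_primitive(encodings)
print(winner, [round(c, 3) for c in costs])
```

In this example the second primitive (index 1) has the largest cost and is selected, while the first matches the prior and has zero cost. A regularizer of the kind described above would penalize each of these costs during training.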

If this is right

Behavior can be structured through competition among primitives instead of explicit high-level selection.
The absence of a meta-policy removes one source of decision-making overhead in hierarchical reinforcement learning.
Specialization emerges from the information-minimization pressure rather than from hand-designed options.
The architecture applies to diverse environments where a single flat policy struggles to generalize.
Decentralized activation can still produce coherent overall behavior when the selection rule favors informative primitives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same competition mechanism might be tested in settings with continuous action spaces to check whether information requests remain a reliable selection signal.
One could replace the information request with other scalar signals, such as uncertainty estimates, to see if the competition dynamic persists.
The approach suggests a route to scaling structured policies by increasing the number of primitives while keeping the selection rule fixed.
Connection to multi-agent reinforcement learning could be explored by treating each primitive as an independent agent that bids for control via its information request.

Load-bearing premise

Regularizing each primitive to request as little state information as possible while always letting the primitive that requests the most act will automatically produce effective competition and specialization.

What would settle it

A controlled comparison on held-out tasks in which the proposed architecture shows no generalization advantage over a standard hierarchical policy with an explicit meta-controller would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.10667 by Anirudh Goyal, Jonathan Binas, Sergey Levine, Shagun Sodhani, Xue Bin Peng, Yoshua Bengio.

**Figure 2.** The primitive-selection mechanism of our model. The primitive with the most information acts in the environment, and gets the reward. To encourage each primitive to encode information from a particular part of state space, we limit the amount of information each primitive can access from the state. In particular, each primitive has an information bottleneck with respect to the input state, preventing it from…

**Figure 3.** Multitask training. Each panel corresponds to a different training setup, where different tasks are denoted A, B, C, ..., and a rectangle with n circles corresponds to an agent composed of n primitives trained on the respective tasks. Top row: activation of primitives for agents trained on single tasks. Bottom row: Retrain: two primitives are trained on A and transferred to B. The results (success rates) i…

**Figure 4.** Continual learning scenario: We consider a continual learning scenario where we train 2 primitives for 2 goal positions, then transfer (and finetune) on 4 goal positions, then transfer (and finetune) on 8 goal positions. The plot on the left shows the primitives remain activated. The solid green line shows the boundary between the tasks. The plot on the right shows the number of samples taken by our model …

**Figure 5.** Left: Multitask setup, where we show that we are able to train 8 primitives when training on …

**Figure 6.** Snapshots of motions learned by the policy.

**Figure 7.** Embeddings visualizing the states (S) and goals (G) which each primitive is active in, and …

**Figure 8.** Performance on the 2D bandits task. Left: The comparison of our model (blue curve - …

**Figure 9.** RGB view of the Fetch environment.

**Figure 10.** RGB view of the Unlock environment.

**Figure 11.** RGB view of the UnlockPickup environment.

**Figure 12.** View of the four-room environment.

F.1.1 Reward Function. We consider the sparse reward setup where the agent gets a reward (of 1) only if it completes the task (and reaches the goal position) and 0 at all other time steps. We also apply a time limit of 300 steps on all the tasks, i.e., the agent must complete the task in 300 steps. A task is terminated either when the agent solves the task or when the time limi…

**Figure 13.** View of the Ant Maze environment with 3 goals.

read the original abstract

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper replaces the meta-policy with information-based competition among primitives, but the experiments leave open whether that competition actually produces stable specialization.

The central move is to drop the high-level meta-policy and let each primitive decide how much state information it needs; the one that requests the most gets to act, while all are regularized to request as little as possible. This produces a decentralized selection rule that is distinct from the hierarchical RL setups cited in the abstract. The idea is straightforward and sidesteps the problem of training a meta-policy that still has to handle every state. That part is new and worth noting. The experiments report better generalization than both flat policies and standard hierarchical ones, which would be useful if the mechanism works as described.

The concern that the competition may not yield real specialization is reasonable on the evidence given. Nothing in the selection rule or the uniform regularization prevents all primitives from converging to the same low-information policy; in that case the argmax just adds noise and the architecture collapses to something closer to a flat policy with extra variance. The abstract gives no information on initialization, extra diversity terms, or post-training statistics on primitive usage, so it is not possible to tell whether the reported gains are driven by the claimed competition.

The paper is aimed at people working on options and hierarchical methods who are looking for alternatives to explicit meta-controllers. It shows clear thinking about the problem and engages the literature directly. It deserves a serious referee, even if the experiments will need more scrutiny on whether the specialization actually occurs.

Referee Report

2 major / 2 minor

Summary. The paper proposes a decentralized policy architecture for reinforcement learning consisting of an ensemble of primitives. Each primitive is regularized to minimize the mutual information I(s; a) it requires about the state to produce an action; at each step the primitive with the largest I(s; a) is selected to act. No meta-policy is used. The authors claim that the resulting competition induces specialization among the primitives and experimentally demonstrate improved generalization relative to both flat policies and standard hierarchical RL baselines.

Significance. If the claimed specialization mechanism is shown to be stable and the generalization gains are reproducible, the architecture would offer a parameter-light alternative to hierarchical RL that avoids explicit high-level control. The information-theoretic regularization is a clean idea that could be useful in other structured-policy settings.

major comments (2)

[Method (information-theoretic selection and regularization)] The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.
[Experiments] Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).
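
The diagnostics requested in the second major comment could be computed along the following lines. This is only a sketch under stated assumptions: the selection-log format, the state-class labels, and both function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def usage_histogram(log):
    """Map each state class to a Counter of how often each primitive was selected."""
    hist = {}
    for state_class, primitive in log:
        hist.setdefault(state_class, Counter())[primitive] += 1
    return hist


def mutual_information_bits(log):
    """Empirical I(state class; primitive id) in bits from (class, id) pairs."""
    n = len(log)
    joint = Counter(log)
    class_counts = Counter(c for c, _ in log)
    primitive_counts = Counter(k for _, k in log)
    mi = 0.0
    for (c, k), count in joint.items():
        # p(c, k) / (p(c) * p(k)) simplifies to count * n / (count_c * count_k).
        ratio = count * n / (class_counts[c] * primitive_counts[k])
        mi += (count / n) * math.log2(ratio)
    return mi


# Hypothetical logs of (state class, index of the primitive that acted).
specialized = [("room_A", 0)] * 50 + [("room_B", 1)] * 50
collapsed = [("room_A", 0), ("room_A", 1), ("room_B", 0), ("room_B", 1)] * 25

print(usage_histogram(specialized))
print(mutual_information_bits(specialized), mutual_information_bits(collapsed))
```

For the `specialized` log the mutual information is 1 bit, and for the `collapsed` log it is 0 bits. A value near zero on real selection logs would be consistent with the symmetric-collapse scenario raised in the first major comment.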

minor comments (2)

[Method] Notation for the variational information request I(s;a) should be defined explicitly with the variational distribution used; it is unclear whether the same encoder is shared or per-primitive.
[Abstract / Introduction] The abstract states the method 'improves over both flat and hierarchical policies' but does not specify the precise hierarchical baseline (options with learned meta-policy, feudal networks, etc.) or the environments used; these details belong in the introduction or experimental section.
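
To illustrate the kind of explicit notation the first minor comment asks for, one standard way to write a variational information term is the upper bound used in the variational information bottleneck literature. This is an assumption for illustration, not the paper's confirmed definition: $q_k(z \mid s)$ denotes a hypothetical per-primitive encoder, $r(z)$ a fixed prior, and $k^\star(s)$ the selected primitive.

```latex
I(S; Z_k) \;\le\; \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\big( q_k(z \mid s) \,\|\, r(z) \big) \right],
\qquad
k^\star(s) = \arg\max_{k} \; D_{\mathrm{KL}}\big( q_k(z \mid s) \,\|\, r(z) \big)
```

The inequality holds for any choice of $r(z)$, which is why stating the variational distribution, and whether the encoder is shared or per-primitive, matters for interpreting the reported quantity.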

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns about equilibrium stability and the need for supporting diagnostics are well-taken; we address both points below and will revise the manuscript accordingly.

Referee: The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.

Authors: We acknowledge that the manuscript provides no formal stability analysis or usage statistics. The competitive selection by argmax I(s;a) combined with per-primitive information minimization creates an implicit pressure against identical low-information policies, because a primitive that can solve a sub-task with higher I(s;a) will be selected more often and thereby receive gradient updates that further differentiate it; identical policies would yield no such differentiation and would underperform on the diverse tasks used in the experiments. Nevertheless, because this argument is only informal, we will add primitive-usage histograms and a brief discussion of the selection dynamics in the revision. (Revision: yes.)
Referee: Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).

Authors: We agree that the generalization tables alone do not directly demonstrate that the performance lift arises from competition-induced specialization. We will therefore augment the experimental section with the requested diagnostics: per-state-class selection histograms and mutual-information values between primitive identity and task-relevant state features. These additions will be included in the revised manuscript. (Revision: yes.)

Circularity Check

0 steps flagged

No circularity; experimental claim is self-contained

The paper defines a policy architecture using per-primitive variational information regularization and decentralized argmax selection on requested information, then reports experimental generalization gains versus flat and hierarchical baselines. There is no derivation chain that reduces a claimed result to its own inputs by construction, no fitted parameter renamed as a prediction, and no load-bearing self-citation. The architecture is stated directly via information-theoretic terms, and the central claim is empirical performance rather than a mathematical identity or uniqueness theorem. The skeptic's concern addresses mechanism validity, not circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is abstract-only; no concrete free parameters, axioms, or invented entities can be extracted beyond the high-level concepts mentioned.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 16 internal anchors

[1] Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016. URL http://arxiv.org/abs/1611.01353
[2] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410
[3] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 166–175. JMLR.org, 2017
[4] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017
[5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016
[6] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014
[8] Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012
[9] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993
[10] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000
[11] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018
[12] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017
[13] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta Learning Shared Hierarchies. arXiv e-prints, October 2017
[14] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017
[15] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018
[16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017
[17] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
[19] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016
[20] Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics, 36(3), 2017
[21] Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017
[22] Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfYvo09Y7
[23] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJl6TjRcY7
[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015
[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016
[26] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017
[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017
[28] Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4):41:1–41:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602
[29] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311
[30] Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019
[31] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017
[32] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
[33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
[34] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998
[35] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657...
[37] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999
[38] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012
[39] Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057
[40] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012
[41] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html
[42] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696
[43] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017

A. Interpretation of the regularization term. The regularization term is given by $L_{\mathrm{reg}} = \sum_k \alpha_k L_k$, where $\alpha_k = e^{L_k}/$...

[44]

Can our proposed approach learn primitives which remain active throughout training?

work page
[45]

It is important that both the primitives remain active

Can our proposed approach learn primitives which can solve the task? We train two primitives on the 2D Bandits tasks and evaluate the relative frequency of activation of the primitives throughout the training. It is important that both the primitives remain active. If only 1 primitive is acting most of the time, its effect would be the same as training a ...

work page
[46]

Can our proposed approach learn primitives that remain active when training the agent over a sequence of tasks?

work page
[47]

In the baseline setup, we train a ﬂat A2C policy on Fourrooms-v0till it achieves a 100 % success rate during evaluation

Can our proposed approach be used to improve the sample efﬁciency of the agent over a sequence of tasks? To answer these questions, we consider two setups. In the baseline setup, we train a ﬂat A2C policy on Fourrooms-v0till it achieves a 100 % success rate during evaluation. Then we transfer this policy to Fourrooms-v1 and continue to train till it achie...

work page
[48]

All the models (proposed as well as the baselines) are implemented in Pytorch 1.1 unless stated otherwise. [27]

work page
[49]

For Meta-Learning Shared Hierarchies [14] and Option-Critic [4], we adapted the author’s implementations 5for our environments

work page
[50]

success rate

During the evaluation, we use 10 processes in parallel to run 500 episodes and compute the percentage of times the agent solves the task within the prescribed time limit. This metric is referred to as the “success rate”

work page
[51]

The default time limit is 500 steps for all the tasks unless speciﬁed otherwise

work page
[52]

All the feedforward networks are initialized with the orthogonal initialization where the input tensor is ﬁlled with a (semi) orthogonal matrix

work page
[53]

For all the embedding layers, the weights are initialized using the unit-Gaussian distribution

work page
[54]

The weights and biases for all the GRU model are initialized using the uniform distribution fromU (− √ k, √ k) wherek = 1 hidden_size

work page
[55]

During training, we perform 64 rollouts in parallel to collect 5-step trajectories

work page
[56]

Unlock the door and pick up the red ball

The $\beta_{ind}$ and $\beta_{reg}$ parameters are both selected from the set {0.001, 0.005, 0.009} by performing validation. In section D.4.2, we explain all the components of the model architecture along with the implementation details in the context of the MiniGrid Environment. For the subsequent environments, we describe only those components and implementation detai...
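The selection of the two coefficients can be sketched as a search over the nine candidate pairs. The text does not say whether the two values were tuned jointly or one at a time, so the joint grid below is an assumption. `validate` is a hypothetical callable that returns a validation score (higher is better) for a given pair.

```python
import itertools
from typing import Callable, Tuple

BETA_GRID = (0.001, 0.005, 0.009)  # candidate values from the text


def select_betas(validate: Callable[[float, float], float]) -> Tuple[float, float]:
    """Return the (beta_ind, beta_reg) pair with the highest validation score."""
    candidates = itertools.product(BETA_GRID, repeat=2)
    return max(candidates, key=lambda pair: validate(*pair))
```

In practice `validate` would train the model with the given coefficients and report, for instance, the success rate on held-out episodes.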

Fetch: In the Fetch task, the agent spawns at an arbitrary position in an 8 × 8 grid (figure 9). It is provided with a natural language goal description of the form “go fetch a yellow box”. The agent has to navigate to the object referred to in the goal description and pick it up.

Unlock: In the Unlock task, the agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8 × 8 square (figure 10). It is provided with a natural language goal description of the form “open the door”. The agent has to find the key that ...

Figure 11: RGB view of the UnlockPickup environment.

6 https://github.com/maximecb/gym-minigrid

UnlockPickup: This task is essentially a union of the Unlock and the Fetch tasks. The agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8 × 8 square (figure 11). It is provided with a natural language goal description of the form “open the door and pick up the yellow box”. The agent has to find the key that corresponds to the ...

Two-layer feedforward network with the tanh non-linearity.

Input: Concatenation of z and the current hidden state of the observation-rnn.

The input sizes of the first and second layers of the policy network are 320 and 64, respectively.
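The shapes above can be illustrated with a plain-Python forward pass through a two-layer tanh network (320 inputs, 64 hidden units). The number of actions (`N_ACTIONS = 7`), the softmax output, and the placeholder uniform weight initialization are assumptions made for this sketch. None of them is stated in the text at this point.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, bias)

IN_SIZE, HIDDEN_SIZE = 320, 64  # layer input sizes from the text
N_ACTIONS = 7                   # assumed size of the action space


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Placeholder initialization; not the orthogonal scheme used in the paper."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(x: List[float], layer1: Layer, layer2: Layer) -> List[float]:
    """Map a 320-dimensional input to a distribution over actions."""
    hidden = [math.tanh(h) for h in linear(x, layer1)]
    return softmax(linear(hidden, layer2))
```

Here `x` stands for the concatenation of z and the hidden state of the observation-rnn. How the 320 dimensions are split between the two is not specified in this excerpt.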

Produces a scalar output.

D.4 Components specific to the proposed model

The components that we described so far are used by both the baselines as well as our proposed model. We now describe the components that are specific to our proposed model. Our proposed model consists of an ensemble of primitives and the components we describe apply to each of those pr...

[1] Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016. URL http://arxiv.org/abs/1611.01353

[2] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410

[3] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 166–175. JMLR.org, 2017.

[4] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.

[5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[6] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[8] Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012.

[9] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993.

[10] Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[11] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

[12] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.

[13] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta Learning Shared Hierarchies. arXiv e-prints, October 2017.

[14] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.

[15] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.

[16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

[17] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.

[20] Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics, 36(3), 2017.

[21] Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017.

[22] Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfYvo09Y7

[23] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJl6TjRcY7

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[25] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.

[26] Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017.

[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[28] Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4):41:1–41:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602

[29] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311

[30] Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019.

[31] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.

[32] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[34] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.

[35] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657...

[37] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.

[38] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.

[39] Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057

[40] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[41] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html

[42] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

[43] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.

A Interpretation of the regularization term

The regularization term is given by $L_{reg} = \sum_k \alpha_k L_k$, where $\alpha_k = e^{L_k}/$...

Can our proposed approach learn primitives which remain active throughout training?

Can our proposed approach learn primitives which can solve the task? We train two primitives on the 2D Bandits tasks and evaluate the relative frequency of activation of the primitives throughout the training. It is important that both the primitives remain active. If only 1 primitive is acting most of the time, its effect would be the same as training a ...

Can our proposed approach learn primitives that remain active when training the agent over a sequence of tasks?
