
arxiv: 1906.10667 · v1 · pith:7VDVBWG5new · submitted 2019-06-25 · 💻 cs.LG · cs.AI · stat.ML

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Pith reviewed 2026-05-25 16:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords reinforcement learning · hierarchical policies · policy decomposition · information theory · generalization · primitives · ensembles · options

The pith

A reinforcement learning policy decomposes into primitives that compete by requesting different amounts of state information, with no meta-policy required.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a policy architecture made of multiple primitives where each one independently decides how much information about the current state it needs. The primitive that requests the most information is allowed to act, and all primitives are trained to request as little information as possible. This setup creates decentralized competition that leads to specialization among the primitives. Experiments show the resulting policies generalize better than both standard flat policies and hierarchical policies that rely on an explicit meta-controller. A sympathetic reader would care because the approach removes the need for a separate high-level decision maker while still achieving structured behavior in complex environments.

Core claim

The central claim is that an ensemble of information-constrained primitives can decompose behavior without a meta-policy: each primitive chooses its own information usage about the state, the one requesting the largest amount acts, and regularization to minimize information use induces natural competition and specialization, yielding improved generalization over flat and hierarchical baselines.

What carries the argument

The decentralized selection rule in which the primitive requesting the most state information is chosen to act, paired with per-primitive regularization that penalizes large information requests.
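The selection rule and the penalty can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes each primitive encodes the state as a diagonal Gaussian latent and that its "information request" is measured as the KL divergence from that latent to a standard normal prior. The function names, the example encodings, and the unweighted sum in the penalty are hypothetical; the coefficient 0.005 is one of the candidate values the paper's appendix lists for its β parameters.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), in nats."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def select_primitive(encodings):
    """Return (index of the acting primitive, per-primitive KL costs).

    `encodings` holds one (mean, log_var) pair per primitive, as produced by
    hypothetical per-primitive state encoders.
    """
    costs = [gaussian_kl_to_standard_normal(m, lv) for m, lv in encodings]
    winner = max(range(len(costs)), key=costs.__getitem__)
    return winner, costs


def information_penalty(costs, beta=0.005):
    """Assumed regularizer: beta times the summed information cost."""
    return beta * sum(costs)


# Two primitives looking at the same state: the first stays at the prior
# (requests no information), the second moves its latent mean away from it.
encodings = [
    ([0.0, 0.0], [0.0, 0.0]),
    ([1.0, -1.0], [0.0, 0.0]),
]
winner, costs = select_primitive(encodings)
penalty = information_penalty(costs)
```

In this toy case the second primitive wins the right to act because its latent departs from the prior, and the same departure is what the penalty charges for. That tension is the competition the paper relies on.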

If this is right

  • Behavior can be structured through competition among primitives instead of explicit high-level selection.
  • The absence of a meta-policy removes one source of decision-making overhead in hierarchical reinforcement learning.
  • Specialization emerges from the information-minimization pressure rather than from hand-designed options.
  • The architecture applies to diverse environments where a single flat policy struggles to generalize.
  • Decentralized activation can still produce coherent overall behavior when the selection rule favors informative primitives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same competition mechanism might be tested in settings with continuous action spaces to check whether information requests remain a reliable selection signal.
  • One could replace the information request with other scalar signals, such as uncertainty estimates, to see if the competition dynamic persists.
  • The approach suggests a route to scaling structured policies by increasing the number of primitives while keeping the selection rule fixed.
  • Connection to multi-agent reinforcement learning could be explored by treating each primitive as an independent agent that bids for control via its information request.

Load-bearing premise

Regularizing each primitive to request as little state information as possible while always letting the primitive that requests the most act will automatically produce effective competition and specialization.

What would settle it

A controlled comparison on held-out tasks in which the proposed architecture shows no generalization advantage over a standard hierarchical policy with an explicit meta-controller would falsify the central claim.
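One way to make "no generalization advantage" concrete is to compare success rates on the held-out tasks with an interval estimate. The sketch below is an assumption of this write-up rather than a procedure from the paper: it uses a normal-approximation 95% interval on the difference of two success rates, with 500 evaluation episodes per method as in the paper's evaluation protocol. The success counts are placeholders, not reported results.

```python
import math


def success_rate_gap(successes_a, successes_b, episodes):
    """Success-rate difference (a minus b) with a 95% normal-approximation interval."""
    p_a = successes_a / episodes
    p_b = successes_b / episodes
    se = math.sqrt(p_a * (1 - p_a) / episodes + p_b * (1 - p_b) / episodes)
    gap = p_a - p_b
    return gap, (gap - 1.96 * se, gap + 1.96 * se)


# Placeholder counts for illustration only; these are not results from the paper.
gap, (low, high) = success_rate_gap(successes_a=400, successes_b=350, episodes=500)

# If the interval includes zero, the data are consistent with "no advantage".
no_advantage = low <= 0.0
```

Here `a` stands for the primitive ensemble and `b` for the meta-controller baseline. An interval that straddles zero across several held-out tasks would be the falsifying outcome described above.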

Figures

Figures reproduced from arXiv: 1906.10667 by Anirudh Goyal, Jonathan Binas, Sergey Levine, Shagun Sodhani, Xue Bin Peng, Yoshua Bengio.

Figure 1: Illustration of our model. An intrinsic competition mechanism, based on the amount of…
Figure 2: The primitive-selection mechanism of our model. The primitive with most information acts in the environment, and gets the reward. To encourage each primitive to encode information from a particular part of state space, we limit the amount of information each primitive can access from the state. In particular, each primitive has an information bottleneck with respect to the input state, preventing it from…
Figure 3: Multitask training. Each panel corresponds to a different training setup, where different tasks are denoted A, B, C, ..., and a rectangle with n circles corresponds to an agent composed of n primitives trained on the respective tasks. Top row: activation of primitives for agents trained on single tasks. Bottom row: Retrain: two primitives are trained on A and transferred to B. The results (success rates) i…
Figure 4: Continual Learning Scenario: We consider a continual learning scenario where we train 2 primitives for 2 goal positions, then transfer (and finetune) on 4 goal positions, then transfer (and finetune) on 8 goal positions. The plot on the left shows the primitives remain activated. The solid green line shows the boundary between the tasks. The plot on the right shows the number of samples taken by our model…
Figure 5: Left: Multitask setup, where we show that we are able to train 8 primitives when training on…
Figure 6: Snapshots of motions learned by the policy.
Figure 7: Embeddings visualizing the states (S) and goals (G) which each primitive is active in, and…
Figure 8: Performance on the 2D bandits task. Left: The comparison of our model (blue curve - …
Figure 9: RGB view of the Fetch environment.
Figure 10: RGB view of the Unlock environment. D.2 Tasks in MiniGrid Environment: We consider the following tasks in the MiniGrid environment: 1. Fetch: In the Fetch task, the agent spawns at an arbitrary position in an 8 × 8 grid (Figure 9). It is provided with a natural language goal description of the form “go fetch a yellow box”. The agent has to navigate to the object being referred to in the goal description an…
Figure 11: RGB view of the UnlockPickup environment.
Figure 12: View of the four-room environment. F.1.1 Reward Function: We consider the sparse reward setup where the agent gets a reward (of 1) only if it completes the task (and reaches the goal position) and 0 at all other time steps. We also apply a time limit of 300 steps on all the tasks, i.e., the agent must complete the task in 300 steps. A task is terminated either when the agent solves the task or when the time limi…
Figure 13: View of the Ant Maze environment with 3 goals.
Original abstract

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a decentralized policy architecture for reinforcement learning consisting of an ensemble of primitives. Each primitive is regularized to minimize the mutual information I(s; a) it requires about the state to produce an action; at each step the primitive with the largest I(s; a) is selected to act. No meta-policy is used. The authors claim that the resulting competition induces specialization among the primitives and experimentally demonstrate improved generalization relative to both flat policies and standard hierarchical RL baselines.

Significance. If the claimed specialization mechanism is shown to be stable and the generalization gains are reproducible, the architecture would offer a parameter-light alternative to hierarchical RL that avoids explicit high-level control. The information-theoretic regularization is a clean idea that could be useful in other structured-policy settings.

major comments (2)
  1. [Method (information-theoretic selection and regularization)] The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.
  2. [Experiments] Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).
minor comments (2)
  1. [Method] Notation for the variational information request I(s;a) should be defined explicitly with the variational distribution used; it is unclear whether the same encoder is shared or per-primitive.
  2. [Abstract / Introduction] The abstract states the method 'improves over both flat and hierarchical policies' but does not specify the precise hierarchical baseline (options with learned meta-policy, feudal networks, etc.) or the environments used; these details belong in the introduction or experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns about equilibrium stability and the need for supporting diagnostics are well-taken; we address both points below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.

    Authors: We acknowledge that the manuscript provides no formal stability analysis or usage statistics. The competitive selection by argmax I(s;a) combined with per-primitive information minimization creates an implicit pressure against identical low-information policies, because a primitive that can solve a sub-task with higher I(s;a) will be selected more often and thereby receive gradient updates that further differentiate it; identical policies would yield no such differentiation and would underperform on the diverse tasks used in the experiments. Nevertheless, because this argument is only informal, we will add primitive-usage histograms and a brief discussion of the selection dynamics in the revision. revision: yes

  2. Referee: Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).

    Authors: We agree that the generalization tables alone do not directly demonstrate that the performance lift arises from competition-induced specialization. We will therefore augment the experimental section with the requested diagnostics: per-state-class selection histograms and mutual-information values between primitive identity and task-relevant state features. These additions will be included in the revised manuscript. revision: yes
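The diagnostics discussed in this exchange are cheap to compute from rollout logs. The sketch below is illustrative only and is not taken from the paper: it assumes a log of (state class, acting primitive) pairs, builds the per-state-class selection histogram, and computes a plug-in estimate of the mutual information between state class and primitive identity. The state-class labels and the two example logs are hypothetical.

```python
import math
from collections import Counter


def selection_histogram(records):
    """Map each state class to the fraction of steps each primitive acted in."""
    per_class = {}
    for state_class, primitive in records:
        per_class.setdefault(state_class, Counter())[primitive] += 1
    return {
        c: {k: n / sum(counts.values()) for k, n in counts.items()}
        for c, counts in per_class.items()
    }


def mutual_information(records):
    """Plug-in estimate of I(state class; primitive id), in nats."""
    n = len(records)
    joint = Counter(records)
    class_counts = Counter(c for c, _ in records)
    prim_counts = Counter(k for _, k in records)
    return sum(
        (n_ck / n) * math.log(n_ck * n / (class_counts[c] * prim_counts[k]))
        for (c, k), n_ck in joint.items()
    )


# Hypothetical logs of (state class, acting primitive) pairs.
specialized = [("room_A", 0)] * 50 + [("room_B", 1)] * 50
collapsed = [("room_A", 0)] * 50 + [("room_B", 0)] * 50

mi_specialized = mutual_information(specialized)
mi_collapsed = mutual_information(collapsed)
hist = selection_histogram(specialized)
```

With two equally frequent state classes, perfect specialization gives log 2 nats, while a single primitive acting everywhere gives zero. Values near zero on real logs would support the referee's concern that the ensemble behaves like a flat policy.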

Circularity Check

0 steps flagged

No circularity; experimental claim is self-contained

full rationale

The paper defines a policy architecture using per-primitive variational information regularization and decentralized argmax selection on requested information, then reports experimental generalization gains versus flat and hierarchical baselines. No derivation chain exists that reduces a claimed result to its own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The architecture is stated directly via information-theoretic terms, and the central claim is empirical performance rather than a mathematical identity or uniqueness theorem. The skeptic concern addresses mechanism validity, not circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is abstract-only; no concrete free parameters, axioms, or invented entities can be extracted beyond the high-level concepts mentioned.

pith-pipeline@v0.9.0 · 5735 in / 1003 out tokens · 31055 ms · 2026-05-25T16:20:16.088799+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 16 internal anchors

  1. [1]

    Information Dropout: Learning Optimal Representations Through Noisy Computation

    Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016. URL http://arxiv.org/abs/1611.01353

  2. [2]

    Deep Variational Information Bottleneck

    Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410

  3. [3]

    Modular multitask reinforcement learning with policy sketches

    Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 166–175. JMLR.org, 2017

  4. [4]

    The option-critic architecture

    Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017

  5. [5]

    Openai gym, 2016

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016

  6. [6]

    Minimalistic gridworld environment for openai gym

    Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018

  7. [7]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  8. [8]

    Hierarchical relative entropy policy search

    Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012

  9. [9]

    Feudal reinforcement learning

    Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993

  10. [10]

    Hierarchical reinforcement learning with the maxq value function decomposition

    Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000

  11. [11]

    Diversity is All You Need: Learning Skills without a Reward Function

    Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018

  12. [12]

    Stochastic Neural Networks for Hierarchical Reinforcement Learning

    Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017

  13. [13]

    Meta Learning Shared Hierarchies

    K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta Learning Shared Hierarchies. arXiv e-prints, October 2017

  14. [14]

    Meta Learning Shared Hierarchies

    Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017

  15. [15]

    Latent Space Policies for Hierarchical Reinforcement Learning

    Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018

  16. [16]

    Emergence of Locomotion Behaviours in Rich Environments

    Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017

  17. [17]

    Inferring and executing programs for visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  19. [19]

    Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation

    Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016

  20. [20]

    Learning to schedule control fragments for physics-based characters using deep q-learning

    Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics, 36(3), 2017

  21. [21]

    A Laplacian Framework for Option Discovery in Reinforcement Learning

    Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017

  22. [22]

    Hierarchical visuomotor control of humanoids

    Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfYvo09Y7

  23. [23]

    Neural probabilistic motor primitives for humanoid control

    Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJl6TjRcY7

  24. [24]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

  25. [25]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016

  26. [26]

    Learning Independent Causal Mechanisms

    Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017

  27. [27]

    Automatic differentiation in PyTorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017

  28. [28]

    Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning

    Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4):41:1–41:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602

  29. [29]

    Deepmimic: Example-guided deep reinforcement learning of physics-based character skills

    Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311

  30. [30]

    Routing Networks and the Challenges of Modular and Compositional Computation

    Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019

  31. [31]

    FeUdal Networks for Hierarchical Reinforcement Learning

    Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017

  32. [32]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  34. [34]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998

  35. [35]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657...

  36. [37]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999

  37. [38]

    Lecture 6.5-rmsprop, coursera: Neural networks for machine learning

    Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012

  38. [39]

    The information bottleneck method

    Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057

  39. [40]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

  40. [41]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html

  41. [42]

    Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

    Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

  42. [43]

    Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation

    Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.

    A. Interpretation of the regularization term. The regularization term is given by L_reg = ∑_k α_k L_k, where α_k = e^{L_k}/...

  43. [44]

    Can our proposed approach learn primitives which remain active throughout training?

  44. [45]

    It is important that both the primitives remain active

    Can our proposed approach learn primitives which can solve the task? We train two primitives on the 2D Bandits tasks and evaluate the relative frequency of activation of the primitives throughout the training. It is important that both the primitives remain active. If only 1 primitive is acting most of the time, its effect would be the same as training a ...

  45. [46]

    Can our proposed approach learn primitives that remain active when training the agent over a sequence of tasks?

  46. [47]

    In the baseline setup, we train a flat A2C policy on Fourrooms-v0 till it achieves a 100% success rate during evaluation

    Can our proposed approach be used to improve the sample efficiency of the agent over a sequence of tasks? To answer these questions, we consider two setups. In the baseline setup, we train a flat A2C policy on Fourrooms-v0 till it achieves a 100% success rate during evaluation. Then we transfer this policy to Fourrooms-v1 and continue to train till it achie...

  47. [48]

    All the models (proposed as well as the baselines) are implemented in PyTorch 1.1 [27] unless stated otherwise.

  48. [49]

    For Meta-Learning Shared Hierarchies [14] and Option-Critic [4], we adapted the authors’ implementations for our environments

  49. [50]

    success rate

    During the evaluation, we use 10 processes in parallel to run 500 episodes and compute the percentage of times the agent solves the task within the prescribed time limit. This metric is referred to as the “success rate”

  50. [51]

    The default time limit is 500 steps for all the tasks unless specified otherwise

  51. [52]

    All the feedforward networks are initialized with the orthogonal initialization where the input tensor is filled with a (semi) orthogonal matrix

  52. [53]

    For all the embedding layers, the weights are initialized using the unit-Gaussian distribution

  53. [54]

    The weights and biases for all the GRU models are initialized using the uniform distribution U(−√k, √k), where k = 1/hidden_size

  54. [55]

    During training, we perform 64 rollouts in parallel to collect 5-step trajectories

  55. [56]

    Unlock the door and pick up the red ball

    The βind and βreg parameters are both selected from the set {0.001, 0.005, 0.009} by performing validation. In section D.4.2, we explain all the components of the model architecture along with the implementation details in the context of the MiniGrid Environment. For the subsequent environments, we describe only those components and implementation detai...

  56. [57]

    go fetch a yellow box

    Fetch: In the Fetch task, the agent spawns at an arbitrary position in an 8 × 8 grid (Figure 9). It is provided with a natural language goal description of the form “go fetch a yellow box”. The agent has to navigate to the object being referred to in the goal description and pick it up

  57. [58]

    open the door

    Unlock: In the Unlock task, the agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8 × 8 square (Figure 10). It is provided with a natural language goal description of the form “open the door”. The agent has to find the key that ...

  58. [59]

    open the door and pick up the yellow box

    UnlockPickup: This task is basically a union of the Unlock and the Fetch tasks. The agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8 × 8 square (Figure 11). It is provided with a natural language goal description of the form “open the door and pick up the yellow box”. The agent has to find the key that corresponds to the ...

  59. [60]

    Two-layer feedforward network with the tanh non-linearity

  60. [61]

    Input: Concatenation of z and the current hidden state of the observation-rnn

  61. [62]

    The sizes of the inputs to the first and second layers of the policy network are 320 and 64 respectively

  62. [63]

    D.4 Components specific to the proposed model The components that we described so far are used by both the baselines as well as our proposed model

    Produces a scalar output. D.4 Components specific to the proposed model The components that we described so far are used by both the baselines as well as our proposed model. We now describe the components that are specific to our proposed model. Our proposed model consists of an ensemble of primitives and the components we describe apply to each of those pr...