Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
Pith reviewed 2026-05-25 16:20 UTC · model grok-4.3
The pith
A reinforcement learning policy decomposes into primitives that compete by requesting different amounts of state information, with no meta-policy required.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an ensemble of information-constrained primitives can decompose behavior without a meta-policy: each primitive chooses its own information usage about the state, the one requesting the largest amount acts, and regularization to minimize information use induces natural competition and specialization, yielding improved generalization over flat and hierarchical baselines.
What carries the argument
The decentralized selection rule in which the primitive requesting the most state information is chosen to act, paired with per-primitive regularization that penalizes large information requests.
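The selection rule above can be sketched in a few lines. This is a minimal illustration, assuming each primitive's information request is measured as the KL divergence of a diagonal-Gaussian state encoding from a standard-normal prior; that variational form is an assumption and is not confirmed by the text shown here. The names `kl_to_standard_normal` and `select_primitive` and the example numbers are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )

def select_primitive(encodings):
    """Return (index, requests): the primitive with the largest request acts.

    `encodings` is a list of (mu, sigma) pairs, one per primitive, as produced
    by hypothetical per-primitive encoders for the current state.
    """
    requests = [kl_to_standard_normal(mu, sigma) for mu, sigma in encodings]
    winner = max(range(len(requests)), key=requests.__getitem__)
    return winner, requests

# Two primitives looking at the same state: the first encoder stays at the
# prior (requests no information), the second moves away from it.
encodings = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([1.5, -0.5], [0.5, 0.8]),
]
winner, requests = select_primitive(encodings)
```

Under this reading, the regularizer would penalize each entry of `requests`, so a primitive should only win the argmax in states where paying for the information is worthwhile.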
If this is right
- Behavior can be structured through competition among primitives instead of explicit high-level selection.
- The absence of a meta-policy removes one source of decision-making overhead in hierarchical reinforcement learning.
- Specialization emerges from the information-minimization pressure rather than from hand-designed options.
- The architecture applies to diverse environments where a single flat policy struggles to generalize.
- Decentralized activation can still produce coherent overall behavior when the selection rule favors informative primitives.
Where Pith is reading between the lines
- The same competition mechanism might be tested in settings with continuous action spaces to check whether information requests remain a reliable selection signal.
- One could replace the information request with other scalar signals, such as uncertainty estimates, to see if the competition dynamic persists.
- The approach suggests a route to scaling structured policies by increasing the number of primitives while keeping the selection rule fixed.
- Connection to multi-agent reinforcement learning could be explored by treating each primitive as an independent agent that bids for control via its information request.
Load-bearing premise
Regularizing each primitive to request as little state information as possible while always letting the primitive that requests the most act will automatically produce effective competition and specialization.
What would settle it
A controlled comparison on held-out tasks in which the proposed architecture shows no generalization advantage over a standard hierarchical policy with an explicit meta-controller would falsify the central claim.
Original abstract
Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a decentralized policy architecture for reinforcement learning consisting of an ensemble of primitives. Each primitive is regularized to minimize the mutual information I(s; a) it requires about the state to produce an action; at each step the primitive with the largest I(s; a) is selected to act. No meta-policy is used. The authors claim that the resulting competition induces specialization among the primitives and experimentally demonstrate improved generalization relative to both flat policies and standard hierarchical RL baselines.
Significance. If the claimed specialization mechanism is shown to be stable and the generalization gains are reproducible, the architecture would offer a parameter-light alternative to hierarchical RL that avoids explicit high-level control. The information-theoretic regularization is a clean idea that could be useful in other structured-policy settings.
major comments (2)
- [Method (information-theoretic selection and regularization)] The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.
- [Experiments] Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).
minor comments (2)
- [Method] Notation for the variational information request I(s;a) should be defined explicitly with the variational distribution used; it is unclear whether the same encoder is shared or per-primitive.
- [Abstract / Introduction] The abstract states the method 'improves over both flat and hierarchical policies' but does not specify the precise hierarchical baseline (options with learned meta-policy, feudal networks, etc.) or the environments used; these details belong in the introduction or experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The concerns about equilibrium stability and the need for supporting diagnostics are well-taken; we address both points below and will revise the manuscript accordingly.
-
Referee: The central experimental claim (improved generalization) rests on the primitives developing distinct competencies. The selection rule (argmax over per-primitive I(s;a)) together with uniform regularization does not obviously preclude symmetric convergence to identical low-information policies; in that case the architecture reduces to a flat policy plus selection noise. No analysis of equilibrium stability or primitive-usage statistics is provided to rule this out.
Authors: We acknowledge that the manuscript provides no formal stability analysis or usage statistics. The competitive selection by argmax I(s;a) combined with per-primitive information minimization creates an implicit pressure against identical low-information policies, because a primitive that can solve a sub-task with higher I(s;a) will be selected more often and thereby receive gradient updates that further differentiate it; identical policies would yield no such differentiation and would underperform on the diverse tasks used in the experiments. Nevertheless, because this argument is only informal, we will add primitive-usage histograms and a brief discussion of the selection dynamics in the revision.
revision: yes
-
Referee: Table or figure reporting generalization results: the reported gains over flat and hierarchical baselines cannot be attributed to the claimed competition without accompanying diagnostics (e.g., histograms of which primitive is selected per state class or mutual information between primitive identities and task-relevant state features).
Authors: We agree that the generalization tables alone do not directly demonstrate that the performance lift arises from competition-induced specialization. We will therefore augment the experimental section with the requested diagnostics: per-state-class selection histograms and mutual-information values between primitive identity and task-relevant state features. These additions will be included in the revised manuscript.
revision: yes
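The two diagnostics discussed in this exchange (per-state-class selection histograms and mutual information between primitive identity and state features) could be computed from logged selections roughly as follows. This is a sketch under stated assumptions, not the authors' analysis code: the log format of `(state_class, primitive_id)` pairs, the function names, and the example data are all hypothetical.

```python
import math
from collections import Counter

def usage_histograms(records):
    """Count how often each primitive is selected within each state class.

    `records` is an iterable of (state_class, primitive_id) pairs.
    """
    hist = {}
    for state_class, primitive in records:
        hist.setdefault(state_class, Counter())[primitive] += 1
    return hist

def mutual_information(records):
    """Empirical I(state_class; primitive_id) in nats from paired counts."""
    records = list(records)
    n = len(records)
    joint = Counter(records)
    p_state = Counter(s for s, _ in records)
    p_prim = Counter(k for _, k in records)
    mi = 0.0
    for (s, k), c in joint.items():
        # c / n is the joint probability; the log ratio compares it with
        # what independence of state class and primitive would predict.
        mi += (c / n) * math.log(c * n / (p_state[s] * p_prim[k]))
    return mi

# Hypothetical logs: perfectly specialized vs. collapsed onto one primitive.
specialized = [("key", 0)] * 50 + [("door", 1)] * 50
collapsed = [("key", 0)] * 50 + [("door", 0)] * 50
```

A mutual information near zero would be consistent with the symmetric-collapse scenario the referee raises, while a value near the entropy of the state classes would indicate that primitive identity tracks the state class closely.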
Circularity Check
No circularity; experimental claim is self-contained
The paper defines a policy architecture using per-primitive variational information regularization and decentralized argmax selection on requested information, then reports experimental generalization gains versus flat and hierarchical baselines. No derivation chain exists that reduces a claimed result to its own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The architecture is stated directly via information-theoretic terms, and the central claim is empirical performance rather than a mathematical identity or uniqueness theorem. The skeptic concern addresses mechanism validity, not circular reduction.
Reference graph
Works this paper leans on
-
[1]
Information Dropout: Learning Optimal Representations Through Noisy Computation
Alessandro Achille and Stefano Soatto. Information dropout: learning optimal representations through noise. CoRR, abs/1611.01353, 2016. URL http://arxiv.org/abs/1611.01353
-
[2]
Deep Variational Information Bottleneck
Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410
-
[3]
Modular multitask reinforcement learning with policy sketches
Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 166–175. JMLR.org, 2017
-
[4]
The option-critic architecture
Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017
-
[5]
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016
-
[6]
Minimalistic gridworld environment for openai gym
Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for openai gym. https://github.com/maximecb/gym-minigrid, 2018
-
[7]
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014
-
[8]
Hierarchical relative entropy policy search
Christian Daniel, Gerhard Neumann, and Jan Peters. Hierarchical relative entropy policy search. In Artificial Intelligence and Statistics, pages 273–281, 2012
-
[9]
Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information processing systems, pages 271–278, 1993
-
[10]
Hierarchical reinforcement learning with the maxq value function decomposition
Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000
-
[11]
Diversity is All You Need: Learning Skills without a Reward Function
Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018
-
[12]
Stochastic Neural Networks for Hierarchical Reinforcement Learning
Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017
- [13]
-
[14]
Meta Learning Shared Hierarchies
Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017
-
[15]
Latent Space Policies for Hierarchical Reinforcement Learning
Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018
-
[16]
Emergence of Locomotion Behaviours in Rich Environments
Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017
-
[17]
Inferring and executing programs for visual reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017
-
[18]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
-
[19]
Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation
Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
-
[20]
Learning to schedule control fragments for physics-based characters using deep q-learning
Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics, 36(3), 2017
-
[21]
A Laplacian Framework for Option Discovery in Reinforcement Learning
Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017
-
[22]
Hierarchical visuomotor control of humanoids
Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, and Greg Wayne. Hierarchical visuomotor control of humanoids. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJfYvo09Y7
-
[23]
Neural probabilistic motor primitives for humanoid control
Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJl6TjRcY7
-
[24]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015
-
[25]
Asynchronous methods for deep reinforcement learning
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016
-
[26]
Learning Independent Causal Mechanisms
Giambattista Parascandolo, Niki Kilbertus, Mateo Rojas-Carulla, and Bernhard Schölkopf. Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961, 2017
-
[27]
Automatic differentiation in PyTorch
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017
-
[28]
Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning
Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4):41:1–41:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602
-
[29]
Deepmimic: Example-guided deep reinforcement learning of physics-based character skills
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311
-
[30]
Routing Networks and the Challenges of Modular and Compositional Computation
Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, and Tim Klinger. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019
-
[31]
FeUdal Networks for Hierarchical Reinforcement Learning
Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017
-
[32]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
-
[33]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
-
[34]
Reinforcement learning: An introduction
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998
-
[35]
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, Cambridge, MA, USA, 1999. MIT Press. URL http://dl.acm.org/citation.cfm?id=3009657...
-
[37]
Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning
Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999
-
[38]
Lecture 6.5-rmsprop, coursera: Neural networks for machine learning
Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012
-
[39]
Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057
-
[40]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012
-
[41]
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html
-
[42]
Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696
-
[43]
Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation
Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pages 5279–5288, 2017.
A Interpretation of the regularization term
The regularization term is given by L_reg = ∑_k α_k L_k, where α_k = e^{L_k}/...
-
[44]
Can our proposed approach learn primitives which remain active throughout training?
-
[45]
It is important that both the primitives remain active
Can our proposed approach learn primitives which can solve the task? We train two primitives on the 2D Bandits tasks and evaluate the relative frequency of activation of the primitives throughout the training. It is important that both the primitives remain active. If only 1 primitive is acting most of the time, its effect would be the same as training a ...
-
[46]
Can our proposed approach learn primitives that remain active when training the agent over a sequence of tasks?
-
[47]
Can our proposed approach be used to improve the sample efficiency of the agent over a sequence of tasks? To answer these questions, we consider two setups. In the baseline setup, we train a flat A2C policy on Fourrooms-v0 till it achieves a 100% success rate during evaluation. Then we transfer this policy to Fourrooms-v1 and continue to train till it achie...
-
[48]
All the models (proposed as well as the baselines) are implemented in PyTorch 1.1 [27] unless stated otherwise.
-
[49]
For Meta-Learning Shared Hierarchies [14] and Option-Critic [4], we adapted the authors’ implementations for our environments
-
[50]
During the evaluation, we use 10 processes in parallel to run 500 episodes and compute the percentage of times the agent solves the task within the prescribed time limit. This metric is referred to as the “success rate”
-
[51]
The default time limit is 500 steps for all the tasks unless specified otherwise
-
[52]
All the feedforward networks are initialized with the orthogonal initialization where the input tensor is filled with a (semi) orthogonal matrix
-
[53]
For all the embedding layers, the weights are initialized using the unit-Gaussian distribution
-
[54]
The weights and biases of all the GRU models are initialized from the uniform distribution U(−√k, √k), where k = 1/hidden_size
-
[55]
During training, we perform 64 rollouts in parallel to collect 5-step trajectories
-
[56]
Unlock the door and pick up the red ball
The βind and βreg parameters are both selected from the set {0.001, 0.005, 0.009} by performing validation. In section D.4.2, we explain all the components of the model architecture along with the implementation details in the context of the MiniGrid Environment. For the subsequent environments, we describe only those components and implementation detai...
-
[57]
Fetch: In the Fetch task, the agent spawns at an arbitrary position in an 8×8 grid (figure 9). It is provided with a natural language goal description of the form “go fetch a yellow box”. The agent has to navigate to the object being referred to in the goal description and pick it up
-
[58]
Unlock: In the Unlock task, the agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8×8 square (figure 10). It is provided with a natural language goal description of the form “open the door”. The agent has to find the key that ...
Figure 11: RGB view of the UnlockPickup environment.
Footnote 6: https://github.com/maximecb/gym-minigrid
-
[59]
open the door and pick up the yellow box
UnlockPickup: This task is basically a union of the Unlock and the Fetch tasks. The agent spawns at an arbitrary position in a two-room grid environment. Each room is an 8×8 square (figure 11). It is provided with a natural language goal description of the form “open the door and pick up the yellow box”. The agent has to find the key that corresponds to the ...
-
[60]
Two-layer feedforward network with the tanh non-linearity
-
[61]
Input: Concatenation of z and the current hidden state of the observation-rnn
-
[62]
The input sizes of the first and second layers of the policy network are 320 and 64, respectively
-
[63]
Produces a scalar output.
D.4 Components specific to the proposed model
The components that we described so far are used by both the baselines as well as our proposed model. We now describe the components that are specific to our proposed model. Our proposed model consists of an ensemble of primitives and the components we describe apply to each of those pr...
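The success-rate metric described in the implementation notes above (500 evaluation episodes, default time limit of 500 steps) can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `success_rate`, the convention of recording `None` for unsolved episodes, and the example numbers are all assumptions.

```python
def success_rate(episode_lengths, time_limit=500):
    """Percentage of episodes solved within `time_limit` steps.

    `episode_lengths` holds the step count at which each evaluation episode
    was solved, or None when the task was never solved.
    """
    if not episode_lengths:
        raise ValueError("need at least one evaluation episode")
    solved = sum(
        1 for steps in episode_lengths
        if steps is not None and steps <= time_limit
    )
    return 100.0 * solved / len(episode_lengths)

# Hypothetical outcome of 500 evaluation episodes: 400 solved in time,
# 75 never solved, 25 solved only after the time limit.
lengths = [120] * 400 + [None] * 75 + [650] * 25
rate = success_rate(lengths)
```

Episodes that finish after the limit count as failures here, which matches the "within the prescribed time limit" wording; how the original code treats such episodes is not stated in the text shown.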