Neural Policy Composition from Free Energy Minimization
Pith reviewed 2026-05-21 17:10 UTC · model grok-4.3
The pith
Policy composition arises from minimizing a variational free energy, producing a convergent gradient flow that a neural circuit can implement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Minimization of a suitably defined variational free energy over policy combinations induces a continuous-time gradient flow on the space of mixing weights; the trajectories of this flow converge, at an explicit rate, to the weights that realize the optimal composition of given primitives, and the flow itself is realized by a soft-competitive recurrent neural circuit with context-sensitive local interactions.
What carries the argument
The variational free energy functional whose gradient flow with respect to policy mixing weights yields both the convergence guarantee and the soft-competitive recurrent circuit.
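For intuition, one simple functional of this kind can be written down explicitly. The form below is an illustrative assumption, not the paper's definition: the per-primitive costs c_i and the entropy weight ε > 0 are hypothetical symbols, and w ranges over the simplex of mixing weights for K primitives.

```latex
F(w) \;=\; \sum_{i=1}^{K} w_i\, c_i \;+\; \varepsilon \sum_{i=1}^{K} w_i \log w_i,
\qquad w \in \Delta^{K-1},
\qquad
w_i^{\star} \;=\; \frac{\exp(-c_i/\varepsilon)}{\sum_{j=1}^{K} \exp(-c_j/\varepsilon)} .
```

Stationarity of the Lagrangian gives c_i + ε(log w_i + 1) + λ = 0 for every i, so the minimizer of this toy objective is a softmax of the negated costs. Whether the paper's functional reduces to this form is not established by the material on this page.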
If this is right
- The composition dynamics converges to the optimal mixing at an explicit, provable rate.
- The dynamics admits an exact implementation as a recurrent neural circuit without additional architectural constraints.
- The same objective reproduces key behavioral signatures across flocking, bandit decision-making, and layered control tasks.
- Gating rules emerge mechanistically from free-energy minimization rather than from prespecified design choices.
Where Pith is reading between the lines
- The framework supplies a candidate normative principle that could unify gating mechanisms across reinforcement learning and active inference models.
- Because the dynamics is continuous-time and local, it offers a natural starting point for analyzing how biological circuits might implement skill composition on short timescales.
- The explicit convergence rate could be used to predict how quickly an agent should switch between primitive policies when the context changes.
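As a rough sketch of how a rate could translate into a switching time: if the distance to the optimal weights is bounded by an exponential, error(t) ≤ error(0)·exp(−rate·t), then the time needed to reach a given tolerance follows by inverting the bound. The function name `settling_time` and the assumption of a pure exponential bound with unit constant are illustrative, not taken from the paper.

```python
import math

def settling_time(rate: float, initial_error: float, tolerance: float) -> float:
    """Time t at which initial_error * exp(-rate * t) first drops to tolerance.

    Assumes a pure exponential bound (hypothetical; the paper's bound may
    carry extra constants or hold in a different metric).
    """
    if rate <= 0 or initial_error <= 0 or tolerance <= 0:
        raise ValueError("rate, initial_error and tolerance must be positive")
    if tolerance >= initial_error:
        return 0.0  # already within tolerance
    return math.log(initial_error / tolerance) / rate

# Example: with rate 2.0, shrinking the error by a factor exp(4) takes 2 time units.
t_switch = settling_time(rate=2.0, initial_error=1.0, tolerance=math.exp(-4))
```

Under this reading, doubling the rate halves the predicted switching time, which is the kind of quantitative prediction the bullet above has in mind.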
Load-bearing premise
A variational free energy can be defined over policy combinations so that its minimization simultaneously guarantees convergence to an optimal composition and supplies a direct mechanistic neural implementation.
What would settle it
A numerical integration of the derived gradient flow on the flocking or bandit benchmark that fails to converge to the composition minimizing the free energy, or a circuit simulation whose activity patterns deviate from the predicted soft-competitive interactions.
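A minimal version of such a numerical test might look like the sketch below. It does not use the paper's functional or benchmarks. It assumes a toy free energy (linear cost plus ε-weighted negative entropy, with hypothetical costs `c`), integrates a replicator-style gradient flow with forward Euler, and compares the end point with the softmax minimizer of that toy objective.

```python
import numpy as np

def flow_rhs(w, c, eps):
    """Replicator-style gradient flow of F(w) = w.c + eps * sum(w log w)."""
    g = c + eps * (np.log(w) + 1.0)   # Euclidean gradient of the toy F
    return -w * (g - w @ g)           # sums to zero, so the simplex is preserved

def integrate(w0, c, eps, dt=1e-2, steps=5000):
    """Forward-Euler integration; renormalize to absorb rounding drift."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w + dt * flow_rhs(w, c, eps)
        w = np.clip(w, 1e-12, None)   # keep log(w) finite
        w /= w.sum()
    return w

def softmax_optimum(c, eps):
    """Closed-form minimizer of the toy free energy on the simplex."""
    z = -np.asarray(c, dtype=float) / eps
    z -= z.max()                      # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

c = np.array([1.0, 0.5, 2.0])         # hypothetical primitive costs
eps = 0.5                             # hypothetical entropy weight
w_final = integrate(np.ones(3) / 3, c, eps)
w_star = softmax_optimum(c, eps)
gap = float(np.max(np.abs(w_final - w_star)))
```

For this toy flow the log-ratios log(w_i / w_j) obey a linear equation with decay rate ε, so the gap should shrink exponentially. A failure of the analogous check on the paper's own flow would be the kind of evidence described above.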
Original abstract
The ability to flexibly compose previously acquired skills to execute intelligent behaviors is a hallmark of natural intelligence. Such compositional flexibility is often attributed to context-dependent gating mechanisms that determine how multiple policies or behavioral primitives are combined. Yet, despite remarkable efforts, the normative objective from which such gating rules should arise, and the neural computations capable of implementing them, remain unclear. Existing approaches typically rely on prespecified design choices for the gating rules, and remain tied to specific architectures, learning paradigms, or datasets. Here, we introduce a normative framework in which policy composition emerges from the minimization of a variational free energy, providing a principled and broadly applicable objective for gating. Based on this framework, we derive a continuous-time gradient flow whose trajectories are guaranteed to converge, with explicit rate, to the optimal composition of primitives. We further show that this dynamics admits a mechanistic neural implementation as a soft-competitive recurrent circuit with context-sensitive local interactions. We evaluate the model on emerging flocking behaviors in multi-agent systems, human decision-making in bandit tasks, and control benchmarks in layered architectures. Across these settings, the model provides interpretable mechanistic accounts of policy composition, reproduces key behavioral signatures, yields insights into data, and matches or outperforms established models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a normative framework in which policy composition emerges from minimization of a variational free energy functional over combinations of primitive policies. From this objective the authors derive a continuous-time gradient flow on the policy simplex whose trajectories converge to the optimal composition with an explicit rate; they further show that the dynamics admits a mechanistic implementation as a soft-competitive recurrent neural circuit with context-sensitive interactions. The framework is evaluated on multi-agent flocking, bandit decision-making, and layered control tasks, where it reproduces behavioral signatures and matches or exceeds baseline models.
Significance. If the claimed convergence guarantees and rate hold for general primitive policies without additional convexity or Lipschitz restrictions, the work would supply a principled, architecture-agnostic objective for gating that links free-energy minimization to both dynamical systems and neural implementation. The explicit rate, mechanistic circuit, and cross-domain evaluations would constitute a substantive contribution to normative modeling of compositional control.
major comments (2)
- [§3] §3 (Gradient-flow derivation): The abstract and framework claim an explicit convergence rate for the continuous-time dynamics, yet the manuscript does not state or verify the strong-convexity (or geodesic-convexity) condition on the free-energy functional over the policy simplex that would be required for a uniform rate independent of the choice of primitives. Without this, the rate may hold only for restricted classes of gating variables or primitive policies.
- [§4] §4 (Neural implementation): The mapping from the gradient flow to the soft-competitive recurrent circuit is presented as direct, but the derivation appears to introduce local interaction weights whose stability under the claimed dynamics is not shown to follow automatically from the free-energy objective; an explicit Lyapunov or contraction argument linking the circuit equations back to the variational functional is needed.
minor comments (2)
- [§2] Notation for the policy simplex and the variational free-energy functional should be introduced with a single consistent definition early in the paper rather than piecemeal across sections.
- [Figure 2] Figure 2 (neural circuit diagram) would benefit from explicit labeling of which variables correspond to the gating weights versus the primitive policy outputs.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below, indicating revisions that will be incorporated to clarify the conditions and strengthen the arguments.
- Referee: [§3] §3 (Gradient-flow derivation): The abstract and framework claim an explicit convergence rate for the continuous-time dynamics, yet the manuscript does not state or verify the strong-convexity (or geodesic-convexity) condition on the free-energy functional over the policy simplex that would be required for a uniform rate independent of the choice of primitives. Without this, the rate may hold only for restricted classes of gating variables or primitive policies.
Authors: We thank the referee for this observation. The explicit convergence rate in Section 3 is derived under the assumption that the variational free-energy functional is strongly convex with respect to the Fisher-Rao metric on the policy simplex. This property holds when the primitive policies satisfy suitable regularity conditions, such as bounded second derivatives or sufficient separation in the policy space. While the evaluated tasks satisfy these conditions, we agree that the assumption should be stated explicitly. In the revision we will update the statement of the main theorem to include the strong-convexity requirement and add a short discussion of sufficient conditions on the primitives, together with a verification for the bandit and layered-control examples.
Revision: yes
- Referee: [§4] §4 (Neural implementation): The mapping from the gradient flow to the soft-competitive recurrent circuit is presented as direct, but the derivation appears to introduce local interaction weights whose stability under the claimed dynamics is not shown to follow automatically from the free-energy objective; an explicit Lyapunov or contraction argument linking the circuit equations back to the variational functional is needed.
Authors: We agree that an explicit stability argument would improve the presentation. The soft-competitive circuit is obtained by rewriting the continuous-time gradient flow in terms of local, context-dependent interactions that arise directly from the variational derivatives. To make the link rigorous, we will add a new proposition in Section 4 that constructs a Lyapunov function given by the free-energy functional itself. We will show that the time derivative of this function along the circuit trajectories is non-positive, thereby establishing that the circuit dynamics inherit the convergence guarantees of the original variational objective. The argument and its proof will be included in the revised manuscript.
Revision: yes
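The Lyapunov property described in this response can be probed numerically before it is proved. The sketch below is a stand-in under stated assumptions, not the authors' circuit. It simulates a hypothetical leaky soft-competitive unit, τ·dw/dt = −w + softmax(−c/ε), with costs produced by a hypothetical context-to-cost matrix `cost_map`, and records a toy free energy (linear cost plus ε-weighted negative entropy) along the trajectory.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

def free_energy(w, c, eps):
    """Toy free energy: expected cost plus eps-weighted negative entropy."""
    return float(w @ c + eps * np.sum(w * np.log(w)))

def simulate_circuit(w0, c, eps, tau=1.0, dt=0.01, steps=2000):
    """Euler simulation of tau * dw/dt = -w + softmax(-c / eps)."""
    w = np.array(w0, dtype=float)
    target = softmax(-c / eps)        # divisive (soft) competition among units
    energies = [free_energy(w, c, eps)]
    for _ in range(steps):
        w = w + (dt / tau) * (-w + target)
        energies.append(free_energy(w, c, eps))
    return w, np.array(energies)

# Hypothetical context-sensitive costs: rows are primitives, columns are context features.
cost_map = np.array([[1.0, 0.2], [0.3, 0.9], [0.6, 0.6]])
context = np.array([1.0, 0.0])
c = cost_map @ context
eps = 0.3

w_end, energies = simulate_circuit(np.array([0.1, 0.2, 0.7]), c, eps)
max_increase = float(np.max(np.diff(energies)))  # should not be meaningfully positive
```

In this toy case monotonicity is guaranteed: each Euler step is a convex combination of the current state and the minimizer, and the toy free energy is convex. The referee's concern is precisely that the analogous property for the paper's circuit, with its local interaction weights, needs its own argument.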
Circularity Check
The derivation is self-contained: it starts from an external variational free-energy principle and does not reduce to fitted inputs or self-citations.
The paper presents policy composition as emerging directly from minimization of a variational free energy functional, followed by derivation of a continuous-time gradient flow with stated convergence guarantees. No equations or steps in the provided abstract or framework description reduce the claimed results to a self-referential definition, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The free-energy objective is invoked as an external normative principle rather than constructed from the target gating dynamics, and the neural implementation is presented as a consequence rather than an input. This satisfies the criteria for a self-contained derivation against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A variational free energy functional can be defined over combinations of existing policies such that its minimization produces the optimal composition.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: dAlembert_to_ODE_general_theorem and washburn_uniqueness_aczel
Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: GateFrame casts gating as minimization of statistical complexity (KL) minus ε-entropy subject to mixture-of-primitives constraint; this is shown equivalent to expected free energy when q ∝ q̃ exp(−c).
- IndisputableMonolith/Foundation/ArrowOfTime.lean: forward_accumulates and z_monotone_absolute
Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: GateFlow is the continuous-time proximal-gradient dynamics whose equilibrium is the GateFrame optimum and whose trajectories converge globally and exponentially by contractivity of the Jacobian.
- IndisputableMonolith/Foundation/BranchSelection.lean: RCLCombiner_isCoupling_iff and branch_selection
Tag: refines. Relation between the paper passage and the cited Recognition theorem.
Passage: The softmax gating rule emerges exactly from the proximal operator of the entropic barrier; as ε→0 the rule recovers argmax (sparse) or Gumbel-softmax (dense) gating.
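The ε→0 limit quoted in the last passage can be illustrated with a plain temperature-scaled softmax. This is a generic sketch, not the paper's proximal operator: `soft_gate` and the example `scores` are hypothetical, and scores are treated as "higher is better" so that the limit is an argmax. The Gumbel-softmax case is not covered here.

```python
import numpy as np

def soft_gate(scores, eps):
    """Temperature-scaled softmax over primitive scores (higher is better)."""
    z = np.asarray(scores, dtype=float) / eps
    z = z - z.max()                   # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

scores = np.array([0.2, 1.0, 0.7])    # hypothetical per-primitive scores

# Gating weights sharpen as the entropy weight eps shrinks.
for eps in (1.0, 0.1, 0.01):
    print(eps, np.round(soft_gate(scores, eps), 4))

# One-hot argmax gate, the expected limit as eps -> 0.
hard_gate = np.zeros_like(scores)
hard_gate[np.argmax(scores)] = 1.0
```

At large ε the weights stay spread across primitives; at small ε nearly all mass sits on the best-scoring primitive, matching the sparse-gating limit described above.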
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization: The paper reviews and extends energy-based dynamical models that use gradient flows and energy landscapes for neurocomputation, learning, and optimization tasks.