Neural Policy Composition from Free Energy Minimization

Francesca Rossi; Francesco Bullo; Giovanni Russo; Veronica Centorrino

arXiv: 2512.04745 · v3 · pith:JH5ACUV3new · submitted 2025-12-04 · 🧮 math.OC · cs.AI · cs.SY · eess.SY · nlin.AO

Pith reviewed 2026-05-21 17:10 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.SY · eess.SY · nlin.AO

keywords policy composition · variational free energy · gradient flow · neural gating · recurrent circuits · multi-agent flocking · bandit tasks · optimal control

The pith

Policy composition arises from minimizing a variational free energy, producing a convergent gradient flow that a neural circuit can implement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that flexible composition of behavioral primitives or policies follows directly from minimizing a variational free energy over their combinations. This supplies a single, architecture-independent objective that replaces hand-designed gating rules. From the free energy the authors derive a continuous-time gradient flow whose solutions converge at a known rate to the optimal mixing weights. The same flow admits an exact realization as a soft-competitive recurrent circuit whose connections depend on context. The resulting model accounts for observed patterns in multi-agent flocking, human bandit choices, and layered control tasks.

Core claim

Minimization of a suitably defined variational free energy over policy combinations induces a continuous-time gradient flow on the space of mixing weights; the trajectories of this flow converge, at an explicit rate, to the weights that realize the optimal composition of given primitives, and the flow itself is realized by a soft-competitive recurrent neural circuit with context-sensitive local interactions.

What carries the argument

The variational free energy functional whose gradient flow with respect to policy mixing weights yields both the convergence guarantee and the soft-competitive recurrent circuit.

If this is right

The composition dynamics converges to the optimal mixing at an explicit, provable rate.
The dynamics admits an exact implementation as a recurrent neural circuit without additional architectural constraints.
The same objective reproduces key behavioral signatures across flocking, bandit decision-making, and layered control tasks.
Gating rules emerge mechanistically from free-energy minimization rather than from prespecified design choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The framework supplies a candidate normative principle that could unify gating mechanisms across reinforcement learning and active inference models.
Because the dynamics is continuous-time and local, it offers a natural starting point for analyzing how biological circuits might implement skill composition on short timescales.
The explicit convergence rate could be used to predict how quickly an agent should switch between primitive policies when the context changes.

Load-bearing premise

A variational free energy can be defined over policy combinations so that its minimization simultaneously guarantees convergence to an optimal composition and supplies a direct mechanistic neural implementation.

What would settle it

A numerical integration of the derived gradient flow on the flocking or bandit benchmark that fails to converge to the composition minimizing the free energy, or a circuit simulation whose activity patterns deviate from the predicted soft-competitive interactions.
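A check of this kind can be prototyped on a toy problem before touching the flocking or bandit benchmarks. The sketch below is a minimal stand-in, not the paper's GateFlow. It assumes a free energy of the form F(w) = KL(Σ_i w_i π_i ‖ q) + ε Σ_i w_i log w_i on the simplex, and it uses a softmax best-response flow, dw/dt = −w + softmax(−∇f(w)/ε), chosen here for illustration because its equilibria satisfy the optimality conditions of that objective. The primitives `P`, the target `q`, `eps`, the step size, and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl_mixture_grad(w, P, q):
    """Gradient with respect to w of f(w) = KL(w @ P || q)."""
    mix = w @ P
    return P @ (np.log(mix / q) + 1.0)

def free_energy(w, P, q, eps):
    """Toy free energy (assumed form): KL(mixture || target) minus eps * entropy of w."""
    mix = w @ P
    return np.sum(mix * np.log(mix / q)) + eps * np.sum(w * np.log(w))

def integrate_flow(w0, P, q, eps, dt=0.05, steps=4000):
    """Forward-Euler integration of dw/dt = -w + softmax(-grad f(w) / eps)."""
    w = w0.copy()
    for _ in range(steps):
        target = softmax(-kl_mixture_grad(w, P, q) / eps)
        w = w + dt * (target - w)  # convex combination, so w stays in the simplex
    return w

# Hypothetical primitives (one per row) over three actions, and a target distribution.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
q = np.array([0.3, 0.4, 0.3])
eps = 0.5
w0 = np.full(3, 1.0 / 3.0)

w_final = integrate_flow(w0, P, q, eps)
# Fixed-point residual: zero exactly when w satisfies the optimality condition.
residual = np.linalg.norm(w_final - softmax(-kl_mixture_grad(w_final, P, q) / eps))
print("weights:", w_final)
print("fixed-point residual:", residual)
print("free energy:", free_energy(w_final, P, q, eps))
```

A falsifying outcome in this toy setting would be a residual that fails to shrink, or a final free energy above the starting value; the analogous test on the paper's own dynamics would require its actual flow equations.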

Figures

Figures reproduced from arXiv: 2512.04745 by Francesca Rossi, Francesco Bullo, Giovanni Russo, Veronica Centorrino.

**Figure 1.** GateMod set-up. (A) At time step k − 1, an agent (e.g., a boid in a flock, a person in a multi-armed bandit task, or an autonomous agent) receives the state x_{k−1} from the environment and determines action u_k. Both x_{k−1} and u_k are realizations of random variables, X_{k−1} and U_k. We denote random variables with upper-case letters and their realizations with lower-case letters. Bold means that the variable is,…

**Figure 2.** GateMod. (A) GateFrame normative framework. At each time step, the agent computes optimal policy weights w⋆_k by solving an entropy-regularized optimization problem that minimizes a trade-off between statistical complexity and entropy. The constraints formalize the fact that the resulting policy is a linear, and hence convex, combination of primitives. The optimal weights correspond to the equilibrium of Ga…

**Figure 3.** (A) A boid in a flock of N boids. Position and velocity components form the 4-dimensional state x^i_k; u^i_k is the acceleration vector. We use the superscript to denote that states/actions are those of the i-th boid in the flock. The acceleration is built upon the social forces and a boid can only use information from boids within its field of view. The field angle, α, is set to 320° in the experiments. The ra…

**Figure 4.** (A) Comparison between the Hybrid model from [36] and GateMod in terms of PXP. Higher PXP for a given model suggests that the model provides better explanations for the data. Formally, PXP quantifies the probability that each considered model is the most frequent process that generated the data. To obtain the PXP, we start from GateMod optimal policy. The policy at each trial is used to compute the Bayesian Infor…

Original abstract

The ability to flexibly compose previously acquired skills to execute intelligent behaviors is a hallmark of natural intelligence. Such compositional flexibility is often attributed to context-dependent gating mechanisms that determine how multiple policies or behavioral primitives are combined. Yet, despite remarkable efforts, the normative objective from which such gating rules should arise, and the neural computations capable of implementing them, remain unclear. Existing approaches typically rely on prespecified design choices for the gating rules, and remain tied to specific architectures, learning paradigms, or datasets. Here, we introduce a normative framework in which policy composition emerges from the minimization of a variational free energy, providing a principled and broadly applicable objective for gating. Based on this framework, we derive a continuous-time gradient flow whose trajectories are guaranteed to converge, with explicit rate, to the optimal composition of primitives. We further show that this dynamics admits a mechanistic neural implementation as a soft-competitive recurrent circuit with context-sensitive local interactions. We evaluate the model on emerging flocking behaviors in multi-agent systems, human decision-making in bandit tasks, and control benchmarks in layered architectures. Across these settings, the model provides interpretable mechanistic accounts of policy composition, reproduces key behavioral signatures, yields insights into data, and matches or outperforms established models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives a continuous-time gradient flow for policy composition from variational free energy minimization and realizes it as a soft-competitive neural circuit, but the explicit convergence rate likely requires unverified convexity conditions on the free energy.

The main thing to know is that this work frames policy composition as the outcome of minimizing a variational free energy over combinations of primitive policies. From that objective it derives a continuous-time gradient flow claimed to converge at an explicit rate, then shows the dynamics can be implemented directly as a recurrent neural circuit with context-sensitive local interactions and soft competition. The evaluations on multi-agent flocking, bandit decision tasks, and layered control benchmarks are used to argue that the model reproduces behavioral signatures and performs at least as well as existing approaches while offering a mechanistic account. That synthesis of a normative objective with a neural realization is the clearest new element relative to the cited active-inference and control literature. The framework avoids some of the usual hand-specified gating rules, which is a practical plus for readers working on compositional control.

The soft spot is the convergence claim. The abstract states trajectories converge with an explicit rate, yet the stress-test note correctly flags that this probably depends on strong convexity or geodesic convexity of the free energy over the policy simplex. Nothing in the provided description confirms those conditions hold for general primitives rather than restricted cases, so the rate may not be as universal as stated. Without the full equations and Lyapunov analysis it is hard to judge how much extra structure is smuggled in.

This paper is for people in control theory, robotics, and computational neuroscience who want a single free-energy principle for gating instead of architecture-specific mechanisms. Readers interested in bridging variational methods to neural implementations would get the most out of the mechanistic circuit and the cross-domain evaluations. It shows coherent thinking and concrete applications, so it deserves a serious referee.

I would send it for peer review and ask specifically for the conditions that guarantee the stated rate and for any counter-examples when primitives violate convexity.

Referee Report

2 major / 2 minor

Summary. The paper introduces a normative framework in which policy composition emerges from minimization of a variational free energy functional over combinations of primitive policies. From this objective the authors derive a continuous-time gradient flow on the policy simplex whose trajectories converge to the optimal composition with an explicit rate; they further show that the dynamics admits a mechanistic implementation as a soft-competitive recurrent neural circuit with context-sensitive interactions. The framework is evaluated on multi-agent flocking, bandit decision-making, and layered control tasks, where it reproduces behavioral signatures and matches or exceeds baseline models.

Significance. If the claimed convergence guarantees and rate hold for general primitive policies without additional convexity or Lipschitz restrictions, the work would supply a principled, architecture-agnostic objective for gating that links free-energy minimization to both dynamical systems and neural implementation. The explicit rate, mechanistic circuit, and cross-domain evaluations would constitute a substantive contribution to normative modeling of compositional control.

major comments (2)

[§3] §3 (Gradient-flow derivation): The abstract and framework claim an explicit convergence rate for the continuous-time dynamics, yet the manuscript does not state or verify the strong-convexity (or geodesic-convexity) condition on the free-energy functional over the policy simplex that would be required for a uniform rate independent of the choice of primitives. Without this, the rate may hold only for restricted classes of gating variables or primitive policies.
[§4] §4 (Neural implementation): The mapping from the gradient flow to the soft-competitive recurrent circuit is presented as direct, but the derivation appears to introduce local interaction weights whose stability under the claimed dynamics is not shown to follow automatically from the free-energy objective; an explicit Lyapunov or contraction argument linking the circuit equations back to the variational functional is needed.
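The first major comment rests on a standard fact that fits in two lines. As a sketch under simplifying assumptions (an unconstrained Euclidean gradient flow $\dot w = -\nabla F(w)$ with $F$ differentiable and $\mu$-strongly convex, and $w^\star$ its minimizer; the paper's flow lives on the policy simplex and may use a different metric), strong convexity is what turns convergence into an explicit exponential rate:

```latex
\begin{aligned}
\frac{d}{dt}\,\tfrac12\,\|w(t)-w^\star\|^2
  &= -\big\langle \nabla F(w(t)) - \nabla F(w^\star),\; w(t)-w^\star \big\rangle
   \;\le\; -\mu\,\|w(t)-w^\star\|^2, \\
\|w(t)-w^\star\| &\le e^{-\mu t}\,\|w(0)-w^\star\| .
\end{aligned}
```

The first line uses $\nabla F(w^\star) = 0$ and the strong-monotonicity of the gradient; the second follows by Grönwall's inequality. If no uniform $\mu > 0$ is available across admissible primitives, the same argument yields a rate that depends on the primitives, which is the gap the comment points to.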

minor comments (2)

[§2] Notation for the policy simplex and the variational free-energy functional should be introduced with a single consistent definition early in the paper rather than piecemeal across sections.
[Figure 2] Figure 2 (neural circuit diagram) would benefit from explicit labeling of which variables correspond to the gating weights versus the primitive policy outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below, indicating revisions that will be incorporated to clarify the conditions and strengthen the arguments.

Referee: [§3] §3 (Gradient-flow derivation): The abstract and framework claim an explicit convergence rate for the continuous-time dynamics, yet the manuscript does not state or verify the strong-convexity (or geodesic-convexity) condition on the free-energy functional over the policy simplex that would be required for a uniform rate independent of the choice of primitives. Without this, the rate may hold only for restricted classes of gating variables or primitive policies.

Authors: We thank the referee for this observation. The explicit convergence rate in Section 3 is derived under the assumption that the variational free-energy functional is strongly convex with respect to the Fisher-Rao metric on the policy simplex. This property holds when the primitive policies satisfy suitable regularity conditions, such as bounded second derivatives or sufficient separation in the policy space. While the evaluated tasks satisfy these conditions, we agree that the assumption should be stated explicitly. In the revision we will update the statement of the main theorem to include the strong-convexity requirement and add a short discussion of sufficient conditions on the primitives, together with a verification for the bandit and layered-control examples.

revision: yes

Referee: [§4] §4 (Neural implementation): The mapping from the gradient flow to the soft-competitive recurrent circuit is presented as direct, but the derivation appears to introduce local interaction weights whose stability under the claimed dynamics is not shown to follow automatically from the free-energy objective; an explicit Lyapunov or contraction argument linking the circuit equations back to the variational functional is needed.

Authors: We agree that an explicit stability argument would improve the presentation. The soft-competitive circuit is obtained by rewriting the continuous-time gradient flow in terms of local, context-dependent interactions that arise directly from the variational derivatives. To make the link rigorous, we will add a new proposition in Section 4 that constructs a Lyapunov function given by the free-energy functional itself. We will show that the time derivative of this function along the circuit trajectories is non-positive, thereby establishing that the circuit dynamics inherit the convergence guarantees of the original variational objective. The argument and its proof will be included in the revised manuscript.

revision: yes
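The Lyapunov argument promised in this response follows a familiar template. As an illustration only, assuming a plain Euclidean gradient flow $\dot w = -\nabla F(w)$ rather than the circuit equations of the paper, the free energy is non-increasing along trajectories:

```latex
\frac{d}{dt}\,F(w(t))
  = \big\langle \nabla F(w(t)),\, \dot w(t) \big\rangle
  = -\,\|\nabla F(w(t))\|^2 \;\le\; 0 .
```

Non-positivity alone shows only that $F$ does not increase. Concluding convergence to the minimizer additionally requires an invariance argument (for example LaSalle's principle) or strict decrease away from equilibrium, and an explicit rate requires a quantitative bound such as $\|\nabla F(w)\|^2 \ge 2\mu\,\big(F(w) - F^\star\big)$, which gives $F(w(t)) - F^\star \le e^{-2\mu t}\,\big(F(w(0)) - F^\star\big)$.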

Circularity Check

0 steps flagged

Derivation self-contained from external variational free energy principle with no reduction to fitted inputs or self-citations

The paper presents policy composition as emerging directly from minimization of a variational free energy functional, followed by derivation of a continuous-time gradient flow with stated convergence guarantees. No equations or steps in the provided abstract or framework description reduce the claimed results to a self-referential definition, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The free-energy objective is invoked as an external normative principle rather than constructed from the target gating dynamics, and the neural implementation is presented as a consequence rather than an input. This satisfies the criteria for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The central claim rests on the existence of a variational free energy that can be defined over policy spaces and whose minimization yields both optimal composition and a realizable neural circuit.

axioms (1)

domain assumption: A variational free energy functional can be defined over combinations of existing policies such that its minimization produces the optimal composition.
Stated in the abstract as the normative basis for the entire framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · dAlembert_to_ODE_general_theorem and washburn_uniqueness_aczel · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
GateFrame casts gating as minimization of statistical complexity (KL) minus ε-entropy subject to mixture-of-primitives constraint; this is shown equivalent to expected free energy when q ∝ q̃ exp(−c).

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates and z_monotone_absolute · echoes
GateFlow is the continuous-time proximal-gradient dynamics whose equilibrium is the GateFrame optimum and whose trajectories converge globally and exponentially by contractivity of the Jacobian.

IndisputableMonolith/Foundation/BranchSelection.lean · RCLCombiner_isCoupling_iff and branch_selection · refines
refines: relation between the paper passage and the cited Recognition theorem.
The softmax gating rule emerges exactly from the proximal operator of the entropic barrier; as ε→0 the rule recovers argmax (sparse) or Gumbel-softmax (dense) gating.
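The link between entropy regularization and softmax gating can be checked numerically on a toy case. The sketch below assumes a linear cost vector `c` with made-up values and the objective ⟨c, w⟩ + ε Σ_i w_i log w_i over the simplex, whose minimizer is softmax(−c/ε). It is not the paper's proximal operator, only the simplest instance of the same mechanism, and the function names are hypothetical. Shrinking ε shows the weights collapsing onto the lowest-cost primitive, which is the argmax-type limit described in the passage.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def entropic_objective(w, c, eps):
    """<c, w> + eps * sum_i w_i log w_i, with the convention 0 * log 0 = 0.

    Accepts a single weight vector or a batch of them (one per row).
    """
    safe_w = np.clip(w, 1e-300, None)  # avoids log(0); the product with w is still 0
    return w @ c + eps * np.sum(w * np.log(safe_w), axis=-1)

c = np.array([1.0, 0.2, 0.7])  # hypothetical per-primitive costs
eps = 0.5
w_star = softmax(-c / eps)

# Compare the softmax weights against random points on the simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(3), size=1000)
gap = entropic_objective(samples, c, eps) - entropic_objective(w_star, c, eps)
print("smallest objective gap over random simplex points:", gap.min())

# As eps shrinks, the weights concentrate on the lowest-cost primitive.
for eps_k in (1.0, 0.1, 0.001):
    print(eps_k, np.round(softmax(-c / eps_k), 4))
```

The gap stays non-negative because the objective is strictly convex on the simplex and softmax(−c/ε) satisfies its first-order optimality condition.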

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization
cs.LG · 2026-04 · unverdicted · novelty 3.0

The paper reviews and extends energy-based dynamical models that use gradient flows and energy landscapes for neurocomputation, learning, and optimization tasks.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] B. Abbas and H. Attouch. Dynamical systems and forward-backward algorithms associated with the sum of a convex subdifferential and a monotone cocoercive operator. Optimization, 64(10):2223–2252, 2014.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2 edition, 2017.
[3] A. Beck. First-Order Methods in Optimization. SIAM, 2017.
[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[5] F. Bullo. Contraction Theory for Dynamical Systems. Kindle Direct Publishing, 1.2 edition, 2024.
[6] F. Bullo, P. Cisneros-Velarde, A. Davydov, and S. Jafarpour. From contraction theory to fixed point algorithms on Riemannian and non-Euclidean spaces. In IEEE Conf. on Decision and Control, December 2021.
[7] V. Centorrino, A. Gokhale, A. Davydov, G. Russo, and F. Bullo. Positive competitive networks for sparse reconstruction. Neural Computation, 36(6):1163–1197, 2024.
[8] P. L. Combettes and J.-C. Pesquet. Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis, 28(3):491–518, 2020.
[9] R. Cominetti, E. Melo, and S. Sorin. A payoff-based learning procedure and its application to traffic games. Games and Economic Behavior, 70(1):71–83, September 2010.
[10] P. Coucheney, B. Gaujal, and P. Mertikopoulos. Penalty-regulated dynamics and robust learning procedures in games. Mathematics of Operations Research, 40(3):611–633, August 2015.
[11] I. D. Couzin, J. Krause, R. James, G. D. Ruxton, and N. R. Franks. Collective Memory and Spatial Sorting in Animal Groups. Journal of Theoretical Biology, 218(1):1–11, 2002.
[12] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, USA, 2006.
[13] F. Cucker and S. Smale. Emergent behavior in flocks. IEEE Transactions on Automatic Control, 52(5):852–862, 2007.
[14] A. Davydov, V. Centorrino, A. Gokhale, G. Russo, and F. Bullo. Time-varying convex optimization: A contraction and equilibrium tracking approach. IEEE Transactions on Automatic Control, 70(11):7446–7460, 2025.
[15] A. Davydov, S. Jafarpour, and F. Bullo. Non-Euclidean contraction theory for robust nonlinear stability. IEEE Transactions on Automatic Control, 67(12):6667–6681, 2022.
[16] A. Davydov, A. V. Proskurnikov, and F. Bullo. Non-Euclidean contraction analysis of continuous-time neural networks. IEEE Transactions on Automatic Control, 70(1):235–250, 2025.
[17] I. M. Elfadel and J. L. Wyatt Jr. The "softmax" nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. Advances in Neural Information Processing Systems, 6, 1993.
[18] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, March 1956.
[19] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.
[20] E. Garrabé and G. Russo. Probabilistic design of optimal sequential decision-making algorithms in learning and control. Annual Reviews in Control, 54:81–102, 2022.
[21] S. J. Gershman. Deconstructing the human algorithms for exploration. Cognition, 173:34–42, 2018.
[22] A. Gokhale, A. Davydov, and F. Bullo. Proximal gradient dynamics: Monotonicity, exponential convergence, and applications. IEEE Control Systems Letters, 8:2853–2858, 2024.
[23] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[24] P. Guan, M. Raginsky, and R. M. Willett. Online Markov decision processes with Kullback–Leibler control cost. IEEE Transactions on Automatic Control, 59(6):1423–1438, June 2014.
[25] S. Hassan-Moghaddam and M. R. Jovanović. Proximal gradient flow and Douglas-Rachford splitting dynamics: Global exponential stability via integral quadratic constraints. Automatica, 123:109311, 2021.
[26] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumder, L. Hong, and E. Chi. DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. In Advances in Neural Information Processing Systems, volume 34, pages 29335–29347, 2021.
[27] C. Heins, B. Millidge, L. Da Costa, R. P. Mann, K. J. Friston, and I. D. Couzin. Collective behavior from surprise minimization. Proceedings of the National Academy of Sciences, 121(17):e2320239121, 2024.
[28] C. K. Hemelrijk and H. Hildenbrandt. Self-organized shape and frontal density of fish schools. Ethology, 114(3):245–254, 2008.
[29] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
[30] L. Kozachkov, K. V. Kastanenka, and D. Krotov. Building transformers from neurons and astrocytes. Proceedings of the National Academy of Sciences, 120(34), 2023.
[31] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[32] D. S. Leslie and E. J. Collins. Individual Q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514, January 2005.
[33] H. Levine, W. J. Rappel, and I. Cohen. Self-organization in systems of self-propelled particles. Phys. Rev. E, 63:017101, Dec 2000.
[34] H. Ling, G. E. McIvor, J. Westley, K. van der Vaart, R. T. Vaughan, A. Thornton, and N. T. Ouellette. Behavioural plasticity and the transition to order in jackdaw flocks. Nature Communications, 10(1):5174, 2019.
[35] W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, 1998.
[36] R. Lukeman, Y. Li, and L. Edelstein-Keshet. Inferring individual rules from collective behavior. Proceedings of the National Academy of Sciences, 107(28):12576–12580, June 2010.
[37] R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38, July 1995.
[38] P. Mertikopoulos and W. H. Sandholm. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297–1324, November 2016.
[39] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937. PMLR, Jun 2016.
[40] K. P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
[41] M. Nagumo. Über die Lage der Integralkurven gewöhnlicher Differentialgleichungen. Proceedings of the Physico-Mathematical Society of Japan. 3rd Series, 24:551–559, 1942.
[42] R. Olfati-Saber. Flocking for multi-agent dynamic systems: Algorithms and theory. IEEE Transactions on Automatic Control, 51(3):401–420, 2006.
[43] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[44] J. Peters, K. Mulling, and Y. Altun. Relative entropy policy search. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1):1607–1612, July 2010.
[45] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[46] A. M. Reynolds, G. E. McIvor, A. Thornton, P. Yang, and N. T. Ouellette. Stochastic modelling of bird flocks: accounting for the cohesiveness of collective motion. Journal of the Royal Society Interface, 19(189):20210745, 2022.
[47] C. W. Reynolds. Flocks, herds and schools: A distributed behavioral model. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, pages 25–34, 1987.
[48] C. W. Reynolds. Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, 21(4):25–34, 1987.
[49] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[50] G. Russo, M. Di Bernardo, and E. D. Sontag. Global entrainment of transcriptional systems to periodic inputs. PLoS Computational Biology, 6(4):e1000739, 2010.
[51] G. Russo, M. Di Bernardo, and E. D. Sontag. A contraction approach to the hierarchical analysis and design of networked systems. IEEE Transactions on Automatic Control, 58(5):1328–1331, 2013.
[52] W. H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.
[53] A. Shafiei, H. Jesawada, K. Friston, and G. Russo. Distributionally robust free energy principle for decision-making. In Nature Communications, 2025.
[54] M. Snow and J. Orchard. Biological softmax: Demonstrated in modern Hopfield networks. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 44, 2022.
[55] E. D. Sontag. Contractive systems with inputs. In J. C. Willems, S. Hara, Y. Ohta, and H. Fujioka, editors, Perspectives in Mathematical System Theory, Control, and Signal Processing, pages 217–228. Springer, 2010.
[56] D. J. T. Sumpter. The principles of collective animal behaviour. Philosophical Transactions of The Royal Society B: Biological Sciences, 361(1465):5–22, 2006.
[57] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[58] K. Tunstrøm, Y. Katz, C. C. Ioannou, C. Huepe, M. J. Lutz, and I. D. Couzin. Collective States, Multistability and Transitional Behavior in Schooling Fish. PLOS Computational Biology, 9(2):1–11, February 2013.
[59] A. Ullah. Entropy, divergence and distance measures with econometric applications. Journal of Statistical Planning and Inference, 49(1):137–162, 1996.
[60] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet. Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75(6-7):1226–1229, 1995.
[61] S. Xie, G. Russo, and R. H. Middleton. Scalability in nonlinear network systems affected by delays and disturbances. IEEE Transactions on Control of Network Systems, 8(3):1128–1138, 2021.
[62] A. L. Yuille and D. Geiger. Winner-take-all mechanisms. In The Handbook of Brain Theory and Neural Networks, 1995.


work page 2022

[16] [16]

Davydov, A

A. Davydov, A. V. Proskurnikov, and F. Bullo. Non-Euclidean contracti on analysis of continuous-time neural networks. IEEE Transactions on Automatic Control , 70(1):235–250, 2025. 30

work page 2025

[17] [17]

I. M. Elfadel and J. L. Wyatt Jr. The” softmax” nonlinearity: Derivat ion using statistical mechanics and useful properties as a multiterminal analog circuit element. Advances in neural information processing systems, 6, 1993

work page 1993

[18] [18]

Frank and P

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, March 1956

work page 1956

[19] [19]

On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning

B. Gao and L. Pavel. On the properties of the softmax function with app lication in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Garrab´ e and Giovanni Russo

E. Garrab´ e and Giovanni Russo. Probabilistic design of optimal seque ntial decision-making algorithms in learning and control. Annual Reviews in Control , 54:81–102, 2022

work page 2022

[21] [21]

S. J. Gershman. Deconstructing the human algorithms for exploration . Cognition, 173:34–42, 2018

work page 2018

[22] [22]

Gokhale, A

A. Gokhale, A. Davydov, and F. Bullo. Proximal gradient dynamics: Monot onicity, exponential convergence, and applications. IEEE Control Systems Letters , 8:2853–2858, 2024

work page 2024

[23] [23]

Goodfellow, Y

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016

work page 2016

[24] [24]

P. Guan, M. Raginsky, and R. M. Willett. Online markov decision pr ocesses with kullback–leibler control cost. IEEE Transactions on Automatic Control , 59(6):1423–1438, June 2014

work page 2014

[25] [25]

Hassan-Moghaddam and M

S. Hassan-Moghaddam and M. R. Jovanovi´ c. Proximal gradient ﬂow and Douglas -Rachford splitting dynamics: Global exponential stability via integral quadratic constrai nts. Automatica, 123:109311, 2021

work page 2021

[26] [26]

Hazimeh, Z

H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumd er, L. Hong, and E. Chi. DSelect-k: Diﬀerentiable Selection in the Mixture of Ex perts with Applications to Multi-Task Learning. In Advances in Neural Information Processing Systems , volume 34, pages 29335–29347, 2021

work page 2021

[27] [27]

Heins, B

C. Heins, B. Millidge, L. Da Costa, R. P. Mann, K. J. Friston, and I. D. Couzin. Collective behavior from surprise minimization. Proceedings of the National Academy of Sciences , 121(17):e2320239121, 2024

work page 2024

[28] [28]

C. K. Hemelrijk and H. Hildenbrandt. Self-organized shape and frontal d ensity of ﬁsh schools. Ethology, 114(3):245–254, 2008

work page 2008

[29] [29]

E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbe l-Softmax. In International Conference on Learning Representations , 2017

work page 2017

[30] [30]

Kozachkov, K

L. Kozachkov, K. V. Kastanenka, and D. Krotov. Building transformers f rom neurons and astrocytes. Proceedings of the National Academy of Sciences , 120(34), 2023

work page 2023

[31] [31]

Kullback and R

S. Kullback and R. A. Leibler. On information and suﬃciency. Annals of Mathematical Statistics , 22:79–86, 1951

work page 1951

[32] [32]

D. S. Leslie and E. J. Collins. Individual q-learning in normal form games. SIAM Journal on Control and Optimization , 44(2):495–514, January 2005. 31

work page 2005

[33] [33]

Levine, W.J

H. Levine, W.J. Rappel, and I. Cohen. Self-organization in systems of s elf-propelled particles. Phys. Rev. E , 63:017101, Dec 2000

work page 2000

[34] [34]

H. Ling, G. E. Mclvor, J. Westley, K. van der Vaart, R. T. Vaughan, A. Thorn ton, and N. T. Ouel- lette. Behavioural plasticity and the transition to order in jackdaw ﬂocks. Nature Communications, 10(1):5174, 2019

work page 2019

[35] [35]

Lohmiller and J.-J

W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-lin ear systems. Automatica, 34(6):683–696, 1998

work page 1998

[36] [36]

Lukeman, Y

R. Lukeman, Y. Li, and L. Edelstein-Keshet. Inferring individual rules from collective behavior. Proceedings of the National Academy of Sciences , 107(28):12576–12580, June 2010

work page 2010

[37] [37]

R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior , 10(1):6–38, July 1995

work page 1995

[38] [38]

Mertikopoulos and W

P. Mertikopoulos and W. H. Sandholm. Learning in games via reinforcemen t and regularization. Mathematics of Operations Research , 41(4):1297–1324, November 2016

work page 2016

[39] [39]

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Si lver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research , pages 1928–1937. PMLR, Jun 2016

work page 1928

[40] [40]

Kevin P. Murphy. Probabilistic Machine Learning: Advanced Topics . MIT Press, 2023

work page 2023

[41] [41]

M. Nagumo. ¨Uber die Lage der Integralkurven gew¨ ohnlicher Diﬀerentialgleichunge n. Proceedings of the Physico-Mathematical Society of Japan. 3rd Series , 24:551–559, 1942

work page 1942

[42] [42]

Olfati-Saber

R. Olfati-Saber. Flocking for multi-agent dynamic systems: Algori thms and theory. IEEE Transac- tions on Automatic Control , 51(3):401–420, 2006

work page 2006

[43] [43]

Parikh and S

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization , 1(3):127–239, 2014

work page 2014

[44] [44]

Peters, K

J. Peters, K. Mulling, and Y. Altun. Relative entropy policy search . Proceedings of the AAAI Con- ference on Artiﬁcial Intelligence , 24(1):1607–1612, July 2010

work page 2010

[45] [45]

Peyr´ e and M

G. Peyr´ e and M. Cuturi. Computational optimal transport: With appli cations to data science. Foun- dations and Trends in Machine Learning , 11(5-6):355–607, 2019

work page 2019

[46] [46]

A. M. Reynolds, G. E. McIvor, A. Thornton, P. Yang, and N. T. Ouellette. Stochastic modelling of bird ﬂocks: accounting for the cohesiveness of collective motion. Journal of the Royal Society Interface , 19(189):20210745, 2022

work page 2022

[47] [47]

C. W. Reynolds. Flocks, herds and schools: A distributed beha vioral model. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniqu es, page 25–34, 1987

work page 1987

[48] [48]

C. W. Reynolds. Flocks, herds, and schools: A distributed beh avioral model. Computer Graphics , 21(4):25–34, 1987. 32

work page 1987

[49] [49]

Tyrrell Rockafellar

R. Tyrrell Rockafellar. Convex Analysis . Princeton University Press, 1970

work page 1970

[50] [50]

Russo, M

G. Russo, M. Di Bernardo, and E. D. Sontag. Global entrainment of transcr iptional systems to periodic inputs. PLoS Computational Biology , 6(4):e1000739, 2010

work page 2010

[51] [51]

Russo, M

G. Russo, M. Di Bernardo, and E. D. Sontag. A contraction approach to the hi erarchical analysis and design of networked systems. IEEE Transactions on Automatic Control , 58(5):1328–1331, 2013

work page 2013

[52] [52]

W. H. Sandholm. Population Games and Evolutionary Dynamics . MIT Press, 2010

work page 2010

[53] [53]

Shaﬁei, H

A. Shaﬁei, H. Jesawada, K. Friston, and G. Russo. Distributionally rob ust free energy principle for decision-making. In Nature Communications, 2025

work page 2025

[54] [54]

Snow and J

M. Snow and J. Orchard. Biological softmax: Demonstrated in modern Hopﬁ eld networks. In Pro- ceedings of the Annual Meeting of the Cognitive Science Society , volume 44, 2022

work page 2022

[55] [55]

E. D. Sontag. Contractive systems with inputs. In J. C. Willems, S. Hara, Y. Ohta, and H. Fujioka, editors, Perspectives in Mathematical System Theory, Control, and Signal Pr ocessing, pages 217–228. Springer, 2010

work page 2010

[56] [56]

D. J. T. Sumpter. The principles of collective animal behaviour . Philosophical Transactions of The Royal Society B: Biological Sciences , 361(1465):5–22, 2006

work page 2006

[57] [57]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . MIT Press, 1998

work page 1998

[58] [58]

Tunstrøm, Y

K. Tunstrøm, Y. Katz, C. C. Ioannou, C. Huepe, M. J. Lutz, and I. D. Couzi n. Collective States, Multistability and Transitional Behavior in Schooling Fish. PLOS Computational Biology , 9(2):1–11, 02 2013

work page 2013

[59] [59]

A. Ullah. Entropy, divergence and distance measures with econometri c applications. Journal of Statistical Planning and Inference , 49(1):137–162, 1996

work page 1996

[60] [60]

Vicsek, A

T. Vicsek, A. Czir´ ok, E. Ben-Jacob, I. Cohen, and O. Shochet. Novel type of phase transition in a system of self-driven particles. Physical Review Letters , 75(6-7):1226–1229, 1995

work page 1995

[61] [61]

S. Xie, G. Russo, and R. H. Middleton. Scalability in nonlinear networ k systems aﬀected by delays and disturbances. IEEE Transactions on Control of Network Systems , 8(3):1128–1138, 2021

work page 2021

[62] [62]

A. L. Yuille and D. Geiger. Winner-take-all mechanisms. In The Handbook of Brain Theory and Neural Networks , 1995. 33

work page 1995