arxiv: 2601.21366 · v2 · pith:L3DHXBBNnew · submitted 2026-01-29 · 💻 cs.LG · math.OC

Perceptrons and localization of attention's mean-field landscape

Pith reviewed 2026-05-16 09:46 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords transformer · mean-field limit · Wasserstein gradient flow · attention mechanism · perceptron · localization · critical points · particle system

The pith

The perceptron block makes critical points of the mean-field attention energy atomic and localized on the sphere.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the Transformer forward pass as an interacting particle system on the unit sphere, with layers as time evolution and token embeddings as particles. In weight settings where the dynamics reduce to a Wasserstein gradient flow of an explicit energy, the perceptron block is incorporated into the potential. The central result shows that critical points in the mean-field limit are generically atomic measures supported on finite subsets of the sphere. This matters for understanding how attention concentrates rather than diffusing uniformly across infinite context. A reader would care because it supplies a rigorous particle-dynamics picture for why transformers develop sparse, focused representations at scale.

Core claim

Critical points are generically atomic and localized on subsets of the sphere. In the mean-field limit of the interacting particle system representing the Transformer forward pass, inclusion of the perceptron block forces stationary points of the associated energy to be discrete measures concentrated at finitely many points on the unit sphere.
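
To make the claim concrete, here is one way the energy and its critical-point condition could be written. This is a sketch under assumptions: the quadratic interaction term is the usual unnormalized self-attention energy from the mean-field Transformer literature, while the perceptron potential $V_\vartheta$, with $\vartheta = (\omega_j, a_j)_j$ and activation $\sigma$, is a hypothetical form guessed from the notation in the figure captions. It is not copied from the paper, whose exact definition may differ.

```latex
% Assumed energy on probability measures \mu over the unit sphere S^{d-1}
\mathsf{E}_{\beta,\vartheta}[\mu]
  = \frac{1}{2\beta}\iint e^{\beta\langle x, y\rangle}\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y)
  + \int V_\vartheta(x)\,\mathrm{d}\mu(x),
\qquad
V_\vartheta(x) = \sum_j \omega_j\,\sigma(\langle a_j, x\rangle).

% First variation of the assumed energy
\frac{\delta \mathsf{E}_{\beta,\vartheta}}{\delta\mu}[\mu](x)
  = \frac{1}{\beta}\int e^{\beta\langle x, y\rangle}\,\mathrm{d}\mu(y) + V_\vartheta(x).

% Critical point of the Wasserstein gradient flow (where V_\vartheta is differentiable)
\nabla_{\mathbb{S}^{d-1}}\,\frac{\delta \mathsf{E}_{\beta,\vartheta}}{\delta\mu}[\mu](x) = 0
\quad \text{for } \mu\text{-a.e. } x.
```

Under these assumptions, an atomic measure $\mu = \sum_i m_i \delta_{x_i}$ is critical exactly when the tangential part of $\sum_j m_j e^{\beta\langle x_i, x_j\rangle} x_j + \nabla V_\vartheta(x_i)$ vanishes at every atom $x_i$, which is a finite system of equations.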

What carries the argument

The Wasserstein gradient flow of an explicit energy functional on probability measures supported on the unit sphere, where the perceptron term supplies the confining potential that drives localization of the particles.
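
A finite-particle discretization may help picture this flow. The sketch below is an illustration, not the authors' code (their repository is linked in the figure captions). It assumes an energy of the form (1/(2β n²)) Σ_{i,j} exp(β⟨x_i, x_j⟩) + (1/n) Σ_i V(x_i) with a hypothetical ReLU potential V(x) = Σ_k ω_k ReLU(⟨a_k, x⟩), and uses explicit Euler steps followed by renormalization onto the sphere. The names `energy`, `tangent_velocity`, `step`, `simulate`, `A` and `omega` are all introduced here for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def energy(X, beta, omega, A):
    """Assumed energy of the empirical measure of the rows of X (unit vectors)."""
    n = X.shape[0]
    interaction = np.exp(beta * (X @ X.T)).sum() / (2.0 * beta * n**2)
    potential = (relu(X @ A.T) @ omega).mean()  # hypothetical perceptron potential
    return interaction + potential

def tangent_velocity(X, beta, omega, A):
    """Spherical gradient of the assumed first variation at each particle."""
    n = X.shape[0]
    attn = np.exp(beta * (X @ X.T)) @ X / n   # (1/n) sum_j exp(beta <x_i, x_j>) x_j
    mlp = ((X @ A.T > 0.0) * omega) @ A       # sum_k omega_k 1[<a_k, x_i> > 0] a_k
    g = attn + mlp
    # Remove the radial component so the velocity is tangent to the sphere at x_i.
    return g - np.sum(g * X, axis=1, keepdims=True) * X

def step(X, beta, omega, A, dt=0.01, ascent=True):
    """One explicit Euler step, then renormalize each particle to unit length."""
    sign = 1.0 if ascent else -1.0
    Y = X + sign * dt * tangent_velocity(X, beta, omega, A)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def simulate(n=200, d=2, beta=1.0, steps=500, dt=0.01, ascent=True, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform initialization on the sphere
    A = rng.normal(size=(2, d))                    # hypothetical directions a_k
    omega = np.array([1.0, -0.5])                  # hypothetical output weights omega_k
    for _ in range(steps):
        X = step(X, beta, omega, A, dt, ascent)
    return X, A, omega
```

Calling `simulate(ascent=False)` switches to descent. Whether the particles visibly cluster depends on β, the step size, the number of steps, and the assumed weights, so the sketch should be read as a way to experiment with the picture rather than as a reproduction of the paper's figures.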

If this is right

  • Stationary states reduce to finite collections of point masses on the sphere.
  • Localization occurs generically once the perceptron nonlinearity is present.
  • The mean-field analysis now applies directly to the full perceptron-plus-attention block.
  • Long-term behavior in the infinite-context limit is governed by dynamics among finitely many representative embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Localized critical points may correspond to attention heads effectively selecting a small number of prototype token representations.
  • Empirical attention maps in trained models could be checked for concentration on discrete supports to test the prediction.
  • The framework suggests studying how these atomic supports evolve or merge across successive layers.
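
One simple way to test for concentration on a discrete support is to group nearby embeddings and report the mass of each group. The sketch below is an assumption-laden illustration: `count_atoms` and the tolerance `tol` are hypothetical names and choices, and greedy grouping by Euclidean distance is a stand-in for whatever cluster-counting rule the paper's experiments actually use.

```python
import numpy as np

def count_atoms(X, tol=1e-2):
    """Greedily group rows of X that lie within `tol` of an existing center.

    Returns the centers (first point seen in each group) and the fraction
    of points assigned to each center.
    """
    centers, counts = [], []
    for x in X:
        for k, c in enumerate(centers):
            if np.linalg.norm(x - c) < tol:
                counts[k] += 1
                break
        else:
            # No existing center is close enough: start a new group.
            centers.append(x)
            counts.append(1)
    return np.array(centers), np.array(counts) / len(X)
```

Applied to final-time particle positions or to normalized token embeddings from a trained model, a small number of groups carrying almost all the mass would be consistent with an atomic stationary measure. The result is sensitive to `tol` and to the order of the rows, so it is best treated as a rough diagnostic.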

Load-bearing premise

In certain weight configurations the system dynamics can be expressed exactly as the gradient flow of an explicit energy functional on measures.

What would settle it

A numerical or analytic construction of a non-atomic stationary measure for the perceptron-augmented energy in the mean-field limit would falsify the genericity of atomic critical points.

Figures

Figures reproduced from arXiv: 2601.21366 by Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet.

Figure 1. Gradient descent with ReLU perceptron in d = 2, starting from the uniform measure, with β = 1. Top: particle histograms for the dynamics at initial, intermediate and final times. Bottom left: final configuration for pure self-attention without a perceptron. Bottom right: the energy E_{β,ϑ} with (blue) and without (pink) the perceptron. Background shading shows the perceptron landscape (green: > 0; orange: < 0…
Figure 2. Cluster masses (in blue, the largest being the thickest) at final time across √β for gradient descent with GeLU perceptron, initialized with N = 1000 points of mass 10^{-3} (see Section 4 for setup). The horizontal and red dashed lines represent the numerical term and the full upper bound in (3.3), respectively. Moreover, if ϑ = (ω_j, a_j)_j satisfies |ω_1||a_1|^2 + |ω_2||a_2|^2 < 0.331, then supp μ can’t be a si…
Figure 3. Gradient ascent on S^1 with ReLU perceptron. Left: pure self-attention. Middle: self-attention with a ReLU perceptron. Right: measure at final time. Background shading represents the potential landscape (green: positive; orange: negative values). Code can be found at https://github.com/antonioalvarezl/2026-MLP-Attention-Energy.
Figure 4. Gradient ascent on S^2 with β = 1. Top row: pure self-attention. Bottom row: self-attention with ReLU perceptron. An animation is available at https://github.com/antonioalvarezl/2026-MLP-Attention-Energy/blob/main/examples/USAS2.gif. Panel labels: (a) β = 0.05, (b) β = 5, (c) β = 50.
Figure 5. Gradient descent on S^1 with ReLU perceptron. We repeat the same experiment using GeLU
Figure 6. Histograms at final time for gradient descent with GeLU perceptron. Four panels, each with axes φ and θ (ticks 0, π, 2π) and count (ticks 150, 300, 450).
Figure 7. Gradient descent on S^2 with β = 1. Top row: ReLU perceptron. Bottom row: GeLU perceptron. An animation is available at https://github.com/antonioalvarezl/2026-MLP-Attention-Energy/blob/main/examples/USAdS2.gif. trajectories on S^1 for three representative values of β. Scaling of support size. We count the number of clusters (atoms) at convergence for every setup considered across the swept values of β. Re…
Figure 8. Trajectories following softmax-normalized attention. Top row: gradient ascent with ReLU perceptron. Middle row: gradient descent with ReLU perceptron. Bottom row: gradient descent with GeLU perceptron. to a single Dirac mass [GLPR25, CLPR25b, GRRB24]. The same caveat applies when adding a perceptron (setup S1): for the fixed ϑ used throughout, the perceptron potential exhibits a single local maximum (cf […
Figure 9. Number of clusters at convergence across setups. S1/S1-0: Ascent with/without ReLU perceptron. S2: Descent with ReLU perceptron. S3: Descent with GeLU perceptron.
Figure 10. Superposition of stationary measures from 10 independent uniform initializations with β = 10 and σ = GeLU. Left: pure self-attention descent yields seed-dependent cluster locations. Middle: the perceptron drift in ascent breaks the symmetry and selects seed-independent clusters. Right: in descent, the perceptron still anchors a seed-independent stationary support, which is less concentrated and may incl…
Figure 11. Number of clusters at convergence across dimensions d ∈ {4, 5, 7, 10, 20, 50} for gradient descent using unnormalized self-attention with a ReLU perceptron. Plot axes: √β (ticks 0 to 7) against values from 0.0 to 1.0.
Figure 12. Cluster masses (in blue, the largest being the thickest) at final time across √β. The horizontal and red dashed lines represent the numerical term and the full upper bound in (3.3), respectively. Top row: Gradient descent with ReLU (d = 4, 5, 7). Bottom row: Same setup for higher dimensions (d = 10, 20, 50). (See Section 4 for setup). 5 Proofs. 5.1 Preliminaries. We denote the spherical harmonics in L^2(σ…
Original abstract

The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper models the Transformer forward pass as an interacting particle system on the unit sphere, with layers corresponding to time, token embeddings to particles, and layer normalization idealized by the sphere. It claims that in some weight settings this evolution is a gradient flow of an explicit energy, permitting analysis via Wasserstein gradient flows in the mean-field limit of infinite context length. The central result is that the perceptron block produces critical points that are generically atomic and localized on subsets of the sphere.

Significance. If the gradient-flow representation and mean-field limit are rigorously justified, the work supplies a mathematical framework linking Transformer dynamics to optimal transport, potentially explaining localization and atomicity in attention landscapes. This could inform theoretical analyses of deep networks beyond empirical observation, though the significance hinges on validating the energy functional for the perceptron block.

major comments (2)
  1. [Abstract] The claim that critical points are generically atomic and localized relies on treating the perceptron block as part of a Wasserstein gradient flow of an explicit energy. The abstract asserts only that this holds 'in some weight settings'; it neither derives the energy nor shows that the nonlinear perceptron update preserves the variational structure. This assumption is load-bearing for the mean-field analysis and must be made explicit.
  2. [Main result (perceptron block analysis)] The genericity statement for atomic critical points requires a precise characterization (e.g., via the second variation of the energy or stability analysis of non-atomic measures); without an equation or theorem establishing that non-atomic measures have strictly higher energy or are unstable under the flow, the localization conclusion remains formal rather than proven.
minor comments (1)
  1. [Abstract] Clarify the precise form of the energy functional and the weight settings under which the gradient-flow property holds, including any restrictions on the perceptron weights or activation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the variational structure and genericity claims. We address each major point below and will incorporate clarifications and additional details into the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that critical points are generically atomic and localized relies on treating the perceptron block as part of a Wasserstein gradient flow of an explicit energy. The abstract asserts only that this holds 'in some weight settings'; it neither derives the energy nor shows that the nonlinear perceptron update preserves the variational structure. This assumption is load-bearing for the mean-field analysis and must be made explicit.

    Authors: We agree the abstract should be more precise. Section 2 of the manuscript derives the explicit energy functional for the perceptron block under the indicated weight settings and verifies by direct computation that the nonlinear update preserves the gradient-flow structure (the variation matches the Wasserstein gradient of the energy). We will revise the abstract to read 'in some weight settings where the perceptron block preserves the variational structure of an explicit energy' and add a short remark after the abstract summarizing the preservation argument. revision: yes

  2. Referee: [Main result (perceptron block analysis)] The genericity statement for atomic critical points requires a precise characterization (e.g., via the second variation of the energy or stability analysis of non-atomic measures); without an equation or theorem establishing that non-atomic measures have strictly higher energy or are unstable under the flow, the localization conclusion remains formal rather than proven.

    Authors: Theorem 4.1 already characterizes critical points via the first variation of the energy and proves atomicity for generic weights by showing that the second variation is positive definite precisely on atomic measures supported on subsets of the sphere. Non-atomic measures are shown to have strictly higher energy through a strict-convexity argument in the Wasserstein metric. We will add an explicit formula for the second variation (Equation (4.3) in the revision) and a stability lemma establishing that non-atomic measures are unstable equilibria under the flow, thereby making the genericity statement fully rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the mean-field localization claim

full rationale

The paper models the Transformer forward pass as an interacting particle system on the unit sphere (with time as layers and particles as token embeddings), and states that in some weight settings this is a gradient flow of an explicit energy, enabling Wasserstein mean-field analysis. The claim that critical points are generically atomic and localized follows from this structure and the infinite-context limit, rather than reducing tautologically to the inputs, fitted parameters, or self-citations. No load-bearing step equates a prediction to its own definition or renames a known result; the derivation remains independent under the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on modeling the forward pass as a Wasserstein gradient flow on the sphere and treating the perceptron as part of an explicit energy; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • [domain assumption] The Transformer forward pass can be modeled as an interacting particle system on the unit sphere with time as layers.
    Stated in the abstract as the foundational view enabling the mean-field limit.
  • [domain assumption] In some weight settings the system is a gradient flow for an explicit energy.
    Required to apply Wasserstein gradient flow techniques to study critical points.

pith-pipeline@v0.9.0 · 5398 in / 1160 out tokens · 24837 ms · 2026-05-16T09:46:43.424606+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kinetic theory for Transformers and the lost-in-the-middle phenomenon

    math.AP · 2026-05 · conditional · novelty 8.0

    A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

  2. The physics of AI weather models

    physics.ao-ph · 2026-05 · unverdicted · novelty 7.0

    AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.

  3. Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

    math.PR · 2026-04 · unverdicted · novelty 7.0

    Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...

  4. Propagation of Chaos in Contextual Flow Maps

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

  5. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

    math.AP · 2026-05 · unverdicted · novelty 6.0

    In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 5 Pith papers · 4 internal anchors

  1. [1]

    [ÁLGRB26] Antonio Álvarez-López, Borjan Geshkovski, and Domènec Ruiz-Balet

    [AGRB25] Albert Alcalde, Borjan Geshkovski, and Domènec Ruiz-Balet. Attention’s forward pass and Frank-Wolfe. arXiv preprint arXiv:2508.09628,

  2. [2]

    Why do LLMs attend to the first token?

    [BAG+25] Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732, 2025.

  3. [3]

    Emergence of meta-stable clustering in mean-field transformer models

    [BPA25a] Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. Emergence of meta-stable clustering in mean-field transformer models. In International Conference on Learning Representations (ICLR 2025). Oral presentation.

  4. [4]

    A multiscale analysis of mean-field transformers in the moderate interaction regime

    [BPA25b] Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. A multiscale analysis of mean-field transformers in the moderate interaction regime. In Advances in Neural Information Processing Systems (NeurIPS 2025),

  5. [5]

    A phase transition between positional and semantic learning in a solvable model of dot-product attention

    [CBKZ25] Hugo Cui, Freya Behrens, Florent Krzakala, and Lenka Zdeborová. A phase transition between positional and semantic learning in a solvable model of dot-product attention. Journal of Statistical Mechanics: Theory and Experiment, 2025(7):074001,

  6. [6]

    Critical attention scaling in long-context transformers

    [CLPR25a] Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Critical attention scaling in long-context transformers. arXiv preprint arXiv:2510.05554,

  7. [7]

    Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

    [GG25] Alessio Giorlandino and Sebastian Goldt. Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation. arXiv preprint arXiv:2505.24333,

  8. [8]

    Formation of clusters and coarsening in weakly interacting diffusions

    [GGH+25] Nicolai Gerber, Rishabh Gvalani, Martin Hairer, Greg Pavliotis, and André Schlichting. Formation of clusters and coarsening in weakly interacting diffusions. arXiv preprint arXiv:2510.17629,

  9. [9]

    Measure-to-measure interpolation using transformers

    [GRRB24] Borjan Geshkovski, Philippe Rigollet, and Domènec Ruiz-Balet. Measure-to-measure interpolation using transformers. arXiv preprint arXiv:2411.04551,

  10. [10]

    On the number of modes of Gaussian kernel density estimators

    [GRS24] Borjan Geshkovski, Philippe Rigollet, and Yihang Sun. On the number of modes of Gaussian kernel density estimators. arXiv preprint arXiv:2412.09080,

  11. [11]

    Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

    [KBH24] Hugo Koubbi, Matthieu Boussard, and Louis Hernandez. The impact of lora on the emergence of clusters in transformers. arXiv preprint arXiv:2402.15415,

  12. [12]

    Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

    [KM25] Marko Karbevski and Antonij Mijoski. Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in decoder-only transformers. arXiv preprint arXiv:2510.23912,

  13. [13]

    Understanding and improving Transformer from a multi-particle dynamic system point of view

    [LLH+20] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. Understanding and improving Transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop: ODE/PDE + DL,

  14. [14]

    2 OLMo 2 Furious

    [OWS+25] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656,

  15. [15]

    Nonlinear diffusion limit of non-local interactions on a sphere

    [PS25] Mark A Peletier and Anna Shalova. Nonlinear diffusion limit of non-local interactions on a sphere. arXiv preprint arXiv:2512.03185,

  16. [16]

    Solutions of stationary McKean-Vlasov equation on a high-dimensional sphere and other Riemannian manifolds

    [SS24] Anna Shalova and André Schlichting. Solutions of stationary McKean-Vlasov equation on a high-dimensional sphere and other Riemannian manifolds. arXiv preprint arXiv:2412.14813,

  17. [17]

    Dissecting the interplay of attention paths in a statistical mechanics theory of transformers

    [TMIS24] Lorenzo Tiberi, Francesca Mignacco, Kazuki Irie, and Haim Sompolinsky. Dissecting the interplay of attention paths in a statistical mechanics theory of transformers. In Advances in Neural Information Processing Systems (NeurIPS 2024),