pith. sign in

Perceptrons and localization of attention's mean-field landscape

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it
abstract

The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

citation-role summary

background 2

citation-polarity summary

years

2026 5

roles

background 2

polarities

background 2

representative citing papers

The physics of AI weather models

physics.ao-ph · 2026-05-22 · unverdicted · novelty 7.0

AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.

Propagation of Chaos in Contextual Flow Maps

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

citing papers explorer

Showing 5 of 5 citing papers.

  • Kinetic theory for Transformers and the lost-in-the-middle phenomenon math.AP · 2026-05-09 · conditional · none · ref 1 · internal anchor

    A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

  • The physics of AI weather models physics.ao-ph · 2026-05-22 · unverdicted · none · ref 49 · internal anchor

    AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.

  • Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models math.PR · 2026-04-29 · unverdicted · none · ref 2 · internal anchor

    Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.

  • Propagation of Chaos in Contextual Flow Maps cs.LG · 2026-05-16 · unverdicted · none · ref 2 · internal anchor

    Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

  • Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime math.AP · 2026-05-11 · unverdicted · none · ref 5 · internal anchor

    In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).