arXiv: 2606.05140 · v1 · pith:CWDQWXVUnew · submitted 2026-06-03 · 🧮 math.AP · math-ph · math.MP · math.PR · stat.ML

Phase transitions for the noisy transformer model in arbitrary dimension

Pith reviewed 2026-06-28 04:58 UTC · model grok-4.3

classification 🧮 math.AP · math-ph · math.MP · math.PR · stat.ML
keywords McKean-Vlasov free energy · phase transition · self-attention model · global minimizer · modified Bessel function · Beckner-Onofri inequality · sphere · noisy transformer

The pith

A unique noise threshold β_*^{(d)}, defined by a Bessel function ratio, separates continuous from discontinuous phase transitions in the unnormalized self-attention (USA) model on the sphere in every dimension d ≥ 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a sharp global-minimizer dichotomy for the McKean-Vlasov free energy of the unnormalized self-attention model on the unit sphere. There exists a unique β_*^{(d)} > 0 satisfying I_{d/2+1}(β_*^{(d)})/I_{d/2}(β_*^{(d)}) = 1/d. Below this noise level the uniform density stays the unique global minimizer up to the explicit linear-stability threshold K_#^{(d)}(β) and the transition is continuous; above it the uniform density already fails to be globally minimizing at that threshold so the true critical coupling is strictly smaller and the transition is discontinuous. The proof combines the sharp Beckner-Onofri inequality with Funk-Hecke expansions and a quartic obstruction. A reader would care because the character of the transition governs whether attention patterns emerge gradually or jump as coupling strength grows.

Core claim

There is a unique β_*^{(d)}>0 such that I_{d/2+1}(β_*^{(d)})/I_{d/2}(β_*^{(d)})=1/d. For 0<β≤β_*^{(d)}, the uniform density remains the unique global minimizer up to the linear-stability threshold K_#^{(d)}(β)=β^{d/2}/(2^{d/2}Γ(d/2)I_{d/2}(β)), and the phase transition is continuous. For β>β_*^{(d)}, the uniform density is not globally minimizing at K_#^{(d)}(β), so the critical coupling satisfies K_c < K_#^{(d)}(β) and the transition is discontinuous. This holds in every dimension d≥2.
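The defining equation for β_*^{(d)} can be solved numerically. The following is a minimal Python sketch, not code from the paper: the names `hyp_sum`, `bessel_ratio` and `beta_star` and the bisection bracket [1e-6, 50] are illustrative assumptions. It uses the standard power series I_ν(x) = (x/2)^ν/Γ(ν+1) · Σ_k (x²/4)^k/(k!(ν+1)_k), so the Γ factors cancel in the ratio, and it relies on the ratio crossing 1/d exactly once, as the claim states.

```python
def hyp_sum(nu, x, tol=1e-16):
    """S_nu(x) = sum_k (x^2/4)^k / (k! (nu+1)_k).

    With this sum, I_nu(x) = (x/2)^nu / Gamma(nu+1) * S_nu(x).
    """
    term, total, k = 1.0, 1.0, 0
    q = x * x / 4.0
    while True:
        # Ratio of consecutive series terms.
        term *= q / ((k + 1) * (nu + k + 1))
        total += term
        k += 1
        if term < tol * total:
            return total


def bessel_ratio(d, beta):
    """I_{d/2+1}(beta) / I_{d/2}(beta); the Gamma prefactors cancel."""
    nu = d / 2.0
    return (beta / 2.0) / (nu + 1.0) * hyp_sum(nu + 1.0, beta) / hyp_sum(nu, beta)


def beta_star(d, lo=1e-6, hi=50.0, tol=1e-12):
    """Bisection for the root of bessel_ratio(d, beta) = 1/d.

    Assumes the root lies in [lo, hi] (an illustrative bracket) and that the
    ratio is below 1/d to the left of the root and above it to the right.
    """
    target = 1.0 / d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bessel_ratio(d, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for d in (2, 3, 4):
        print(d, beta_star(d))
```

For d = 2 the computed root falls between 2.4 and 2.5, consistent with tabulated values of I_1 and I_2.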

What carries the argument

The critical inverse temperature β_*^{(d)}, defined by the modified-Bessel ratio I_{d/2+1}(β_*^{(d)})/I_{d/2}(β_*^{(d)})=1/d. Through the Beckner-Onofri inequality and the degree-two quartic obstruction, it determines whether the uniform density on the sphere remains globally minimizing at the linear-stability threshold.

If this is right

  • For β ≤ β_*^{(d)} the phase transition occurs continuously at K_c = K_#^{(d)}(β).
  • For β > β_*^{(d)} the phase transition occurs discontinuously at some K_c < K_#^{(d)}(β).
  • The uniform density is the unique global minimizer up to linear instability precisely when β ≤ β_*^{(d)}.
  • The dichotomy and the explicit formulas for β_*^{(d)} and K_#^{(d)}(β) hold in every dimension d ≥ 2.
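The consequences listed above can be turned into a small lookup. This is a sketch under stated assumptions, not the authors' code: `bessel_i`, `k_sharp` and `transition_type` are hypothetical names, and the regime test compares the Bessel ratio with 1/d directly, which matches "β ≤ β_*^{(d)}" only because the ratio tends to 0 as β → 0 and crosses 1/d once. The use of `math.gamma` limits it to moderate dimensions.

```python
import math


def bessel_i(nu, x, tol=1e-16):
    """Modified Bessel function I_nu(x), x > 0, from its power series."""
    term = (x / 2.0) ** nu / math.gamma(nu + 1.0)
    total, k, q = term, 0, x * x / 4.0
    while True:
        term *= q / ((k + 1) * (nu + k + 1))
        total += term
        k += 1
        if term < tol * total:
            return total


def k_sharp(d, beta):
    """Linear-stability threshold K_# = beta^{d/2} / (2^{d/2} Gamma(d/2) I_{d/2}(beta))."""
    nu = d / 2.0
    return beta ** nu / (2.0 ** nu * math.gamma(nu) * bessel_i(nu, beta))


def transition_type(d, beta):
    """'continuous' if I_{d/2+1}/I_{d/2} <= 1/d at beta, else 'discontinuous'."""
    nu = d / 2.0
    ratio = bessel_i(nu + 1.0, beta) / bessel_i(nu, beta)
    return "continuous" if ratio <= 1.0 / d else "discontinuous"


if __name__ == "__main__":
    for beta in (1.0, 2.0, 3.0):
        print(beta, k_sharp(2, beta), transition_type(2, beta))
```

In the continuous case the returned `k_sharp` value is the critical coupling itself; in the discontinuous case it is only a strict upper bound on K_c.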

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit Bessel ratio lets one compute the transition type for any concrete dimension without running dynamics.
  • In high dimensions the discontinuous regime may occupy most of the relevant noise range.
  • Similar global-minimizer switches may occur in other mean-field models on spheres that admit Beckner-Onofri inequalities.
  • Direct minimization of the free energy at K = K_# for β slightly above β_* would provide an independent check of the energy drop.
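The first bullet can be made concrete with a short scan over dimensions. The sketch below is an illustration, not part of the paper: `ratio_cf` and `beta_star_cf` are hypothetical names, and the bracket [1e-6, 10] and truncation depth are assumptions. It evaluates the Bessel ratio by the continued fraction r_ν = 1/(2(ν+1)/x + r_{ν+1}), which follows from the recurrence I_ν(x) − I_{ν+2}(x) = (2(ν+1)/x) I_{ν+1}(x), and then bisects. It only tabulates β_*^{(d)} and does not by itself say which range of β is relevant in practice.

```python
def ratio_cf(nu, x, depth=200):
    """I_{nu+1}(x) / I_nu(x) via backward evaluation of the continued fraction.

    Starts from the approximation r_{nu+depth} = 0 and applies
    r_{nu+j-1} = 1 / (2 (nu + j) / x + r_{nu+j}) for j = depth, ..., 1.
    """
    r = 0.0
    for j in range(depth, 0, -1):
        r = 1.0 / (2.0 * (nu + j) / x + r)
    return r


def beta_star_cf(d, lo=1e-6, hi=10.0, tol=1e-12):
    """Bisection for I_{d/2+1}(beta)/I_{d/2}(beta) = 1/d on an assumed bracket."""
    nu, target = d / 2.0, 1.0 / d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio_cf(nu, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for d in (2, 3, 5, 10, 50, 200):
        print(d, round(beta_star_cf(d), 6))
```

For the dimensions listed the computed thresholds decrease with d, and for d = 200 the value is within 0.02 of 1. This agrees with the small-argument approximation I_{ν+1}(x)/I_ν(x) ≈ x/(2ν+2), which suggests β_*^{(d)} ≈ (d+2)/d for large d.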

Load-bearing premise

The sharp Beckner-Onofri logarithmic HLS inequality on the sphere together with the Funk-Hecke Bessel coefficient computation and the degree-two quartic obstruction suffice to establish the global-minimizer dichotomy without higher-order stability analysis or further assumptions on the measure.

What would settle it

Numerical evaluation showing that for any β larger than β_*^{(d)} the free energy of a suitable non-uniform test measure on the sphere is strictly lower than the uniform value exactly at coupling strength K=K_#^{(d)}(β).

original abstract

We study the McKean--Vlasov free energy on the unit sphere associated with the unnormalized self-attention (USA) model for noisy transformer dynamics. We prove a sharp global-minimizer dichotomy in every dimension $d\ge2$. There is a unique $\beta_*^{(d)}>0$ such that \begin{equation*} \frac{I_{d/2+1}(\beta_*^{(d)})}{I_{d/2}(\beta_*^{(d)})}=\frac1d, \end{equation*} where $I_\nu$ is the modified Bessel function of the first kind. For $0<\beta\le \beta_*^{(d)}$, the uniform density remains the unique global minimizer up to the linear-stability threshold \begin{equation*} K_\#^{(d)}(\beta)=\frac{\beta^{d/2}}{2^{d/2}\Gamma(d/2)I_{d/2}(\beta)}, \end{equation*} and the phase transition is continuous. For $\beta>\beta_*^{(d)}$, the uniform density is not globally minimizing at $K_\#^{(d)}(\beta)$, so the critical coupling satisfies $K_c<K_\#^{(d)}(\beta)$ and the transition is discontinuous. This result generalizes the authors' recent $d=2$ work arXiv:2604.16288 to arbitrary dimension. The proof uses the sharp Beckner--Onofri/logarithmic Hardy-Littlewood-Sobolev (HLS) inequality on the sphere, together with a Funk--Hecke/Bessel coefficient computation and a degree-two quartic obstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proves a sharp global-minimizer dichotomy for the McKean-Vlasov free energy of the unnormalized self-attention (USA) model on the unit sphere S^{d-1} in every dimension d≥2. It identifies the unique β_*^{(d)}>0 solving I_{d/2+1}(β_*)/I_{d/2}(β_*)=1/d and shows that for 0<β≤β_* the uniform density is the unique global minimizer up to the linear-stability threshold K_#^{(d)}(β)=β^{d/2}/(2^{d/2}Γ(d/2)I_{d/2}(β)), yielding a continuous phase transition, while for β>β_* the uniform density is not globally minimizing at K_#, so the critical coupling satisfies K_c<K_# and the transition is discontinuous. The argument adapts the sharp Beckner-Onofri/logarithmic HLS inequality via the Funk-Hecke formula (producing explicit modified-Bessel eigenvalues) together with the sign of the degree-2 quartic Landau coefficient.

Significance. If the result holds, the manuscript supplies a complete, parameter-free characterization of the phase transition for the noisy transformer model in arbitrary dimension, directly generalizing the authors' d=2 analysis. The proof strategy—combining the sharp sphere HLS inequality with explicit Bessel-coefficient computations and a quartic obstruction—avoids higher-mode stability analysis in the continuous regime and furnishes an explicit descent direction in the discontinuous regime; this is a substantive technical advance.

minor comments (2)
  1. [Abstract] Displayed equation for K_#^{(d)}(β): the linear-stability threshold is stated without a one-sentence reminder that it arises from the vanishing of the l=1 or l=2 eigenvalue of the linearized operator; a parenthetical reference to the relevant Funk-Hecke mode would improve readability.
  2. [Abstract] The proof-strategy paragraph in the abstract invokes the 'degree-two quartic obstruction' without naming the spherical-harmonic degree; a brief clause identifying it as the l=2 mode would align the abstract with the later technical development.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance of the result, and recommendation to accept the manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation defines β_*^{(d)} via the standard Bessel ratio equation I_{d/2+1}(β)/I_{d/2}(β)=1/d, which is an independent mathematical fact about modified Bessel functions rather than a fitted or self-referential quantity. The global-minimizer dichotomy is established by adapting the external sharp Beckner-Onofri/logarithmic HLS inequality on the sphere (via Funk-Hecke formula yielding explicit eigenvalues) and computing the sign of the degree-2 quartic coefficient, which changes at the same β_* by direct expansion. The sole self-citation is to the authors' d=2 work for generalization context; the general-d proof is self-contained against the cited external inequalities and does not reduce any central claim to a self-citation chain, definition, or fitted input. No load-bearing step matches the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on standard results in analysis and special functions from prior literature; no free parameters are fitted to data and no new entities are postulated.

axioms (3)
  • [standard math] The sharp Beckner-Onofri/logarithmic Hardy-Littlewood-Sobolev inequality holds on the sphere.
    Invoked to bound the free energy and establish global minimality of the uniform measure.
  • [standard math] The Funk-Hecke formula computes the spherical harmonic coefficients of the interaction kernel.
    Used to obtain the explicit Bessel-function expressions for the linear stability threshold.
  • [standard math] Modified Bessel functions of the first kind satisfy the stated ratio equation at the critical β_*.
    Defines the dimension-dependent threshold separating the two regimes.
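The Funk-Hecke step can be checked numerically in the simplest higher-dimensional case. The sketch below is an illustration under assumptions that are not taken from the paper: it uses the kernel e^{β x·y} on the unit sphere in R^3 with the unnormalized surface measure, for which the degree-l eigenvalue is 2π ∫_{-1}^{1} e^{βt} P_l(t) dt. It compares this with the closed form (2π)^{3/2} β^{-1/2} I_{l+1/2}(β). The paper's own normalization of kernel and measure is not given in the text above, so its constants may differ. All function names are hypothetical.

```python
import math


def bessel_i(nu, x, tol=1e-16):
    """Modified Bessel function I_nu(x), x > 0, from its power series."""
    term = (x / 2.0) ** nu / math.gamma(nu + 1.0)
    total, k, q = term, 0, x * x / 4.0
    while True:
        term *= q / ((k + 1) * (nu + k + 1))
        total += term
        k += 1
        if term < tol * total:
            return total


def legendre(l, t):
    """Legendre polynomial P_l(t) via the Bonnet recurrence."""
    if l == 0:
        return 1.0
    p_prev, p = 1.0, t
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * t * p - n * p_prev) / (n + 1)
    return p


def eigenvalue_quadrature(l, beta, n=2000):
    """2*pi * int_{-1}^{1} exp(beta*t) P_l(t) dt by composite Simpson (n even)."""
    h = 2.0 / n
    s = 0.0
    for i in range(n + 1):
        t = -1.0 + i * h
        w = 1 if i in (0, n) else (4 if i % 2 else 2)
        s += w * math.exp(beta * t) * legendre(l, t)
    return 2.0 * math.pi * s * h / 3.0


def eigenvalue_bessel(l, beta):
    """Closed form (2*pi)^{3/2} * beta^{-1/2} * I_{l+1/2}(beta) for the sphere in R^3."""
    return (2.0 * math.pi) ** 1.5 * beta ** -0.5 * bessel_i(l + 0.5, beta)


if __name__ == "__main__":
    for l in (0, 1, 2):
        print(l, eigenvalue_quadrature(l, 1.7), eigenvalue_bessel(l, 1.7))
```

The degree-1 and degree-2 eigenvalues involve I_{3/2} and I_{5/2}, which for d = 3 are exactly the orders d/2 and d/2+1 appearing in the threshold formula and in the equation for β_*^{(d)}.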

pith-pipeline@v0.9.1-grok · 5831 in / 1880 out tokens · 47091 ms · 2026-06-28T04:58:25.946907+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 13 canonical work pages

  1. [1]

    The Kuramoto model: A simple paradigm for synchronization phenomena

    Juan A. Acebrón, L. L. Bonilla, Conrad J. Pérez Vicente, Félix Ritort, and Renato Spigler. The Kuramoto model: A simple paradigm for synchronization phenomena. Reviews of Modern Physics, 77(1):137--185, 2005. doi:10.1103/RevModPhys.77.137

  2. [2]

    Spherical Harmonics and Approximations on the Unit Sphere: An Introduction

    Kendall Atkinson and Weimin Han. Spherical Harmonics and Approximations on the Unit Sphere: An Introduction, volume 2044 of Lecture Notes in Mathematics. Springer, 2012

  3. [3]

    On the structure of stationary solutions to McKean--Vlasov equations with applications to noisy transformers

    Krishnakumar Balasubramanian, Sayan Banerjee, and Philippe Rigollet. On the structure of stationary solutions to McKean--Vlasov equations with applications to noisy transformers. arXiv preprint, 2025. arXiv:2510.20094

  4. [4]

    Sharp Sobolev inequalities on the sphere and the Moser--Trudinger inequality

    William Beckner. Sharp Sobolev inequalities on the sphere and the Moser--Trudinger inequality. Annals of Mathematics, 138(1):213--242, 1993. doi:10.2307/2946638

  5. [5]

    Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization

    Martin Burger, Samira Kabri, Yury Korolev, Tim Roith, and Lukas Weigand. Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 383(2298):20240233, 2025. doi:10.1098/rsta.2024.0233

  6. [6]

    Long-time behaviour and phase transitions for the McKean--Vlasov equation on a torus

    José A. Carrillo, Rishabh S. Gvalani, Grigorios A. Pavliotis, and André Schlichting. Long-time behaviour and phase transitions for the McKean--Vlasov equation on a torus. Archive for Rational Mechanics and Analysis, 235(1):635--690, 2020. doi:10.1007/s00205-019-01430-4

  7. [7]

    Competing symmetries, the logarithmic HLS inequality and Onofri's inequality on S^n

    Eric A. Carlen and Michael Loss. Competing symmetries, the logarithmic HLS inequality and Onofri's inequality on S^n. Geometric and Functional Analysis, 2(1):90--104, 1992. doi:10.1007/BF01895706

  8. [8]

    The McKean--Vlasov equation in finite volume

    Lincoln Chayes and Vladislav Panferov. The McKean--Vlasov equation in finite volume. Journal of Statistical Physics, 138(1--3):351--380, 2010. doi:10.1007/s10955-009-9913-z

  9. [9]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018

  10. [10]

    Redesigning the Transformer architecture with insights from multi-particle dynamical systems

    Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, and Tanmoy Chakraborty. Redesigning the Transformer architecture with insights from multi-particle dynamical systems. Advances in Neural Information Processing Systems, 34:5531--5544, 2021

  11. [11]

    NIST Digital Library of Mathematical Functions

    NIST Digital Library of Mathematical Functions. https://dlmf.nist.gov/, Release 1.2.6 of 2026-03-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds

  12. [12]

    M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1--47, 1975. doi:10.1002/cpa.3160280102

  13. [13]

    A proposal on machine learning via dynamical systems

    Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1--11, 2017. doi:10.1007/s40304-017-0103-z

  14. [14]

    The emergence of clusters in self-attention dynamics

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Advances in Neural Information Processing Systems 36, pages 57026--57037, 2023. arXiv:2305.05465

  15. [15]

    A mathematical perspective on transformers

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427--479, 2025. doi:10.1090/bull/1863

  16. [16]

    Stable architectures for deep neural networks

    Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017. doi:10.1088/1361-6420/aa9a90

  17. [17]

    Self-entrainment of a population of coupled non-linear oscillators

    Yoshiki Kuramoto. Self-entrainment of a population of coupled non-linear oscillators. In International Symposium on Mathematical Problems in Theoretical Physics (Kyoto Univ., Kyoto, 1975), volume 39 of Lecture Notes in Physics, pages 420--422. Springer, Berlin-New York, 1975

  18. [18]

    Understanding and improving Transformer from a multi-particle dynamic system point of view

    Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. Understanding and improving Transformer from a multi-particle dynamic system point of view. In International Conference on Learning Representations, 2020

  19. [19]

    N. A. Lebedev and I. M. Milin. An inequality. Vestnik Leningrad. Univ., 20(19):157--158, 1965

  20. [20]

    Phase transitions and linear stability for the mean-field Kuramoto--Daido model

    Kyunghoo Mun and Matthew Rosenzweig. Phase transitions and linear stability for the mean-field Kuramoto--Daido model. arXiv preprint, 2026. arXiv:2602.14954

  21. [21]

    Phase transitions in Doi--Onsager, noisy transformer, and other multimodal models

    Kyunghoo Mun and Matthew Rosenzweig. Phase transitions in Doi--Onsager, noisy transformer, and other multimodal models. arXiv preprint, 2026. arXiv:2604.16288

  22. [22]

    The mean-field dynamics of transformers

    Philippe Rigollet. The mean-field dynamics of transformers. arXiv preprint, 2025. arXiv:2512.01868

  23. [23]

    Sinkformers: Transformers with doubly stochastic attention

    Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515--3530. PMLR, 2022

  24. [24]

    Monotonicity properties for ratios and products of modified Bessel functions and sharp trigonometric bounds

    Javier Segura. Monotonicity properties for ratios and products of modified Bessel functions and sharp trigonometric bounds. Results in Mathematics, 76:221, 2021. doi:10.1007/s00025-021-01531-1

  25. [25]

    Solutions of stationary McKean--Vlasov equation on a high-dimensional sphere and other Riemannian manifolds

    Anna Shalova and André Schlichting. Solutions of stationary McKean--Vlasov equation on a high-dimensional sphere and other Riemannian manifolds. Advances in Nonlinear Analysis, 15(1):20250141, 2026. doi:10.1515/anona-2025-0141

  26. [26]

    From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators

    Steven H. Strogatz. From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D: Nonlinear Phenomena, 143(1--4):1--20, 2000. doi:10.1016/S0167-2789(00)00094-4

  27. [27]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998--6008, 2017. arXiv:1706.03762