Phase transitions for the noisy transformer model in arbitrary dimension
Pith reviewed 2026-06-28 04:58 UTC · model grok-4.3
The pith
A unique noise threshold β_*^{(d)}, defined by a ratio of modified Bessel functions, separates continuous from discontinuous phase transitions in the unnormalized self-attention (USA) model on the sphere in every dimension d≥2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
There is a unique β_*^{(d)}>0 such that I_{d/2+1}(β_*^{(d)})/I_{d/2}(β_*^{(d)})=1/d. For 0<β≤β_*^{(d)}, the uniform density remains the unique global minimizer up to the linear-stability threshold K_#^{(d)}(β)=β^{d/2}/(2^{d/2}Γ(d/2)I_{d/2}(β)), and the phase transition is continuous. For β>β_*^{(d)}, the uniform density is not globally minimizing at K_#^{(d)}(β), so the critical coupling satisfies K_c < K_#^{(d)}(β) and the transition is discontinuous. This holds in every dimension d≥2.
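The two quantities in the claim can be evaluated numerically. The following is a minimal sketch, not the authors' code: it assumes moderate d and β so that a plain power series for I_ν is adequate, and the function names (bessel_i, bessel_ratio, beta_star, k_sharp, transition_type) are hypothetical helpers introduced here for illustration. It relies on the uniqueness of the root stated above, together with the fact that the ratio tends to 0 as β→0, to bracket β_*^{(d)} by bisection.

```python
import math


def bessel_i(nu: float, x: float) -> float:
    """Modified Bessel function of the first kind I_nu(x) for x > 0 (power series)."""
    half = 0.5 * x
    term = half ** nu / math.gamma(nu + 1.0)  # k = 0 term of the series
    total = term
    k = 0
    while term > 1e-17 * total:
        k += 1
        # ratio of consecutive series terms: (x/2)^2 / (k * (k + nu))
        term *= half * half / (k * (k + nu))
        total += term
    return total


def bessel_ratio(d: int, beta: float) -> float:
    """The ratio I_{d/2+1}(beta) / I_{d/2}(beta) from the definition of beta_*."""
    return bessel_i(d / 2 + 1, beta) / bessel_i(d / 2, beta)


def beta_star(d: int, tol: float = 1e-12) -> float:
    """Solve bessel_ratio(d, beta) = 1/d by bisection (assumes a unique root)."""
    target = 1.0 / d
    lo, hi = 0.0, 1.0
    while bessel_ratio(d, hi) < target:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bessel_ratio(d, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def k_sharp(d: int, beta: float) -> float:
    """Linear-stability threshold K_#^{(d)}(beta) as written in the core claim."""
    return beta ** (d / 2) / (2 ** (d / 2) * math.gamma(d / 2) * bessel_i(d / 2, beta))


def transition_type(d: int, beta: float) -> str:
    """Classify the transition according to the stated dichotomy."""
    return "continuous" if beta <= beta_star(d) else "discontinuous"


if __name__ == "__main__":
    for d in (2, 3, 4, 8):
        b = beta_star(d)
        print(d, b, k_sharp(d, b), transition_type(d, 0.5 * b))
```

For large β or very high d the series is a poor choice; a library routine for scaled Bessel functions would be the more robust option there.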
What carries the argument
The argument rests on the critical inverse temperature β_*^{(d)}, defined by the modified-Bessel ratio I_{d/2+1}(β_*^{(d)})/I_{d/2}(β_*^{(d)})=1/d. Via the Beckner-Onofri inequality and the degree-two quartic obstruction, this value determines whether the uniform density on the sphere remains globally minimizing at the linear-stability threshold.
If this is right
- For β ≤ β_*^{(d)} the phase transition occurs continuously at K_c = K_#^{(d)}(β).
- For β > β_*^{(d)} the phase transition occurs discontinuously at some K_c < K_#^{(d)}(β).
- The uniform density is the unique global minimizer up to linear instability precisely when β ≤ β_*^{(d)}.
- The dichotomy and the explicit formulas for β_*^{(d)} and K_#^{(d)}(β) hold in every dimension d ≥ 2.
Where Pith is reading between the lines
- The explicit Bessel ratio lets one compute the transition type for any concrete dimension without running dynamics.
- In high dimensions the discontinuous regime may occupy most of the relevant noise range.
- Similar global-minimizer switches may occur in other mean-field models on spheres that admit Beckner-Onofri inequalities.
- Direct minimization of the free energy at K = K_# for β slightly above β_* would provide an independent check of the energy drop.
Load-bearing premise
The sharp Beckner-Onofri logarithmic HLS inequality on the sphere together with the Funk-Hecke Bessel coefficient computation and the degree-two quartic obstruction suffice to establish the global-minimizer dichotomy without higher-order stability analysis or further assumptions on the measure.
What would settle it
Numerical evaluation showing that for any β larger than β_*^{(d)} the free energy of a suitable non-uniform test measure on the sphere is strictly lower than the uniform value exactly at coupling strength K=K_#^{(d)}(β).
Original abstract
We study the McKean--Vlasov free energy on the unit sphere associated with the unnormalized self-attention (USA) model for noisy transformer dynamics. We prove a sharp global-minimizer dichotomy in every dimension $d\ge2$. There is a unique $\beta_*^{(d)}>0$ such that \begin{equation*} \frac{I_{d/2+1}(\beta_*^{(d)})}{I_{d/2}(\beta_*^{(d)})}=\frac1d, \end{equation*} where $I_\nu$ is the modified Bessel function of the first kind. For $0<\beta\le \beta_*^{(d)}$, the uniform density remains the unique global minimizer up to the linear-stability threshold \begin{equation*} K_\#^{(d)}(\beta)=\frac{\beta^{d/2}}{2^{d/2}\Gamma(d/2)I_{d/2}(\beta)}, \end{equation*} and the phase transition is continuous. For $\beta>\beta_*^{(d)}$, the uniform density is not globally minimizing at $K_\#^{(d)}(\beta)$, so the critical coupling satisfies $K_c<K_\#^{(d)}(\beta)$ and the transition is discontinuous. This result generalizes the authors' recent $d=2$ work arXiv:2604.16288 to arbitrary dimension. The proof uses the sharp Beckner--Onofri/logarithmic Hardy-Littlewood-Sobolev (HLS) inequality on the sphere, together with a Funk--Hecke/Bessel coefficient computation and a degree-two quartic obstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves a sharp global-minimizer dichotomy for the McKean-Vlasov free energy of the unnormalized self-attention (USA) model on the unit sphere S^{d-1} in every dimension d≥2. It identifies the unique β_*^{(d)}>0 solving I_{d/2+1}(β_*)/I_{d/2}(β_*)=1/d and shows that for 0<β≤β_* the uniform density is the unique global minimizer up to the linear-stability threshold K_#^{(d)}(β)=β^{d/2}/(2^{d/2}Γ(d/2)I_{d/2}(β)), yielding a continuous phase transition, while for β>β_* the uniform density is not globally minimizing at K_#, so the critical coupling satisfies K_c<K_# and the transition is discontinuous. The argument adapts the sharp Beckner-Onofri/logarithmic HLS inequality via the Funk-Hecke formula (producing explicit modified-Bessel eigenvalues) together with the sign of the degree-2 quartic Landau coefficient.
Significance. If the result holds, the manuscript supplies a complete, parameter-free characterization of the phase transition for the noisy transformer model in arbitrary dimension, directly generalizing the authors' d=2 analysis. The proof strategy, which combines the sharp sphere HLS inequality with explicit Bessel-coefficient computations and a quartic obstruction, avoids higher-mode stability analysis in the continuous regime and furnishes an explicit descent direction in the discontinuous regime; this is a substantive technical advance.
minor comments (2)
- [Abstract] Abstract, displayed equation for K_#^{(d)}(β): the linear-stability threshold is stated without a one-sentence reminder that it arises from the vanishing of the l=1 or l=2 eigenvalue of the linearized operator; a parenthetical reference to the relevant Funk-Hecke mode would improve readability.
- [Abstract] The proof-strategy paragraph in the abstract invokes the 'degree-two quartic obstruction' without naming the spherical-harmonic degree; a brief clause identifying it as the l=2 mode would align the abstract with the later technical development.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment of the significance of the result, and recommendation to accept the manuscript.
Circularity Check
No significant circularity detected
The derivation defines β_*^{(d)} via the standard Bessel ratio equation I_{d/2+1}(β)/I_{d/2}(β)=1/d, which is an independent mathematical fact about modified Bessel functions rather than a fitted or self-referential quantity. The global-minimizer dichotomy is established by adapting the external sharp Beckner-Onofri/logarithmic HLS inequality on the sphere (via Funk-Hecke formula yielding explicit eigenvalues) and computing the sign of the degree-2 quartic coefficient, which changes at the same β_* by direct expansion. The sole self-citation is to the authors' d=2 work for generalization context; the general-d proof is self-contained against the cited external inequalities and does not reduce any central claim to a self-citation chain, definition, or fitted input. No load-bearing step matches the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (3)
- standard math The sharp Beckner-Onofri/logarithmic Hardy-Littlewood-Sobolev inequality holds on the sphere
- standard math The Funk-Hecke formula computes the spherical harmonic coefficients of the interaction kernel
- standard math Modified Bessel functions of the first kind satisfy the stated ratio equation at the critical β_*
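As a hedged sketch of what the Funk-Hecke step produces, assume the interaction kernel is the exponential $e^{\beta\langle x,y\rangle}$ on $S^{d-1}$ (the text above does not write the kernel out, so this is an assumption). For a spherical harmonic $Y_\ell$ of degree $\ell$ and the unnormalized surface measure $\sigma$, the standard evaluation is:

```latex
\begin{equation*}
\int_{S^{d-1}} e^{\beta\langle x,y\rangle}\,Y_\ell(y)\,d\sigma(y)
  = (2\pi)^{d/2}\,\beta^{1-d/2}\,I_{\ell+d/2-1}(\beta)\,Y_\ell(x).
\end{equation*}
% Dividing by |S^{d-1}| = 2\pi^{d/2}/\Gamma(d/2) gives the normalized coefficients
\begin{equation*}
\hat\lambda_\ell(\beta) = 2^{d/2-1}\,\Gamma(d/2)\,\beta^{1-d/2}\,I_{\ell+d/2-1}(\beta).
\end{equation*}
% In this notation the stated threshold and the defining equation for \beta_* read
\begin{equation*}
K_\#^{(d)}(\beta) = \frac{\beta}{2\,\hat\lambda_1(\beta)},
\qquad
\frac{\hat\lambda_2(\beta_*^{(d)})}{\hat\lambda_1(\beta_*^{(d)})} = \frac1d.
\end{equation*}
```

Under that assumption, $\ell=1$ yields the Bessel order $d/2$ and $\ell=2$ yields $d/2+1$, which matches the orders appearing in the stated formulas. The rewriting of $K_\#^{(d)}$ is pure algebra on the stated formula; reading it as the point where the $\ell=1$ mode loses stability is an interpretation the source does not confirm.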
Reference graph
Works this paper leans on
[1] Juan A. Acebrón, L. L. Bonilla, Conrad J. Pérez Vicente, Félix Ritort, and Renato Spigler. The Kuramoto model: A simple paradigm for synchronization phenomena. Reviews of Modern Physics, 77(1):137--185, 2005. doi:10.1103/RevModPhys.77.137
[2] Kendall Atkinson and Weimin Han. Spherical Harmonics and Approximations on the Unit Sphere: An Introduction, volume 2044 of Lecture Notes in Mathematics. Springer, 2012
[3] Krishnakumar Balasubramanian, Sayan Banerjee, and Philippe Rigollet. On the structure of stationary solutions to McKean--Vlasov equations with applications to noisy transformers. arXiv preprint, 2025. arXiv:2510.20094
[4] William Beckner. Sharp Sobolev inequalities on the sphere and the Moser--Trudinger inequality. Annals of Mathematics, 138(1):213--242, 1993. doi:10.2307/2946638
[5] Martin Burger, Samira Kabri, Yury Korolev, Tim Roith, and Lukas Weigand. Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 383(2298):20240233, 2025. doi:10.1098/rsta.2024.0233
[6] José A. Carrillo, Rishabh S. Gvalani, Grigorios A. Pavliotis, and André Schlichting. Long-time behaviour and phase transitions for the McKean--Vlasov equation on a torus. Archive for Rational Mechanics and Analysis, 235(1):635--690, 2020. doi:10.1007/s00205-019-01430-4
[7] Eric A. Carlen and Michael Loss. Competing symmetries, the logarithmic HLS inequality and Onofri's inequality on S^n. Geometric and Functional Analysis, 2(1):90--104, 1992. doi:10.1007/BF01895706
[8] Lincoln Chayes and Vladislav Panferov. The McKean--Vlasov equation in finite volume. Journal of Statistical Physics, 138(1--3):351--380, 2010. doi:10.1007/s10955-009-9913-z
[9] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018
[10] Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, and Tanmoy Chakraborty. Redesigning the Transformer architecture with insights from multi-particle dynamical systems. Advances in Neural Information Processing Systems, 34:5531--5544, 2021
[11] NIST Digital Library of Mathematical Functions. https://dlmf.nist.gov/, Release 1.2.6 of 2026-03-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain, eds
[12] M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1--47, 1975. doi:10.1002/cpa.3160280102
[13] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1--11, 2017. doi:10.1007/s40304-017-0103-z
[14] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Advances in Neural Information Processing Systems 36, pages 57026--57037, 2023. arXiv:2305.05465
[15] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427--479, 2025. doi:10.1090/bull/1863
[16] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017. doi:10.1088/1361-6420/aa9a90
[17] Yoshiki Kuramoto. Self-entrainment of a population of coupled non-linear oscillators. In International Symposium on Mathematical Problems in Theoretical Physics (Kyoto Univ., Kyoto, 1975), volume 39 of Lecture Notes in Physics, pages 420--422. Springer, Berlin-New York, 1975
[18] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. Understanding and improving Transformer from a multi-particle dynamic system point of view. In International Conference on Learning Representations, 2020
[19] N. A. Lebedev and I. M. Milin. An inequality. Vestnik Leningrad. Univ., 20(19):157--158, 1965
[20] Kyunghoo Mun and Matthew Rosenzweig. Phase transitions and linear stability for the mean-field Kuramoto--Daido model. arXiv preprint, 2026. arXiv:2602.14954
[21] Kyunghoo Mun and Matthew Rosenzweig. Phase transitions in Doi--Onsager, noisy transformer, and other multimodal models. arXiv preprint, 2026. arXiv:2604.16288
[22] Philippe Rigollet. The mean-field dynamics of transformers. arXiv preprint, 2025. arXiv:2512.01868
[23] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515--3530. PMLR, 2022
[24] Javier Segura. Monotonicity properties for ratios and products of modified Bessel functions and sharp trigonometric bounds. Results in Mathematics, 76:221, 2021. doi:10.1007/s00025-021-01531-1
[25] Anna Shalova and André Schlichting. Solutions of stationary McKean--Vlasov equation on a high-dimensional sphere and other Riemannian manifolds. Advances in Nonlinear Analysis, 15(1):20250141, 2026. doi:10.1515/anona-2025-0141
[26] Steven H. Strogatz. From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D: Nonlinear Phenomena, 143(1--4):1--20, 2000. doi:10.1016/S0167-2789(00)00094-4
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998--6008, 2017. arXiv:1706.03762