Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision
Pith reviewed 2026-05-20 22:27 UTC · model grok-4.3
The pith
Möbius attractor plus cascade supervision lets gradient descent locate equal-weight superposition states in transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under S_n-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. Cascade supervision supplies selectivity bootstrap, gradient persistence across depth, and per-step discrimination, while end-to-end supervision produces internal gradients at layer c that decay as (np)^{-(D-c-2)/2} and stall before the manifold is reached. The parameter-free decay law predicts a final-step cosine similarity of 0.35 for end-to-end supervision versus 0.71 for cascade supervision at depth D=3; the measured values are 0.37 and 0.69.
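To make the decay law concrete, here is a minimal sketch that evaluates the factor (np)^{-(D-c-2)/2} layer by layer. The function name, the fan-out value np = 4, and the depth D = 6 are illustrative assumptions rather than values from the paper, and this summary does not state the range of c over which the law is meant to hold.

```python
def end_to_end_gradient_scale(fanout: float, depth: int, layer: int) -> float:
    """Evaluate (np)^{-(D-c-2)/2}, the stated scaling of the internal
    gradient at layer c under end-to-end supervision.

    fanout is the expected fan-out n*p, depth is D, layer is c.
    """
    if fanout <= 0:
        raise ValueError("fanout n*p must be positive")
    return fanout ** (-(depth - layer - 2) / 2)


# Hypothetical setting (assumed values, not taken from the paper).
FANOUT = 4.0
DEPTH = 6
for layer in range(DEPTH - 1):  # c = 0, ..., D-2
    scale = end_to_end_gradient_scale(FANOUT, DEPTH, layer)
    print(f"layer c={layer}: relative gradient scale {scale:.4f}")
```

With these assumed values the factor is 4^{-2} = 0.0625 at c = 0 and exactly 1 at c = D-2, which illustrates why the shallowest layers receive the weakest signal under end-to-end supervision.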
What carries the argument
The Möbius attractor: the 1D map obtained by reducing permutation-symmetric layer dynamics to a single scalar, whose zero set contains the equal-weight superposition state.
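The 1D map can be made concrete with a short sketch. It assumes the form quoted in the paper's Appendix F excerpt (reproduced in the reference graph further down this page), α_eff,c = (αA + m_f B)/(A + √k_c B) with τ := A/B. The function names and all numeric values are illustrative assumptions.

```python
import math


def mobius_from_scalars(alpha, a_scalar, b_scalar, m_f, k_c):
    """alpha_eff = (alpha*A + m_f*B) / (A + sqrt(k_c)*B)."""
    denominator = a_scalar + math.sqrt(k_c) * b_scalar
    if denominator == 0:
        raise ZeroDivisionError("pole of the map: A + sqrt(k_c)*B = 0")
    return (alpha * a_scalar + m_f * b_scalar) / denominator


def mobius_from_tau(alpha, tau, m_f, k_c):
    """Same map in the coordinate tau = A/B:
    alpha_eff = (alpha*tau + m_f) / (tau + sqrt(k_c))."""
    denominator = tau + math.sqrt(k_c)
    if denominator == 0:
        raise ZeroDivisionError("pole of the map: tau = -sqrt(k_c)")
    return (alpha * tau + m_f) / denominator


# Assumed toy values; B < 0 mirrors the b'_1 < 0 regime in the excerpt.
ALPHA, A, B, M_F, K_C = 0.5, 1.5, -0.1, 6.0, 4.0
print(mobius_from_scalars(ALPHA, A, B, M_F, K_C))
print(mobius_from_tau(ALPHA, A / B, M_F, K_C))
# At tau = 0 (that is, A = 0) the value no longer depends on alpha.
print(mobius_from_tau(ALPHA, 0.0, M_F, K_C))
```

The two parameterizations agree because dividing numerator and denominator by B leaves the ratio unchanged. At τ = 0 the map evaluates to m_f/√k_c whatever the value of α, which is consistent with the excerpt of Appendix G naming {A=0} as the optimum, though the excerpt is truncated before it states the optimal value.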
If this is right
- Cascade supervision maintains gradient magnitude across depth while end-to-end losses do not.
- The equal-weight superposition state lies on a codimension-one manifold of global optima reachable by gradient descent only when the Möbius attractor is present.
- Parameter-free predictions of cosine similarity match experiment within 0.02 at every step for both supervision regimes.
- Superposition enables a fixed-depth forward pass to carry the full reasoning frontier without serial token unrolling.
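The cosine similarities quoted in the list above compare a hidden state with the equal-weight superposition target. As a minimal sketch, assuming orthonormal (here one-hot) node embeddings and the target form z*_c = k_c^{-1/2} Σ_{v∈V_c} u_v quoted in the paper's appendix excerpt, the measurement could look like the following. The helper names and the toy vectors are hypothetical.

```python
import math


def equal_weight_target(embeddings, reached):
    """Return k^{-1/2} times the sum of the embeddings of the reached nodes,
    where k is the number of reached nodes."""
    if not reached:
        raise ValueError("reached set must be non-empty")
    dim = len(embeddings[0])
    total = [0.0] * dim
    for v in reached:
        for i, x in enumerate(embeddings[v]):
            total[i] += x
    scale = 1.0 / math.sqrt(len(reached))
    return [scale * x for x in total]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy setup (assumed): six nodes with one-hot embeddings, four reached.
NUM_NODES = 6
one_hot = [[1.0 if i == j else 0.0 for j in range(NUM_NODES)]
           for i in range(NUM_NODES)]
target = equal_weight_target(one_hot, [0, 1, 2, 3])
print(cosine(target, target))      # state equal to the target
print(cosine(one_hot[0], target))  # state concentrated on a single node
```

With orthonormal embeddings the k^{-1/2} factor makes the target a unit vector, and a state that puts all its mass on one of the four reached nodes scores a cosine of 0.5 against it.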
Where Pith is reading between the lines
- If the attractor persists on graphs that violate the tree assumption, superposition may appear in a wider class of relational tasks.
- The same supervision pattern could be ported to other sequence models to test whether Möbius-like maps arise outside transformers.
- Measuring gradient norms layer by layer on non-symmetric inputs would quantify how far the symmetry assumption can be relaxed before the attractor disappears.
Load-bearing premise
The network stays inside the tree regime where S_n symmetry collapses the full dynamics to the one-dimensional Möbius map.
What would settle it
Train the same architecture on Erdős-Rényi graphs of increasing density until the tree approximation breaks and measure whether intermediate-layer cosines still follow the predicted Möbius trajectory.
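A first step of such an experiment is to check where the tree approximation |V_c| ≈ (np)^c starts to fail. The sketch below samples G(n, p), runs breadth-first search from a root, and prints the reached-set sizes next to (np)^c. It treats the graph as undirected and uses arbitrary values of n, p, and the seed; these are assumptions for illustration, not the paper's experimental setup.

```python
import random


def sample_er_graph(n, p, seed=0):
    """Sample an undirected Erdos-Renyi graph G(n, p) as adjacency lists."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def bfs_reach_sizes(adj, root, depth):
    """Return [|V_0|, ..., |V_depth|], where V_c is the set of nodes
    within c hops of the root."""
    visited = {root}
    frontier = [root]
    sizes = [1]
    for _ in range(depth):
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
        sizes.append(len(visited))
    return sizes


# Assumed toy parameters: n = 400 nodes, fan-out np = 4, depth 3.
N, P, DEPTH = 400, 0.01, 3
sizes = bfs_reach_sizes(sample_er_graph(N, P, seed=0), root=0, depth=DEPTH)
for c, size in enumerate(sizes):
    print(f"c={c}: |V_c|={size}, (np)^c={(N * P) ** c:.0f}")
```

Sweeping p upward at fixed n and watching the measured sizes fall behind (np)^c gives a rough indication of where collisions between branches end the tree regime.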
Original abstract
Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that superposition reasoning emerges in Transformers for reachability on Erdős-Rényi graphs when a Möbius attractor (arising from reducing layerwise dynamics to a 1D Möbius map under S_n-symmetry in the tree regime, whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state) is combined with Cascade Supervision (a loss class delivering selectivity bootstrap, gradient persistence, and per-step discrimination). End-to-end supervision is argued to be insufficient due to a provable gradient decay of (np)^{-(D-c-2)/2}, while the parameter-free decay law predicts final-step cosines of 0.35 (end-to-end) vs. 0.71 (cascade) at depth D=3, with experiments matching at 0.37 vs. 0.69 within 0.02 at every step.
Significance. If the symmetry reduction and attractor properties are rigorously derived and shown to be preserved, the work would supply a mechanistic account of how gradient descent locates superposition states amid permutation-symmetric saddles, advancing understanding of bounded-depth reasoning in transformers. The explicit parameter-free decay law and its close experimental match constitute a falsifiable prediction and a clear strength of the manuscript.
major comments (2)
- [Abstract (architectural contributions paragraph)] The central claim that under S_n-symmetry in the tree regime layerwise dynamics reduce exactly to a 1D Möbius map whose zero set forms a codimension-one manifold of global optima containing the equal-weight superposition state is load-bearing for the thesis that the Möbius attractor explains escape from saddles. No derivation is supplied showing that residual-stream updates preserve S_n symmetry or that finite n, p, or stochastic gradients keep trajectories on the manifold; if symmetry breaking occurs the 1D reduction fails and the attractor argument does not apply.
- [Abstract (supervision side)] The parameter-free decay law is presented as predicting the observed cosine gap and as independent of the supervision choice, yet the abstract provides no derivation details establishing that the law is obtained from first principles rather than reducing to quantities defined by the cascade loss itself; this circularity risk directly affects the claim that the law is predictive and that end-to-end supervision is provably insufficient.
minor comments (1)
- The abstract would be clearer if it stated the number of runs, dataset sizes, and whether error bars are shown for the reported cosine matches (0.37 vs. 0.69).
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. The two major comments identify important points where additional rigor and clarity would strengthen the manuscript. We address each below and commit to revisions that directly respond to the concerns while preserving the core claims.
Point-by-point responses
- Referee: [Abstract (architectural contributions paragraph)] The central claim that under S_n-symmetry in the tree regime layerwise dynamics reduce exactly to a 1D Möbius map whose zero set forms a codimension-one manifold of global optima containing the equal-weight superposition state is load-bearing for the thesis that the Möbius attractor explains escape from saddles. No derivation is supplied showing that residual-stream updates preserve S_n symmetry or that finite n, p, or stochastic gradients keep trajectories on the manifold; if symmetry breaking occurs the 1D reduction fails and the attractor argument does not apply.
  Authors: We agree that explicit preservation of S_n symmetry under residual-stream updates is essential for the 1D reduction to hold. Section 3 of the manuscript derives the reduction by showing that, in the tree regime, both the Erdős-Rényi sampling and the reachability objective are invariant under node permutations, so symmetric initializations remain symmetric under deterministic gradient steps. We acknowledge, however, that the current text only sketches the effect of finite n, p, and stochastic gradients rather than supplying quantitative bounds. We will add a new lemma with a perturbation analysis establishing that symmetry-breaking terms remain O(1/sqrt(n)) and that trajectories stay sufficiently close to the manifold for the Möbius attractor to govern the long-term dynamics.
  Revision: yes
- Referee: [Abstract (supervision side)] The parameter-free decay law is presented as predicting the observed cosine gap and as independent of the supervision choice, yet the abstract provides no derivation details establishing that the law is obtained from first principles rather than reducing to quantities defined by the cascade loss itself; this circularity risk directly affects the claim that the law is predictive and that end-to-end supervision is provably insufficient.
  Authors: The decay law (np)^{-(D-c-2)/2} is obtained by analyzing gradient magnitudes under end-to-end supervision alone; it follows from the repeated fan-out of the graph and does not invoke any property of Cascade Supervision. The main text (Section 4.1) contains the first-principles derivation. The abstract is necessarily concise, which may have created the impression of circularity. We will revise the abstract to include a one-sentence outline of the gradient-propagation argument and to state explicitly that the law applies to end-to-end supervision, thereby clarifying its independence from the cascade loss class.
  Revision: partial
Circularity Check
No significant circularity; derivation chain is self-contained
full rationale
The paper presents the Möbius attractor as a reduction of layerwise dynamics to a 1D map under the explicit S_n-symmetry assumption in the tree regime, with the zero set identified as containing the equal-weight superposition state. The decay law is stated as parameter-free and derived from the explicit gradient scaling expression (np)^{-(D-c-2)/2} for end-to-end supervision, yielding the specific numerical predictions 0.35 vs. 0.71 that are then compared to experimental outcomes. No quoted step equates a claimed prediction or first-principles result to its own fitted inputs or to a self-citation chain; the architectural and supervisional contributions are treated as independent inputs whose consequences are derived and externally validated. The central thesis therefore retains independent mathematical content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Under S_n-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map
- domain assumption End-to-end supervision fails gradient persistence condition (B)
invented entities (2)
- Möbius attractor: no independent evidence
- Cascade Supervision: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Gregor Bachmann and Vaishnavh Nagarajan
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/970f59b22f4c72aec75174aae63c7459-Paper-Conference.pdf. Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024. Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade,...
-
[2]
Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian
URL https://api.semanticscholar.org/CorpusID:274610816. Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond A*: Better planning with transformers via search dynamics bootstrapping. ArXiv, abs/2402.14083,
-
[3]
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma
URL https://api.semanticscholar.org/CorpusID:267782588. Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. ArXiv, abs/2402.12875, 2024. URL https://api.semanticscholar.org/CorpusID:267760184. Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Ki...
-
[4]
in the commutant basis; the Möbius scalars are $A := 1 + \eta a$ and $B := \eta b'_1$, with $\eta > 0$ a fixed post-MLP gain (definition in App. F). Cascade variables. $z_c \in S^{d-1}$: hidden state at depth $c$ ($z_0 = u_r$). The ideal target is $z^*_c := k_c^{-1/2} \sum_{v \in V_c} u_v$, and the frontier-only target is $z^*_{new} := m_f^{-1/2} \sum_{w \in V_{c+1} \setminus V_c} u_w$, with $z^*_c \perp z^*_{new}$ in the orthogonal regime. The post-attenti...
-
[5]
The space of $S_n$-equivariant linear maps $\mathbb{R}^n \to \mathbb{R}^n$ is the two-dimensional commutant $M_{\mathrm{inv}} = \mathrm{span}\{I_n, J_n\}$, where $J_n = \mathbf{1}\mathbf{1}^\top$
-
[6]
The space of $S_n$-fixed matrices in $\mathbb{R}^{n \times n}$ under simultaneous conjugation coincides with $M_{\mathrm{inv}}$
-
[7]
Any $S_n$-equivariant nonlinear map of the form $x \mapsto T_2\,\sigma(T_1 x + b_1) + b_2$ with $T_1, T_2 \in M_{\mathrm{inv}}$, biases $b_1, b_2 \in \mathrm{span}(\mathbf{1})$ and coordinate-wise $\sigma$ cannot inject mass into the $\ker T_1 \cap \ker T_2$ subspace; in particular no equivariant MLP can recover an embedding direction that the preceding linear stage projected to zero. Proof. Step 1: isotypic decomposition. The standard permu...
-
[8]
For any $t \in (0,1)$, $\mathbb{P}\bigl(|k_c - (np)^c| > t\,(np)^c\bigr) \le 2\exp\bigl(-t^2 (np)^c / 3\bigr)$
-
[9]
In particular, $|V_c| = (np)^c \bigl(1 \pm O(\sqrt{\log n / (np)^c})\bigr)$ with probability $1 - O(n^{-1})$, uniformly in $c \le D$. 4. $m_f^{(c)} = k_{c+1} - k_c = k_c (np - 1)(1 + o(1))$ in the tree regime, so $m_f^{(c)} = (np)^c (np - 1)\bigl(1 \pm O(\sqrt{\log n / (np)^c})\bigr)$. Proof. Step 1: Galton–Watson coupling. Couple BFS exploration of $G(n, p)$ from $r$ with a Galton–Watson tree of offspring $\mathrm{Binomial}(n-1, p)$, mean $np - p$. In the...
-
[10]
$\sum_{j \notin S} \sigma_j \le (K - |S|)\, e^{-\Delta} / |S|$
-
[11]
For $i \in S$ and $k \notin S$, $|\partial \sigma_i / \partial a_k| = \sigma_i \sigma_k \le e^{-\Delta} / |S|^2$
-
[12]
Restricted to perturbations $\delta a$ with $\delta a|_S = 0$, the softmax Jacobian has spectral norm $O(e^{-\Delta})$. This implies that once selectivity $S_c \to 1$ at step $c$, the gradient pulled back through the softmax decays like $e^{-\beta R_{\min}}$, locking the cascade variable $\gamma_c$ in place against perturbations supported on non-frontier edges. Proof. (1) From $a_i \ge a_j + \Delta$ for any $i \in S$, $j \notin S$,...
-
[13]
The two-term decomposition matches the bias-variance analysis of §6
from 0.91 to 0.27 with $S_1 \le 0.18$: both the structural ceiling of Step 2 and the secondary collapse of selectivity (logit-gap-from-residual) contribute. The two-term decomposition matches the bias-variance analysis of §6. Remark. The proof is independent of $\alpha$, $\eta$, and the selectivity ladder of Thm. 5: the $\sqrt{m_f^{(c)} / k_{c+1}}$ ceiling holds even at infinite selecti...
-
[14]
For $b'_1 < 0$ (forward-invariant by Lem
Then $h_{old} = \eta\,\sigma(p_{old}) + \zeta\,\sigma(b'_1)$, $h_{new} = \eta\,\sigma(p_{new}) + \zeta\,\sigma(b'_1)$. For $b'_1 < 0$ (forward-invariant by Lem. 30), σ(b′
-
[15]
= 0 and the $\zeta$ contribution vanishes. The ReLU-active conditions split naturally into two: (H3a) $a / \sqrt{k_c} > |b'_1|$ ($z^*_c$-channel), and (H3b) $a\,\alpha / m_f^{(c)} > |b'_1|$ ($z^*_{new}$-channel), for $c = 0, \dots, D-1$. We close them a priori, without invoking any post-hoc value of $(\alpha, a, b'_1)$, by a two-phase argument: (H3a) holds at $\tau = 0$ and is forward-invariant for any standard in...
-
[16]
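Defining $A := 1 + \eta a$ and $B := \eta b'_1$, $\alpha_{eff,c} = \dfrac{\alpha A + m_f^{(c)} B}{A + \sqrt{k_c}\, B}$
$= \dfrac{\alpha (1 + \eta a) + m_f^{(c)} \eta b'_1}{(1 + \eta a) + \sqrt{k_c}\, \eta b'_1}$. Defining $A := 1 + \eta a$ and $B := \eta b'_1$, $\alpha_{eff,c} = \dfrac{\alpha A + m_f^{(c)} B}{A + \sqrt{k_c}\, B}$. This is a Möbius transform of the projective coordinate $\tau := A/B \in \mathbb{CP}^1$; equivalently $\alpha_{eff,c}(\tau) = (\alpha \tau + m_f^{(c)}) / (\tau + \sqrt{k_c})$. Tightness and remarks. The reduction is exact under (R1)–(R5) and the ReLU-active condition; relaxing the latter (i.e. b′ 1 ne...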
-
[17]
= 0); it re-appears as a bifurcation parameter when $b'_1$ crosses zero, but no observed trajectory crosses this boundary (Lem. 30). G Proof of Prop. 3: $\{A = 0\}$ is the unique optimum. Setup. With $\alpha_{eff,c}$ in Möbius form (App. F), the saturated cosine objective is $L = \sum_{c=0}^{D-1} \bigl(1 - \cos_c(\alpha_{eff,c})\bigr)$, where the per-step cosine attains its unique maximum 1 at $\alpha^*_{eff,c} := m_f^{(c)}$...
-
[18]
$= -g(\tau)\, t(\tau) / b'_1(\tau)$ where $g := \partial \tilde{L} / \partial t > 0$ (Lem. 29). With $b'_1 < 0$, $t > 0$, $g > 0$, the right-hand side has the sign of $-g t / b'_1 > 0$, so $\dot{b}'_1 > 0$, meaning $b'_1$ increases towards 0 from below. However the rate vanishes near zero: as $b'_1 \to 0^-$, $|\dot{b}'_1| = g t / |b'_1| \to \infty$ but the $t$-flow simultaneously drives $t \to 0$ (Lem. 31); a Nagumo tangent-cone argument on the clos...
-
[19]
This establishes a descent rate $\dot{V}|_{\text{Stage 1}} \le -\kappa_A V$ at the saddle
Stage 1 (selectivity bootstrap) used a constant $\kappa_A > 0$, supplied by condition (A). This establishes a descent rate $\dot{V}|_{\text{Stage 1}} \le -\kappa_A V$ at the saddle
-
[20]
This contributes a multiplicative factor $R_c^2 \ge \mathrm{poly}(D)^{-2}$ to the descent rate
Stage 2–3 (ladder + error contraction) used the gradient sustenance $R_c = \Theta(1)$, supplied by condition (B), with the additional polynomial overhead $\mathrm{poly}(D)$ from condition (B)'s lower bound. This contributes a multiplicative factor $R_c^2 \ge \mathrm{poly}(D)^{-2}$ to the descent rate
-
[21]
This contributes a final multiplicative factor $I_{\min} \lambda_\perp$ to the contraction near the Möbius variety
Stage 4 (Möbius locking) used the Hessian gap $\lambda_\perp \cdot I_{\min}$, supplied by condition (C) via the Fisher-info-induced curvature lower bound. This contributes a final multiplicative factor $I_{\min} \lambda_\perp$ to the contraction near the Möbius variety. Composing the three Lyapunov estimates (multiplicativity holds because each stage operates on a disjoint coordinate block: ...
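The softmax concentration bound excerpted in entry [10] above can be checked numerically. The sketch below reads the bound as Σ_{j∉S} σ_j ≤ (K−|S|) e^{−Δ}/|S| whenever every logit in S exceeds every logit outside S by at least Δ. That reading of the excerpt, the helper names, and the randomly drawn logits are all assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]


def off_support_mass(logits, support):
    """Total softmax probability assigned to indices outside `support`."""
    probs = softmax(logits)
    return sum(p for j, p in enumerate(probs) if j not in support)


def tail_bound(num_logits, support_size, gap):
    """(K - |S|) * exp(-gap) / |S|."""
    return (num_logits - support_size) * math.exp(-gap) / support_size


# Assumed toy instance: K = 10 logits, support S = {0, 1, 2}, gap 3.
rng = random.Random(0)
K, GAP = 10, 3.0
SUPPORT = {0, 1, 2}
# Support logits lie in [GAP, GAP + 1), the rest in (-1, 0],
# so every support logit beats every other logit by at least GAP.
logits = [GAP + rng.random() if j in SUPPORT else -rng.random()
          for j in range(K)]
mass = off_support_mass(logits, SUPPORT)
bound = tail_bound(K, len(SUPPORT), GAP)
print(f"off-support mass {mass:.5f} <= bound {bound:.5f}: {mass <= bound}")
```

The inequality follows because the softmax normalizer is at least |S| e^{a_j + Δ} for any j outside S, so each off-support probability is at most e^{−Δ}/|S|.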