Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

Hongyu Gu; Jingwen Fu

arxiv: 2605.18820 · v1 · pith:HVECTRPGnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 22:27 UTC · model grok-4.3

keywords superposition · transformers · graph reachability · Möbius attractor · cascade supervision · gradient descent · reasoning frontiers · Erdős-Rényi graphs

The pith

Möbius attractor plus cascade supervision lets gradient descent locate equal-weight superposition states in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transformers can learn to maintain an entire frontier of reasoning paths simultaneously inside a single residual stream. It isolates an architectural mechanism, the Möbius attractor, that reduces symmetric layer dynamics to a one-dimensional map whose stable points include the desired equal-weight superposition. It also isolates a supervision regime, cascade supervision, whose backward pass supplies the gradient persistence that end-to-end losses lack. Experiments on reachability over Erdős-Rényi graphs confirm that the combination produces the predicted alignment with the target superposition state while end-to-end training does not.
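
As a minimal sketch of what "equal-weight superposition" means here, the snippet below builds the target state z*_c = k_c^{-1/2} Σ_{v∈V_c} u_v (the form given in the paper's notation) for a toy frontier. It assumes the node embeddings u_v are exactly orthonormal and uses standard basis vectors for them; the function names and the toy node sets are hypothetical and chosen only for illustration.

```python
import math

def superposition_state(frontier, n):
    """Equal-weight superposition z*_c = k_c^{-1/2} * sum_{v in V_c} u_v.

    Assumption: node embeddings u_v are the standard basis of R^n,
    so they are exactly orthonormal.
    """
    k = len(frontier)
    z = [0.0] * n
    for v in frontier:
        z[v] = 1.0 / math.sqrt(k)
    return z

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Toy example: 8 nodes, reachable set V_1 = {0, 3, 5} nested inside V_2.
z1 = superposition_state({0, 3, 5}, n=8)
z2 = superposition_state({0, 1, 3, 5, 6, 7}, n=8)

norm_z1 = math.sqrt(dot(z1, z1))   # unit norm by construction
overlap = dot(z1, z2)              # equals sqrt(k_1 / k_2) for nested sets
```

Under these assumptions every reachable node carries the same weight, and the cosine between the targets at two consecutive depths depends only on the ratio of the set sizes.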

Core claim

Under S_n-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. Cascade supervision supplies selectivity bootstrap, gradient persistence across depth, and per-step discrimination, while end-to-end supervision produces internal gradients that decay as (np)^{-(D-c-2)/2} and stall before the manifold is reached. The parameter-free decay law therefore predicts final-step cosine similarity of 0.35 versus 0.71 at depth D=3; measured values are 0.37 versus 0.69.
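
To get a feel for the stated scaling, the sketch below simply evaluates the factor (np)^{-(D-c-2)/2} layer by layer. The fan-out np = 4 and depth D = 6 are hypothetical values, not settings from the paper, and the summary does not give the constants or the steps that turn this scaling into the quoted cosine predictions, so the code illustrates the exponent only.

```python
def e2e_gradient_scale(np_fanout, depth, layer):
    """Scaling factor (np)^{-(D-c-2)/2} from the stated decay law.

    np_fanout: expected graph fan-out n*p (assumed value, not from the paper)
    depth:     total depth D
    layer:     layer index c
    """
    return np_fanout ** (-(depth - layer - 2) / 2)

# Hypothetical setting: fan-out np = 4, depth D = 6.
scales = {c: e2e_gradient_scale(4.0, 6, c) for c in range(5)}
for c, s in scales.items():
    print(f"layer c={c}: scale {s:.4f}")
```

With these illustrative numbers the factor shrinks by sqrt(np) = 2 for each layer further from the output, which is the geometric attenuation the core claim attributes to end-to-end supervision.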

What carries the argument

The Möbius attractor: the 1D map obtained by reducing permutation-symmetric layer dynamics to a single scalar whose fixed points include the equal-weight superposition state.
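
The extracted appendix text gives the map explicitly: with A := 1 + ηa and B := ηb′_1, α_eff,c = (αA + m_f^{(c)} B)/(A + √k_c B), or equivalently (ατ + m_f^{(c)})/(τ + √k_c) in the coordinate τ := A/B. The following is a small numerical sketch of that formula as quoted; all parameter values are hypothetical, and the check at A = 0 is plain algebra on the quoted expression rather than a reproduction of the paper's proof.

```python
import math

def alpha_eff(alpha, a, b1, eta, m_f, k_c):
    """alpha_eff,c = (alpha*A + m_f*B) / (A + sqrt(k_c)*B), A = 1 + eta*a, B = eta*b1."""
    A = 1.0 + eta * a
    B = eta * b1
    return (alpha * A + m_f * B) / (A + math.sqrt(k_c) * B)

def alpha_eff_tau(alpha, tau, m_f, k_c):
    """Same map written in the projective coordinate tau = A / B."""
    return (alpha * tau + m_f) / (tau + math.sqrt(k_c))

# Hypothetical values, chosen only for illustration.
alpha, eta, b1, m_f, k_c = 0.7, 1.0, -0.5, 12.0, 4.0

# Generic point: the two forms agree.
a = 0.3
tau = (1.0 + eta * a) / (eta * b1)
generic_gap = abs(alpha_eff(alpha, a, b1, eta, m_f, k_c)
                  - alpha_eff_tau(alpha, tau, m_f, k_c))

# On the set A = 0 (that is, a = -1/eta) the value no longer depends on alpha.
on_manifold = alpha_eff(alpha, -1.0 / eta, b1, eta, m_f, k_c)
```

In this sketch the value on A = 0 is m_f/√k_c regardless of α, which is consistent with the description of {A = 0} as a codimension-one set: one scalar condition fixes the effective coefficient while the remaining parameters stay free.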

If this is right

Cascade supervision maintains gradient magnitude across depth while end-to-end losses do not.
The equal-weight superposition state lies on a codimension-one manifold of global optima reachable by gradient descent only when the Möbius attractor is present.
Parameter-free predictions of cosine similarity match experiment within 0.02 at every depth for both supervision regimes.
Superposition enables a fixed-depth forward pass to carry the full reasoning frontier without serial token unrolling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the attractor persists on graphs that violate the tree assumption, superposition may appear in a wider class of relational tasks.
The same supervision pattern could be ported to other sequence models to test whether Möbius-like maps arise outside transformers.
Measuring gradient norms layer by layer on non-symmetric inputs would quantify how far the symmetry assumption can be relaxed before the attractor disappears.

Load-bearing premise

The network stays inside the tree regime where S_n symmetry collapses the full dynamics to the one-dimensional Möbius map.
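
The symmetry in question can be checked concretely. The extracted text states that the S_n-fixed matrices under simultaneous conjugation are exactly M_inv = span{I_n, J_n}, with J_n the all-ones matrix. The sketch below verifies one direction numerically for a single permutation and shows that a small perturbation of one entry breaks the invariance; the helper names, the matrix size, and the perturbation size are hypothetical.

```python
def conjugate(M, perm):
    """Return P M P^T for the permutation matrix P that sends index i to perm[i]."""
    n = len(M)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = M[i][j]
    return out

def commutant_matrix(a, b, n):
    """Build a*I_n + b*J_n, where J_n is the all-ones matrix."""
    return [[a * (i == j) + b for j in range(n)] for i in range(n)]

def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

n = 5
perm = [(i + 1) % n for i in range(n)]  # cyclic relabeling of the nodes

M = commutant_matrix(2.0, -0.3, n)
symmetric_gap = max_abs_diff(conjugate(M, perm), M)  # invariant under relabeling

# Perturbing a single off-diagonal entry breaks the invariance.
M_pert = [row[:] for row in M]
M_pert[0][1] += 0.01
perturbed_gap = max_abs_diff(conjugate(M_pert, perm), M_pert)
```

The perturbed case is the situation the premise rules out: once the weights leave span{I_n, J_n}, relabeling the nodes changes the matrix and a reduction to two scalars is no longer exact.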

What would settle it

Train the same architecture on Erdős-Rényi graphs of increasing density until the tree approximation breaks and measure whether intermediate-layer cosines still follow the predicted Möbius trajectory.
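
One ingredient of such a test is knowing when the tree approximation breaks. The sketch below samples G(n, p) at several densities and compares the number of nodes within BFS distance c of a root, k_c, with the tree-regime value (np)^c quoted in the extracted lemmas. It uses only the standard library; the graph size, the densities, and the function names are hypothetical, and it does not train a model or measure any cosine.

```python
import random
from collections import deque

def sample_er_graph(n, p, rng):
    """Undirected Erdos-Renyi graph G(n, p) as adjacency lists."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def reachable_counts(adj, root, depth):
    """k_c = number of nodes within BFS distance c of root, for c = 0..depth."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:
            continue  # do not expand beyond the requested depth
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [sum(1 for d in dist.values() if d <= c) for c in range(depth + 1)]

rng = random.Random(0)
n, depth = 500, 3
for fanout in (2.0, 4.0, 8.0):  # hypothetical densities np
    adj = sample_er_graph(n, fanout / n, rng)
    k = reachable_counts(adj, root=0, depth=depth)
    tree_value = [fanout ** c for c in range(depth + 1)]
    print(fanout, k, tree_value)
```

Since k_c can never exceed n, choosing np and D so that (np)^D approaches or exceeds n is one simple way to push the graph out of the tree regime for this kind of experiment.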

Figures

Figures reproduced from arXiv: 2605.18820 by Hongyu Gu, Jingwen Fu.

**Figure 1.** From generic init to IDEAL: four-stage attractor. Stage 0: W_emb random, z_c spread over the vocabulary. Stage 1: the diagonal advantage |g_diag|/|g_off| = Θ(n^{3/2}/p) (Thm. 4) contracts W_emb onto M_inv = span{I_n, J_n}. Stage 2: the Möbius scalar A = 1 + ηa collapses, defining the codim-1 optimum manifold M = {A = 0}. Stage 3: cascade gradient flow on M drives z_c to z*_c = k_c^{-1/2} Σ_{v∈V_c} u_v.

**Figure 2.** Cascade attractor trajectory. Each panel: r(c) = cos_obs(z_c, z*_c)/cos_theory vs. epoch for c = 1, 2, 3 under three losses (D=3, n=50, d=64, 400 epochs). r = 1 is the IDEAL fixed point (Thm. 5); the green band marks r ≥ 1 − 1/(2k_{D−1}). Both cascade losses (L_sup, L_node) drive every depth into the band; L_e2e stalls at r ≈ 0.5 for c ≥ 2, matching the super-polynomial decay of Prop. 7.

**Figure 3.** Inner-product fingerprint at the trained fixed point (D=3, n=50, d=64). (a) Observed cos(z_c, z*_c) at depths c = 1, 2, 3 for all three losses L_e2e/L_node/L_sup (coloured bars) against the single-step on-manifold upper bound at the trained α_eff,c (black ticks). (b) Selectivity ratio r_c = cos_obs/cos_theory for the same three losses; r_c → 1 matches the IDEAL superposition, r_c ≪ 1 a selectivity collapse. L_sup trac…

Original abstract

Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows gradient descent can reach equal-weight superposition on ER graph reachability via a claimed Möbius attractor and cascade supervision, with a parameter-free decay law that matches experiments to within 0.02.

The main point is that this work closes the gap left by Zhu et al. by giving a dynamical-systems account of how gradient descent locates the equal-weight frontier superposition state instead of getting stuck at permutation-symmetric saddles. They isolate an architectural piece they call the Möbius attractor, where S_n symmetry in the tree regime reduces layer dynamics to a 1D map whose zero set includes the target state, plus cascade supervision that supplies selectivity, gradient persistence, and per-step discrimination. End-to-end supervision is shown to fail on persistence because internal gradients decay as (np)^{-(D-c-2)/2} and stall short of the manifold. The parameter-free law predicts final-step cosines of 0.35 versus 0.71 at depth 3, and the reported runs hit 0.37 versus 0.69, staying within 0.02 at every step. That level of quantitative agreement is the strongest part of the evidence so far.

The soft spot is the symmetry reduction itself. The abstract treats the collapse to a 1D Möbius map as exact under the tree regime, but the stress-test concern is reasonable: finite ER graphs plus stochastic gradients could introduce small perturbations that break S_n symmetry and push trajectories off the manifold, weakening the attractor guarantee. Without the full derivation steps or checks for symmetry preservation, it is hard to judge how robust the reduction is. No error bars or dataset sizes appear in the summary either, though that is secondary.

This is aimed at people working on emergent reasoning and superposition in transformers or graph models who want a mechanistic story rather than hand-crafted features. It has enough structure and experimental alignment to merit a serious referee, even if the symmetry claim will draw questions. I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that superposition reasoning emerges in Transformers for reachability on Erdős-Rényi graphs when a Möbius attractor (arising from reducing layerwise dynamics to a 1D Möbius map under S_n-symmetry in the tree regime, whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state) is combined with Cascade Supervision (a loss class delivering selectivity bootstrap, gradient persistence, and per-step discrimination). End-to-end supervision is argued to be insufficient due to a provable gradient decay of (np)^{-(D-c-2)/2}, while the parameter-free decay law predicts final-step cosines of 0.35 (end-to-end) vs. 0.71 (cascade) at depth D=3, with experiments matching at 0.37 vs. 0.69 within 0.02 at every step.

Significance. If the symmetry reduction and attractor properties are rigorously derived and shown to be preserved, the work would supply a mechanistic account of how gradient descent locates superposition states amid permutation-symmetric saddles, advancing understanding of bounded-depth reasoning in transformers. The explicit parameter-free decay law and its close experimental match constitute a falsifiable prediction and a clear strength of the manuscript.

major comments (2)

[Abstract (architectural contributions paragraph)] The central claim that under S_n-symmetry in the tree regime layerwise dynamics reduce exactly to a 1D Möbius map whose zero set forms a codimension-one manifold of global optima containing the equal-weight superposition state is load-bearing for the thesis that the Möbius attractor explains escape from saddles. No derivation is supplied showing that residual-stream updates preserve S_n symmetry or that finite n, p, or stochastic gradients keep trajectories on the manifold; if symmetry breaking occurs the 1D reduction fails and the attractor argument does not apply.
[Abstract (supervision side)] The parameter-free decay law is presented as predicting the observed cosine gap and as independent of the supervision choice, yet the abstract provides no derivation details establishing that the law is obtained from first principles rather than reducing to quantities defined by the cascade loss itself; this circularity risk directly affects the claim that the law is predictive and that end-to-end supervision is provably insufficient.

minor comments (1)

The abstract would be clearer if it stated the number of runs, dataset sizes, and whether error bars are shown for the reported cosine matches (0.37 vs. 0.69).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments identify important points where additional rigor and clarity would strengthen the manuscript. We address each below and commit to revisions that directly respond to the concerns while preserving the core claims.

Referee: [Abstract (architectural contributions paragraph)] The central claim that under S_n-symmetry in the tree regime layerwise dynamics reduce exactly to a 1D Möbius map whose zero set forms a codimension-one manifold of global optima containing the equal-weight superposition state is load-bearing for the thesis that the Möbius attractor explains escape from saddles. No derivation is supplied showing that residual-stream updates preserve S_n symmetry or that finite n, p, or stochastic gradients keep trajectories on the manifold; if symmetry breaking occurs the 1D reduction fails and the attractor argument does not apply.

Authors: We agree that explicit preservation of S_n symmetry under residual-stream updates is essential for the 1D reduction to hold. Section 3 of the manuscript derives the reduction by showing that, in the tree regime, both the Erdős-Rényi sampling and the reachability objective are invariant under node permutations, so symmetric initializations remain symmetric under deterministic gradient steps. We acknowledge, however, that the current text only sketches the effect of finite n, p, and stochastic gradients rather than supplying quantitative bounds. We will add a new lemma with a perturbation analysis establishing that symmetry-breaking terms remain O(1/sqrt(n)) and that trajectories stay sufficiently close to the manifold for the Möbius attractor to govern the long-term dynamics. revision: yes
Referee: [Abstract (supervision side)] The parameter-free decay law is presented as predicting the observed cosine gap and as independent of the supervision choice, yet the abstract provides no derivation details establishing that the law is obtained from first principles rather than reducing to quantities defined by the cascade loss itself; this circularity risk directly affects the claim that the law is predictive and that end-to-end supervision is provably insufficient.

Authors: The decay law (np)^{-(D-c-2)/2} is obtained by analyzing gradient magnitudes under end-to-end supervision alone; it follows from the repeated fan-out of the graph and does not invoke any property of Cascade Supervision. The main text (Section 4.1) contains the first-principles derivation. The abstract is necessarily concise, which may have created the impression of circularity. We will revise the abstract to include a one-sentence outline of the gradient-propagation argument and to state explicitly that the law applies to end-to-end supervision, thereby clarifying its independence from the cascade loss class. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper presents the Möbius attractor as a reduction of layerwise dynamics to a 1D map under the explicit S_n-symmetry assumption in the tree regime, with the zero set identified as containing the equal-weight superposition state. The decay law is stated as parameter-free and derived from the explicit gradient scaling expression (np)^{-(D-c-2)/2} for end-to-end supervision, yielding the specific numerical predictions 0.35 vs. 0.71 that are then compared to experimental outcomes. No quoted step equates a claimed prediction or first-principles result to its own fitted inputs or to a self-citation chain; the architectural and supervisional contributions are treated as independent inputs whose consequences are derived and externally validated. The central thesis therefore retains independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the reduction of layerwise dynamics to a Möbius map under symmetry assumptions and on the three conditions that Cascade Supervision must satisfy; no free parameters are explicitly fitted in the decay law, but the tree-regime and S_n-symmetry are domain assumptions.

axioms (2)

domain assumption: Under S_n-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map.
Invoked in the architectural contribution paragraph to locate the equal-weight superposition state on the zero set.
domain assumption: End-to-end supervision fails gradient persistence condition (B).
Used to explain why internal gradients decay as (np)^{-(D-c-2)/2} and stall before the manifold.

invented entities (2)

Möbius attractor (no independent evidence)
purpose: Dynamical mechanism that pulls layer states toward the equal-weight superposition manifold
Introduced as the architectural contribution that creates a codimension-one manifold of global optima.
Cascade Supervision (no independent evidence)
purpose: Loss class delivering selectivity bootstrap, gradient persistence, and per-step discrimination
Introduced as the supervisional contribution that satisfies the three conditions end-to-end supervision lacks.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Gregor Bachmann and Vaishnavh Nagarajan

Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024.

work page arXiv 2023
[2]

Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian

Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond A*: Better planning with transformers via search dynamics bootstrapping. ArXiv, abs/2402.14083.

work page arXiv
[3]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. ArXiv, abs/2402.12875, 2024. URL https://api.semanticscholar.org/CorpusID:267760184.

work page doi:10.1162/tacl_a_00562 2024
[4]

in the commutant basis; the Möbius scalars are A := 1 + ηa and B := ηb′_1 with η > 0 a fixed post-MLP gain (definition in App. F). Cascade variables. z_c ∈ S^{d−1}: hidden state at depth c (z_0 = u_r). The ideal target is z*_c := k_c^{−1/2} Σ_{v∈V_c} u_v, and the frontier-only target is z*_new := m_f^{−1/2} Σ_{w∈V_{c+1}\V_c} u_w, with z*_c ⊥ z*_new in the orthogonal regime. The post-attenti...

work page
[5]

The space of S_n-equivariant linear maps R^n → R^n is the two-dimensional commutant M_inv = span{I_n, J_n}, where J_n = 1 1^⊤

work page
[6]

The space of S_n-fixed matrices in R^{n×n} under simultaneous conjugation coincides with M_inv

work page
[7]

Any S_n-equivariant nonlinear map of the form x ↦ T_2 σ(T_1 x + b_1) + b_2 with T_1, T_2 ∈ M_inv, biases b_1, b_2 ∈ span(1) and coordinate-wise σ cannot inject mass into the ker T_1 ∩ ker T_2 subspace; in particular no equivariant MLP can recover an embedding direction that the preceding linear stage projected to zero. Proof. Step 1: isotypic decomposition. The standard permu...

work page 2010
[8]

For any t ∈ (0,1), P(|k_c − (np)^c| > t(np)^c) ≤ 2 exp(−t²(np)^c/3)

work page
[9]

4. m_f^{(c)} = k_{c+1} − k_c = k_c(np − 1)(1 + o(1)) in the tree regime, so m_f^{(c)} = (np)^c (np − 1)(1 ± O(√(log n/(np)^c)))

In particular, |V_c| = (np)^c (1 ± O(√(log n/(np)^c))) with probability 1 − O(n^{−1}), uniformly in c ≤ D. 4. m_f^{(c)} = k_{c+1} − k_c = k_c(np − 1)(1 + o(1)) in the tree regime, so m_f^{(c)} = (np)^c (np − 1)(1 ± O(√(log n/(np)^c))). Proof. Step 1: Galton–Watson coupling. Couple BFS exploration of G(n, p) from r with a Galton–Watson tree of offspring Binomial(n−1, p), mean np − p. In the...

work page
[10]

Σ_{j∉S} σ_j ≤ (K − |S|) e^{−Δ}/|S|

work page
[11]

For i ∈ S and k ∉ S, |∂σ_i/∂a_k| = σ_i σ_k ≤ e^{−Δ}/|S|²

work page
[12]

learnability cliff

Restricted to perturbations δa with δa|_S = 0, the softmax Jacobian has spectral norm O(e^{−Δ}). This implies that once selectivity S_c → 1 at step c, the gradient pulled back through the softmax decays like e^{−βR_min}, locking the cascade variable γ_c in place against perturbations supported on non-frontier edges. Proof. (1) From a_i ≥ a_j + Δ for any i ∈ S, j ∉ S,...

work page 2025
[13]

The two-term decomposition matches the bias-variance analysis of §6

from 0.91 to 0.27 with S_1 ≤ 0.18; both the structural ceiling of Step 2 and the secondary collapse of selectivity (logit-gap-from-residual) contribute. The two-term decomposition matches the bias-variance analysis of §6. Remark. The proof is independent of α, η, and the selectivity ladder of Thm. 5: the √(m_f^{(c)}/k_{c+1}) ceiling holds even at infinite selecti...

work page
[14]

For b′_1 < 0 (forward-invariant by Lem

Then h_old = η σ(p_old) + ζ σ(b′_1), h_new = η σ(p_new) + ζ σ(b′_1). For b′_1 < 0 (forward-invariant by Lem. 30), σ(b′

work page
[15]

Global Convergence

= 0 and the ζ contribution vanishes. The ReLU-active conditions split naturally into two: a/√k_c > |b′_1| (z*_c-channel) (H3a), and a α/m_f^{(c)} > |b′_1| (z*_new-channel) (H3b), for c = 0, ..., D−1. We close them a priori, without invoking any post-hoc value of (α, a, b′_1), by a two-phase argument: (H3a) holds at τ = 0 and is forward-invariant for any standard in...

work page
[16]

Defining A := 1 + ηa and B := ηb′_1, α_eff,c = (αA + m_f^{(c)} B)/(A + √k_c B)

= (α(1 + ηa) + m_f^{(c)} ηb′_1)/((1 + ηa) + √k_c ηb′_1). Defining A := 1 + ηa and B := ηb′_1, α_eff,c = (αA + m_f^{(c)} B)/(A + √k_c B). This is a Möbius transform of the projective coordinate τ := A/B ∈ CP^1; equivalently α_eff,c(τ) = (ατ + m_f^{(c)})/(τ + √k_c). Tightness and remarks. The reduction is exact under (R1)–(R5) and the ReLU-active condition; relaxing the latter (i.e. b′_1 ne...

work page
[17]

= 0); it re-appears as a bifurcation parameter when b′_1 crosses zero, but no observed trajectory crosses this boundary (Lem. 30). G Proof of Prop. 3: {A = 0} is the unique optimum. Setup. With α_eff,c in Möbius form (App. F), the saturated cosine objective is L = Σ_{c=0}^{D−1} [1 − cos_c(α_eff,c)] where the per-step cosine attains its unique maximum 1 at α*_eff,c := m_f^{(c)}...

work page
[18]

= −g(τ)t(τ)/b′_1(τ) where g := ∂L̃/∂t > 0 (Lem. 29). With b′_1 < 0, t > 0, g > 0, the right-hand side has the sign of −gt/b′_1 > 0, so ḃ′_1 > 0, meaning b′_1 increases towards 0 from below. However the rate vanishes near zero: as b′_1 → 0−, |ḃ′_1| = gt/|b′_1| → ∞ but the t-flow simultaneously drives t → 0 (Lem. 31); a Nagumo tangent-cone argument on the clos...

work page 2000
[19]

This establishes a descent rate V̇|_{Stage 1} ≤ −κ_A V at the saddle

Stage 1 (selectivity bootstrap) used a constant κ_A > 0, supplied by condition (A). This establishes a descent rate V̇|_{Stage 1} ≤ −κ_A V at the saddle

work page
[20]

This contributes a multiplicative factor R_c² ≥ poly(D)^{−2} to the descent rate

Stage 2–3 (ladder + error contraction) used the gradient sustenance R_c = Θ(1), supplied by condition (B), with the additional polynomial overhead poly(D) from condition (B)'s lower bound. This contributes a multiplicative factor R_c² ≥ poly(D)^{−2} to the descent rate

work page
[21]

This contributes a final multiplicative factor I_min λ_⊥ to the contraction near the Möbius variety

Stage 4 (Möbius locking) used the Hessian gap λ_⊥ · I_min, supplied by condition (C) via the Fisher-info-induced curvature lower bound. This contributes a final multiplicative factor I_min λ_⊥ to the contraction near the Möbius variety. Composing the three Lyapunov estimates (multiplicativity holds because each stage operates on a disjoint coordinate block: ...

work page
