pith. machine review for the scientific record.

arxiv: 2605.04971 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords geometric continuity · residual connections · symmetry breaking · weight matrices · singular vectors · deep neural networks · transformers · gradient coherence

The pith

Residual connections and symmetry-breaking nonlinearities cause geometric continuity in deep network weight matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Weight matrices in deep networks show geometric continuity when the leading singular vectors of adjacent layers point in similar directions. The paper traces this to two mechanisms isolated in toy MLPs and small transformers: residual connections create coherent gradients across layers that align weight updates, while symmetry-breaking nonlinearities lock layers into one shared coordinate frame and block rotational drift. Experiments show that a network trained with a nonlinear but rotation-preserving activation fails to develop continuity, isolating symmetry breaking rather than nonlinearity itself as the active ingredient. Activation focuses continuity on the top singular direction while normalization spreads it across multiple directions; in transformers the effect is projection-specific, with Q, K, Gate, and Up showing input-space continuity and O and Down showing output-space continuity.
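As a concrete reading of that metric, here is a minimal sketch (our reconstruction, not the authors' released code) that scores adjacent layers by the absolute cosine similarity of their leading right singular vectors; all function and variable names are illustrative.

```python
# Minimal sketch of the continuity measurement described above (assumed
# convention): geometric continuity between adjacent layers is the |cosine|
# between their leading right singular vectors v1. The absolute value handles
# the arbitrary sign of a singular vector.
import torch

def top_right_singular_vector(W: torch.Tensor) -> torch.Tensor:
    # W has shape (out_features, in_features); rows of Vh are right singular vectors.
    _, _, Vh = torch.linalg.svd(W, full_matrices=False)
    return Vh[0]  # v1, unit norm, lives in the layer's input space

def adjacent_layer_continuity(weights: list[torch.Tensor]) -> list[float]:
    v1s = [top_right_singular_vector(W) for W in weights]
    return [torch.abs(torch.dot(v1s[i], v1s[i + 1])).item() for i in range(len(v1s) - 1)]

# Untrained, randomly initialized layers should give values near 0; the paper's
# claim is that training with residuals + ReLU drives them toward 1.
layers = [torch.randn(256, 256) for _ in range(16)]
print(adjacent_layer_continuity(layers))
```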

Core claim

Geometric continuity arises because residual connections produce cross-layer gradient coherence that aligns updates, while symmetry-breaking nonlinearities constrain all layers to a common coordinate frame and prevent the rotation drift that would otherwise destabilize weight structure.

What carries the argument

Cross-layer gradient coherence from residuals combined with rotational symmetry breaking by nonlinear activations.

If this is right

  • Activation concentrates continuity in the leading singular direction while normalization distributes it across multiple directions.
  • In transformers, continuity is projection-specific: Q, K, Gate, and Up develop input-space continuity while O and Down develop output-space continuity.
  • V projections, lacking an adjacent nonlinearity, show only low continuity.
  • Replacing symmetry-breaking activations with rotation-preserving ones eliminates continuity even though nonlinearity remains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the same mechanisms dominate at scale, removing residuals or switching to rotation-preserving activations would reduce cross-layer alignment and potentially increase training instability in large models.
  • The projection-specific pattern suggests continuity supports stable information flow through the residual stream in attention blocks.
  • Measuring whether higher continuity correlates with better generalization on held-out data would test a direct link between this geometry and task performance.

Load-bearing premise

The mechanisms isolated in toy MLPs and small transformers are the dominant causes of geometric continuity in large practical networks.

What would settle it

Training a deep MLP or transformer without residual connections, or with only rotation-preserving activations, and then observing that the principal singular vectors of adjacent layers still align would falsify the claimed mechanisms.
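A minimal sketch of that ablation, under assumptions of our own (PyTorch; a norm-only activation stands in as one possible rotation-preserving nonlinearity, since this page does not name the paper's exact choice):

```python
# Sketch of the ablation grid: matched MLPs differing only in residual
# connections and in whether the activation breaks rotational symmetry.
# NormOnly is an illustrative rotation-preserving nonlinearity: it rescales
# each hidden vector by a nonlinear function of its norm, so rotating the
# hidden space commutes with the activation.
import torch
import torch.nn as nn

class NormOnly(nn.Module):
    def forward(self, x):
        norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        return x * torch.tanh(norm) / norm  # nonlinear in ||x||, direction preserved

class Block(nn.Module):
    def __init__(self, dim, residual, act):
        super().__init__()
        self.lin, self.act, self.residual = nn.Linear(dim, dim), act(), residual

    def forward(self, x):
        h = self.act(self.lin(x))
        return x + h if self.residual else h

def make_mlp(dim=256, depth=16, residual=True, act=nn.ReLU):
    return nn.Sequential(*[Block(dim, residual, act) for _ in range(depth)])

# Train each variant on the same task, then compute adjacent-layer v1 continuity
# (see the earlier sketch). The claimed mechanisms predict high continuity only
# when both residuals and a symmetry-breaking activation are present.
variants = {
    "Res+ReLU": make_mlp(residual=True, act=nn.ReLU),
    "Res+NormOnly": make_mlp(residual=True, act=NormOnly),
    "NoRes+ReLU": make_mlp(residual=False, act=nn.ReLU),
}
```

The design deliberately keeps width, depth, data, and optimizer fixed so that only the two claimed factors vary; if the NoRes or NormOnly variants still developed aligned v1 trajectories after training, the proposed mechanism would be falsified.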

Figures

Figures reproduced from arXiv: 2605.04971 by Honggyo Suh, Kyungwon Jeong, Won-Gi Paeng.

Figure 1. Geometric continuity of weight v1 across layers. 3D PCA of principal right singular vectors from a 16-layer MLP trained on MNIST. (a) Before training: random. (b) Res+ReLU: smooth trajectory. (c) Res+None: activation removed, low continuity. (d) NoRes+ReLU: residual removed, weak continuity. Both residual connections and symmetry-breaking nonlinearity are necessary.
Figure 2. Rotation drift and continuity collapse without activation (both configurations use small initialization σ=0.0001, MNIST, 50 epochs). (a) Per-layer weight v1 rotation angle from the epoch-1 reference (left two panels): Res+None layers all rotate to ∼85–90° (mutual misalignment), while Res+ReLU layers rotate only ∼25–35°, coherently. (b) Inter-layer weight (red) and gradient (blue) v1 continuity, and test accuracy …
Figure 3. Residual stream read/write structure of a transformer block. Each projection either reads from the residual stream (green up arrows: Q, K, V, Gate, Up) or writes to it (red down arrows: O, Down). Nonlinearities (softmax, σ) are shown as blue boxes. Section 5.2 uses this structure to predict each projection's continuity space.
Figure 4. 3D PCA of principal right singular vectors (v1) across the 32 layers of Llama-3.1-8B [5]. Colors indicate layer index (blue to yellow). Q, K, Up, Gate show smooth trajectories; V, O, Down are scattered; the OV composite (W_O W_V) shows moderate structure.
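The trajectory views in Figures 1, 4, and 5 appear to follow a simple recipe; a hedged sketch of how one could reproduce that kind of plot (our assumed workflow, using scikit-learn and matplotlib; names are ours):

```python
# Sketch of the trajectory visualization: stack one leading singular vector per
# layer, project the stack to 3D with PCA, and color points by layer depth.
# A smooth, ordered curve indicates geometric continuity; a scattered cloud does not.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_singular_vector_trajectory(vecs: np.ndarray, title: str):
    # vecs: (num_layers, hidden_dim), e.g. v1 of every Q projection.
    vecs = vecs.copy()
    # Flip signs of consecutive vectors first, since the SVD sign is arbitrary.
    for i in range(1, len(vecs)):
        if np.dot(vecs[i], vecs[i - 1]) < 0:
            vecs[i] = -vecs[i]
    coords = PCA(n_components=3).fit_transform(vecs)
    ax = plt.figure().add_subplot(projection="3d")
    ax.plot(*coords.T, alpha=0.4)
    ax.scatter(*coords.T, c=np.arange(len(coords)), cmap="viridis")
    ax.set_title(title)
    plt.show()
```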
Figure 5. 3D PCA of principal left singular vectors (u1) across 32 layers. Colors indicate layer index (blue to yellow). O, Down show smooth trajectories; Q, K, V, Gate, Up are scattered, the opposite of the v1 pattern in Figure 4.
Figure 6. Layer-wise continuity (cosine similarity) between adjacent principal singular vectors.
Figure 7. GPT-2 XL geometric continuity.
Figure 8. Qwen3-8B geometric continuity.
Figure 9. Gemma-3-12B geometric continuity.
Figure 10. EXAONE-4.0-32B geometric continuity.
Figure 11. Llama-3.1-70B geometric continuity.
Figure 12. Gradient accumulation forms weight structure. (a) Weight v1 continuity starts at 0.04 and rises to 0.96 during training, while gradient v1 continuity is already high at the first backward pass, confirming a causal direction from gradient to weight. (b) Weight v1 aligns most strongly with the long-term gradient average (EMA β=0.999: alignment ∼0.85), confirming that weight structure reflects cumulative gradient …
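The alignment quantity in Figure 12(b) can be read as follows; a minimal sketch under our own assumptions (PyTorch, one tracked layer, β=0.999 as in the figure; class and function names are ours):

```python
# Sketch: maintain an exponential moving average of a layer's gradient and
# measure the |cosine| between the leading right singular vectors of the
# current weight and of the EMA gradient (the quantity plotted in Figure 12b).
import torch

def v1(M: torch.Tensor) -> torch.Tensor:
    return torch.linalg.svd(M, full_matrices=False).Vh[0]

class GradEMA:
    def __init__(self, shape, beta: float = 0.999):
        self.beta, self.ema = beta, torch.zeros(shape)

    def update(self, grad: torch.Tensor):
        self.ema = self.beta * self.ema + (1.0 - self.beta) * grad

    def alignment_with(self, W: torch.Tensor) -> float:
        return torch.abs(torch.dot(v1(W), v1(self.ema))).item()

# In a training loop, after loss.backward():
#   tracker.update(layer.weight.grad.detach())
#   log(tracker.alignment_with(layer.weight.detach()))
```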
Figure 13. Gradient rank and weight continuity across datasets and architectures. (a) Within Res+ReLU, weight v1 continuity trends downward with gradient effective rank across six classification datasets (erank 1.6–15.4). (b) Across three architectures (× six datasets = 18 points), continuity is determined by architecture (color bands), not gradient rank (x-axis): Res+ReLU achieves v1 continuity > 0.92 regardless of gradient …
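The gradient effective rank on the x-axis of Figure 13 is, per reference [14], the exponential of the entropy of the normalized singular-value distribution; a short sketch of that quantity (our implementation):

```python
# Effective rank (Roy & Vetterli, reference [14]): exp of the Shannon entropy
# of the singular values normalized to sum to one. Ranges from 1 (rank-one
# matrix) to min(rows, cols) (flat spectrum).
import torch

def effective_rank(M: torch.Tensor, eps: float = 1e-12) -> float:
    s = torch.linalg.svdvals(M)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

print(effective_rank(torch.randn(256, 256)))  # large: a sizeable fraction of 256 for an i.i.d. Gaussian matrix
print(effective_rank(torch.ones(256, 256)))   # close to 1 for a rank-one matrix
```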
read the original abstract

Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we identify two mechanisms: residual connections create cross-layer gradient coherence that aligns weight updates across layers, and symmetry-breaking nonlinearities constrain all layers to a shared coordinate frame, preventing the rotation drift that would otherwise destabilize weight structure. Crucially, a nonlinear but rotation-preserving activation fails to retain continuity, isolating symmetry breaking -- not nonlinearity itself -- as the active ingredient. Activation and normalization play distinct roles: activation concentrates continuity in the leading singular direction, while normalization distributes it across multiple directions. In transformers, continuity is projection-specific: Q, K, Gate, and Up (which read from the residual stream) develop input-space ($\mathbf{v}_1$) continuity; O and Down (which write to it) develop output-space ($\mathbf{u}_1$) continuity; V alone, lacking an adjacent nonlinearity, develops only low continuity.
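For readers tripped up by the $\mathbf{v}_1$ / $\mathbf{u}_1$ notation, the standard SVD convention the abstract appears to use is restated here (our restatement, not text from the paper; the absolute value reflects our reading that singular-vector sign is arbitrary):

```latex
W_\ell = U_\ell \Sigma_\ell V_\ell^{\top}, \qquad
\mathbf{v}_1^{(\ell)} = \text{leading right singular vector (input space)}, \qquad
\mathbf{u}_1^{(\ell)} = \text{leading left singular vector (output space)};
\quad
C_v(\ell) = \bigl|\langle \mathbf{v}_1^{(\ell)}, \mathbf{v}_1^{(\ell+1)} \rangle\bigr|, \qquad
C_u(\ell) = \bigl|\langle \mathbf{u}_1^{(\ell)}, \mathbf{u}_1^{(\ell+1)} \rangle\bigr|.
```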

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that geometric continuity in deep neural networks—where principal singular vectors of adjacent weight matrices align—arises from two mechanisms identified via controlled experiments: residual connections induce cross-layer gradient coherence that aligns updates across layers, while symmetry-breaking nonlinearities enforce a shared coordinate frame that prevents rotational drift. A rotation-preserving but nonlinear activation fails to maintain continuity, isolating symmetry breaking as the key factor rather than nonlinearity per se. Activation concentrates continuity in the leading singular direction, normalization distributes it across multiple directions, and in small transformers continuity is projection-specific (Q/K/Gate/Up develop input-space v1 continuity; O/Down develop output-space u1 continuity; V shows low continuity).

Significance. If the identified mechanisms prove dominant beyond toy regimes, the work supplies a mechanistic account of an empirically noted but previously unexplained property of trained networks, with potential to guide architecture choices involving residuals and activations. The strength lies in the use of targeted ablations on toy MLPs and small transformers to causally isolate residual connections and symmetry breaking, rather than relying on post-hoc correlations. This empirical identification approach is a positive contribution to understanding emergent geometric properties in deep learning.

major comments (2)
  1. [Abstract] Abstract and experimental sections: The central claim that residual connections and symmetry-breaking nonlinearities explain geometric continuity in deep neural networks rests on the untested assumption that these factors dominate in large-scale practical models; no scaling experiments, comparisons to standard large transformers, or ablations varying optimizer or data distribution are provided to rule out overriding effects from those sources.
  2. [Experiments] Toy MLP and transformer experiments: Results on singular-vector alignment and gradient coherence are reported without error bars, multiple random seeds, or statistical tests, making it impossible to assess whether the observed differences (e.g., between symmetry-breaking and rotation-preserving activations) are robust or could arise from initialization variance.
minor comments (1)
  1. [Abstract] The abstract introduces v1 (input-space) and u1 (output-space) continuity for transformer projections without a preceding definition or reference to a figure illustrating the singular-vector decomposition; this notation should be clarified on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the value of our controlled ablation approach. We address each major comment below and commit to revisions that clarify scope and improve statistical reporting without misrepresenting the current results.

read point-by-point responses
  1. Referee: [Abstract] The central claim that residual connections and symmetry-breaking nonlinearities explain geometric continuity in deep neural networks rests on the untested assumption that these factors dominate in large-scale practical models; no scaling experiments, comparisons to standard large transformers, or ablations varying optimizer or data distribution are provided to rule out overriding effects from those sources.

    Authors: We agree that the experiments are limited to toy MLPs and small transformers and provide no direct evidence that the identified mechanisms dominate at scale. The manuscript's contribution is the causal isolation of residual connections and symmetry breaking within these controlled regimes. In revision we will (i) rewrite the abstract and introduction to explicitly restrict claims to the studied settings, (ii) add a dedicated Limitations section that discusses the absence of scaling studies, large-transformer comparisons, and optimizer/data ablations, and (iii) outline concrete directions for future work. These textual changes will be made; no new large-scale experiments are feasible for this revision. revision: partial

  2. Referee: [Experiments] Results on singular-vector alignment and gradient coherence are reported without error bars, multiple random seeds, or statistical tests, making it impossible to assess whether the observed differences (e.g., between symmetry-breaking and rotation-preserving activations) are robust or could arise from initialization variance.

    Authors: We accept this criticism. Although the qualitative patterns were reproducible in our development runs, the manuscript does not report variance across seeds. We will re-execute the core MLP and transformer experiments with a minimum of five independent random seeds, add error bars (standard deviation) to all alignment and coherence plots, and include statistical significance tests (paired t-tests) for the key contrasts, such as symmetry-breaking versus rotation-preserving activations; an illustrative sketch of that comparison appears below. The revised figures and text will reflect these additions. revision: yes
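For concreteness, the committed seed-level comparison could look like the following sketch; the numbers are placeholders rather than results from the paper, and the condition names are ours.

```python
# Sketch of the planned statistics: one mean adjacent-layer v1 continuity value
# per seed and per condition, compared with a paired t-test across seeds.
# Placeholder values only; the paper's revision would supply the real ones.
from scipy import stats

relu_continuity     = [0.95, 0.93, 0.96, 0.94, 0.95]  # symmetry-breaking activation, 5 seeds
rot_pres_continuity = [0.21, 0.18, 0.25, 0.20, 0.23]  # rotation-preserving activation, same seeds

t_stat, p_value = stats.ttest_rel(relu_continuity, rot_pres_continuity)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```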

Circularity Check

0 steps flagged

Empirical identification of mechanisms via controlled experiments; no derivation reduces to its own inputs

full rationale

The paper's central claims rest on experimental observations from toy MLPs and small transformers, including ablation studies on residual connections, symmetry-breaking nonlinearities, and activation/normalization roles. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The work directly tests causal factors (e.g., rotation-preserving activations failing to retain continuity) without self-definitional loops or imported uniqueness theorems. This qualifies as self-contained empirical analysis against external benchmarks, warranting a zero circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on experimental observations from controlled toy models rather than formal axioms or new postulated entities; no free parameters are introduced to fit the continuity metric.

pith-pipeline@v0.9.0 · 5490 in / 1082 out tokens · 62267 ms · 2026-05-08T17:17:50.652493+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    DOCS: Quantifying weight similarity for deeper insights into large language models

    Zeping Min and Xinshang Wang. DOCS: Quantifying weight similarity for deeper insights into large language models. In International Conference on Learning Representations, 2025

  2. [2]

    Basis sharing: Cross-layer parameter sharing for large language model compression

    Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis sharing: Cross-layer parameter sharing for large language model compression. In International Conference on Learning Representations, 2025

  3. [3]

    The unreasonable ineffectiveness of the deeper layers

    Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887, 2024

  4. [4]

    ShortGPT: Layers in large language models are more redundant than you expect

    Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

  5. [5]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  6. [6]

    Residual alignment: Uncovering the mechanisms of residual networks

    Jianing Li and Vardan Papyan. Residual alignment: Uncovering the mechanisms of residual networks. In Advances in Neural Information Processing Systems, 2023

  7. [7]

    Your transformer is secretly linear

    Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. Your transformer is secretly linear. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  8. [8]

    Gradient descent aligns the layers of deep linear networks

    Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019

  9. [9]

    Implicit regularization of deep residual networks towards neural ODEs

    Pierre Marion, Yu-Han Wu, Michael E Sander, and Gérard Biau. Implicit regularization of deep residual networks towards neural ODEs. In International Conference on Learning Representations, 2024

  10. [10]

    Feature learning as alignment: a structural property of gradient descent in non-linear neural networks

    Daniel Beaglehole, Ioannis Mitliagkas, and Atish Agarwala. Feature learning as alignment: a structural property of gradient descent in non-linear neural networks. arXiv preprint arXiv:2402.05271, 2024

  11. [11]

    Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014

  12. [12]

    On the symmetries of deep learning models and their internal representations

    Charles Godfrey, Davis Brown, Tegan Emerson, and Henry Kvinge. On the symmetries of deep learning models and their internal representations. In Advances in Neural Information Processing Systems, 2022

  13. [13]

    Transformative or conservative? Conservation laws for ResNets and transformers

    Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Transformative or conservative? Conservation laws for ResNets and transformers. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

  14. [14]

    The effective rank: A measure of effective dimensionality

    Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 15th European Signal Processing Conference (EUSIPCO), 2007

  15. [15]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  16. [16]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019

  17. [17]

    Qwen3 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  18. [18]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  19. [19]

    EXAONE 4.0 Technical Report

    LG AI Research. EXAONE 4.0 technical report. arXiv preprint, 2025