Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3
The pith
Residual connections and symmetry-breaking nonlinearities cause geometric continuity in deep network weight matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geometric continuity arises because residual connections produce cross-layer gradient coherence that aligns updates, while symmetry-breaking nonlinearities constrain all layers to a common coordinate frame and prevent the rotation drift that would otherwise destabilize weight structure.
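The quantity at stake can be made concrete. Below is a minimal numpy sketch (an editorial illustration, not the paper's code) of how geometric continuity is typically measured: the absolute cosine between the leading right singular vectors (v1) of two weight matrices, with a synthetic "aligned" pair built from a shared dominant direction.

```python
import numpy as np

def v1_alignment(w_a, w_b):
    """|cos| between the leading right singular vectors (v1) of two
    weight matrices; ~1 = aligned (continuity), ~0 = unrelated."""
    _, _, vt_a = np.linalg.svd(w_a, full_matrices=False)
    _, _, vt_b = np.linalg.svd(w_b, full_matrices=False)
    # Singular vectors are defined only up to sign, so compare |cos|.
    return float(abs(vt_a[0] @ vt_b[0]))

rng = np.random.default_rng(0)
u, v = rng.normal(size=64), rng.normal(size=64)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
spike = 10.0 * np.outer(u, v)                 # shared dominant direction
w1 = spike + 0.1 * rng.normal(size=(64, 64))  # "adjacent layers" sharing v1
w2 = spike + 0.1 * rng.normal(size=(64, 64))
w3 = rng.normal(size=(64, 64))                # unrelated layer

print(v1_alignment(w1, w2))  # high: continuity
print(v1_alignment(w1, w3))  # low: no continuity
```

The paper's claim, in these terms, is that residuals plus symmetry-breaking activations are what push trained adjacent layers toward the high-alignment regime.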
What carries the argument
Cross-layer gradient coherence from residuals combined with rotational symmetry breaking by nonlinear activations.
If this is right
- Activation concentrates continuity in the leading singular direction while normalization distributes it across multiple directions.
- In transformers, continuity is projection-specific: Q, K, Gate, and Up develop input-space continuity while O and Down develop output-space continuity.
- V projections, lacking an adjacent nonlinearity, show only low continuity.
- Replacing symmetry-breaking activations with rotation-preserving ones eliminates continuity even though nonlinearity remains.
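The last bullet is the sharpest ablation, and the distinction it relies on is easy to demonstrate. The excerpt does not name the paper's rotation-preserving activation, but a radial nonlinearity (rescaling only the norm of the activation vector) is one standard construction with this property; the sketch below, under that assumption, checks numerically that it commutes with rotations while ReLU does not.

```python
import numpy as np

def radial_tanh(x):
    """Nonlinear but rotation-preserving: rescales only the norm, so
    radial_tanh(R @ x) == R @ radial_tanh(x) for any rotation R."""
    n = np.linalg.norm(x)
    return np.tanh(n) * x / n

rng = np.random.default_rng(1)
x = rng.normal(size=8)
# Random rotation via QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# Rotation-preserving: the activation sees no preferred coordinate frame.
print(np.allclose(radial_tanh(q @ x), q @ radial_tanh(x)))  # True

# ReLU breaks the symmetry: it singles out the coordinate axes.
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(relu(q @ x), q @ relu(x)))  # False in general
```

An activation with the first property leaves every layer free to drift in its own rotated frame, which is exactly the drift the paper argues symmetry breaking prevents.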
Where Pith is reading between the lines
- If the same mechanisms dominate at scale, removing residuals or switching to rotation-preserving activations would reduce cross-layer alignment and potentially increase training instability in large models.
- The projection-specific pattern suggests continuity supports stable information flow through the residual stream in attention blocks.
- Measuring whether higher continuity correlates with better generalization on held-out data would test a direct link between this geometry and task performance.
Load-bearing premise
The mechanisms isolated in toy MLPs and small transformers are the dominant causes of geometric continuity in large practical networks.
What would settle it
Training a deep MLP or transformer without residual connections or with only rotation-preserving activations and then observing that principal singular vectors of adjacent layers still align would falsify the claimed mechanisms.
Figures
Figure 4 (from the paper): 3D PCA of principal right singular vectors (v1) across 32 layers; colors indicate layer index (blue to yellow). Q, K, Up, Gate show smooth trajectories; V, O, Down are scattered; the OV composite (W_O W_V) shows moderate structure.
read the original abstract
Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we identify two mechanisms: residual connections create cross-layer gradient coherence that aligns weight updates across layers, and symmetry-breaking nonlinearities constrain all layers to a shared coordinate frame, preventing the rotation drift that would otherwise destabilize weight structure. Crucially, a nonlinear but rotation-preserving activation fails to retain continuity, isolating symmetry breaking -- not nonlinearity itself -- as the active ingredient. Activation and normalization play distinct roles: activation concentrates continuity in the leading singular direction, while normalization distributes it across multiple directions. In transformers, continuity is projection-specific: Q, K, Gate, and Up (which read from the residual stream) develop input-space ($\mathbf{v}_1$) continuity; O and Down (which write to it) develop output-space ($\mathbf{u}_1$) continuity; V alone, lacking an adjacent nonlinearity, develops only low continuity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that geometric continuity in deep neural networks—where principal singular vectors of adjacent weight matrices align—arises from two mechanisms identified via controlled experiments: residual connections induce cross-layer gradient coherence that aligns updates across layers, while symmetry-breaking nonlinearities enforce a shared coordinate frame that prevents rotational drift. A rotation-preserving but nonlinear activation fails to maintain continuity, isolating symmetry breaking as the key factor rather than nonlinearity per se. Activation concentrates continuity in the leading singular direction, normalization distributes it across multiple directions, and in small transformers continuity is projection-specific (Q/K/Gate/Up develop input-space v1 continuity; O/Down develop output-space u1 continuity; V shows low continuity).
Significance. If the identified mechanisms prove dominant beyond toy regimes, the work supplies a mechanistic account of an empirically noted but previously unexplained property of trained networks, with potential to guide architecture choices involving residuals and activations. The strength lies in the use of targeted ablations on toy MLPs and small transformers to causally isolate residual connections and symmetry breaking, rather than relying on post-hoc correlations. This empirical identification approach is a positive contribution to understanding emergent geometric properties in deep learning.
major comments (2)
- [Abstract] Abstract and experimental sections: The central claim that residual connections and symmetry-breaking nonlinearities explain geometric continuity in deep neural networks rests on the untested assumption that these factors dominate in large-scale practical models; no scaling experiments, comparisons to standard large transformers, or ablations varying optimizer or data distribution are provided to rule out overriding effects from those sources.
- [Experiments] Toy MLP and transformer experiments: Results on singular-vector alignment and gradient coherence are reported without error bars, multiple random seeds, or statistical tests, making it impossible to assess whether the observed differences (e.g., between symmetry-breaking and rotation-preserving activations) are robust or could arise from initialization variance.
minor comments (1)
- [Abstract] The abstract introduces v1 (input-space) and u1 (output-space) continuity for transformer projections without a preceding definition or reference to a figure illustrating the singular-vector decomposition; this notation should be clarified on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the value of our controlled ablation approach. We address each major comment below and commit to revisions that clarify scope and improve statistical reporting without misrepresenting the current results.
read point-by-point responses
-
Referee: [Abstract] The central claim that residual connections and symmetry-breaking nonlinearities explain geometric continuity in deep neural networks rests on the untested assumption that these factors dominate in large-scale practical models; no scaling experiments, comparisons to standard large transformers, or ablations varying optimizer or data distribution are provided to rule out overriding effects from those sources.
Authors: We agree that the experiments are limited to toy MLPs and small transformers and provide no direct evidence that the identified mechanisms dominate at scale. The manuscript's contribution is the causal isolation of residual connections and symmetry breaking within these controlled regimes. In revision we will (i) rewrite the abstract and introduction to explicitly restrict claims to the studied settings, (ii) add a dedicated Limitations section that discusses the absence of scaling studies, large-transformer comparisons, and optimizer/data ablations, and (iii) outline concrete directions for future work. These textual changes will be made; no new large-scale experiments are feasible for this revision. revision: partial
-
Referee: [Experiments] Results on singular-vector alignment and gradient coherence are reported without error bars, multiple random seeds, or statistical tests, making it impossible to assess whether the observed differences (e.g., between symmetry-breaking and rotation-preserving activations) are robust or could arise from initialization variance.
Authors: We accept this criticism. Although the qualitative patterns were reproducible in our development runs, the manuscript does not report variance across seeds. We will re-execute the core MLP and transformer experiments with a minimum of five independent random seeds, add error bars (standard deviation) to all alignment and coherence plots, and include statistical significance tests (paired t-tests) for the key contrasts, such as symmetry-breaking versus rotation-preserving activations. The revised figures and text will reflect these additions. revision: yes
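The statistical fix the authors commit to is straightforward. The numbers below are hypothetical per-seed alignment scores invented to show the shape of the analysis, not results from the paper; the paired t statistic is computed directly with numpy.

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic for per-seed scores under two conditions;
    larger |t| = stronger evidence the conditions differ."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

# Hypothetical v1-alignment per seed (5 seeds each condition).
relu_align = [0.91, 0.88, 0.93, 0.90, 0.89]  # symmetry-breaking activation
rot_align  = [0.22, 0.31, 0.18, 0.27, 0.25]  # rotation-preserving activation

t = paired_t(relu_align, rot_align)
print(t)  # far above the two-sided df=4 critical value (~2.78)
```

With five seeds the test has only four degrees of freedom, so the contrast must be large to register; reporting the per-seed spread alongside the statistic, as promised, is the more informative part of the revision.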
Circularity Check
Empirical identification of mechanisms via controlled experiments; no derivation chain that reduces to its own inputs
full rationale
The paper's central claims rest on experimental observations from toy MLPs and small transformers, including ablation studies on residual connections, symmetry-breaking nonlinearities, and activation/normalization roles. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The work directly tests causal factors (e.g., rotation-preserving activations failing to retain continuity) without self-definitional loops or imported uniqueness theorems. This qualifies as self-contained empirical analysis against external benchmarks, warranting a zero circularity score.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
DOCS: Quantifying weight similarity for deeper insights into large language models
Zeping Min and Xinshang Wang. DOCS: Quantifying weight similarity for deeper insights into large language models. In International Conference on Learning Representations, 2025
2025
-
[2]
Basis Sharing: Cross-layer parameter sharing for large language model compression
Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, and Grace Li Zhang. Basis Sharing: Cross-layer parameter sharing for large language model compression. In International Conference on Learning Representations, 2025
2025
-
[3]
The unreasonable ineffectiveness of the deeper layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887, 2024
2024
-
[4]
ShortGPT: Layers in large language models are more redundant than you expect
Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. ShortGPT: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, 2025
2025
-
[5]
The Llama 3 herd of models
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
2024
-
[6]
Residual alignment: Uncovering the mechanisms of residual networks
Jianing Li and Vardan Papyan. Residual alignment: Uncovering the mechanisms of residual networks. In Advances in Neural Information Processing Systems, 2023
2023
-
[7]
Your transformer is secretly linear
Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. Your transformer is secretly linear. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
2024
-
[8]
Gradient descent aligns the layers of deep linear networks
Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019
2019
-
[9]
Implicit regularization of deep residual networks towards neural ODEs
Pierre Marion, Yu-Han Wu, Michael E Sander, and Gérard Biau. Implicit regularization of deep residual networks towards neural ODEs. In International Conference on Learning Representations, 2024
2024
-
[10]
Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Daniel Beaglehole, Ioannis Mitliagkas, and Atish Agarwala. Feature learning as alignment: a structural property of gradient descent in non-linear neural networks. arXiv preprint arXiv:2402.05271, 2024
2024
-
[11]
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014
2014
-
[12]
On the symmetries of deep learning models and their internal representations
Charles Godfrey, Davis Brown, Tegan Emerson, and Henry Kvinge. On the symmetries of deep learning models and their internal representations. In Advances in Neural Information Processing Systems, 2022
2022
-
[13]
Transformative or conservative? Conservation laws for ResNets and transformers
Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Transformative or conservative? Conservation laws for ResNets and transformers. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025
2025
-
[14]
The effective rank: A measure of effective dimensionality
Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 15th European Signal Processing Conference (EUSIPCO), 2007
2007
-
[15]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016
2016
-
[16]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019
2019
-
[17]
Qwen3 technical report
An Yang, Baosong Yang, Beichen Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
2025
-
[18]
Gemma 3 technical report
Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025
2025
-
[19]
EXAONE 4.0 technical report
LG AI Research. EXAONE 4.0 technical report. arXiv preprint, 2025
2025
discussion (0)