The Routing and Filtering Structure of Attention
Pith reviewed 2026-05-20 21:59 UTC · model grok-4.3
The pith
Attention separates into low-rank routing that cascades spectrally with depth and filtering that scales relevance, enabling early-layer linearization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The attention interaction matrix QK^T contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce S-D attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability (Re(λ) ≤ 0) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank 2 at the first layer, expanding with depth across six scales from 7M to 355M parameters.
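To make the decomposition concrete, here is a minimal NumPy sketch of splitting a square interaction matrix into its symmetric and skew-symmetric parts and measuring an effective rank for each. Several things are assumptions rather than the paper's method: the matrix is taken to be the weight kernel `w_q @ w_k.T` with random weights and hypothetical sizes, and "effective rank" is taken to be the entropy-based definition (exponential of the Shannon entropy of the normalized singular values), which the text above does not confirm.

```python
import numpy as np

def split_interaction(m):
    """Return the (symmetric, skew-symmetric) parts of a square matrix."""
    sym = 0.5 * (m + m.T)
    skew = 0.5 * (m - m.T)
    return sym, skew

def effective_rank(m, eps=1e-12):
    """Entropy-based effective rank (assumed definition, see lead-in):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(m, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d_model, d_head = 64, 16                      # hypothetical sizes
w_q = rng.standard_normal((d_model, d_head))  # random stand-ins, not trained weights
w_k = rng.standard_normal((d_model, d_head))
kernel = w_q @ w_k.T                          # d_model x d_model, rank <= d_head

sym, skew = split_interaction(kernel)
print("routing (skew) effective rank:", effective_rank(skew))
print("filtering (sym) effective rank:", effective_rank(sym))

# A single rotation plane: the simplest nonzero skew-symmetric matrix.
plane = np.zeros((d_model, d_model))
plane[0, 1], plane[1, 0] = 1.0, -1.0
print("one-plane effective rank:", effective_rank(plane))
```

Under this definition, the nonzero singular values of a real skew-symmetric matrix come in equal pairs, so an effective rank of 2 corresponds to routing concentrated in a single rotation plane. The skew part of a rank-r kernel has rank at most 2r, which is one way to read "routing capacity allocated by the weight kernel".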
What carries the argument
S-D attention parameterization that separates skew-symmetric routing from symmetric filtering by construction while guaranteeing Re(λ) ≤ 0 for stability.
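The exact S-D form is not given on this page. One natural reading, assumed here purely for illustration, is a generator A = S - D in which S is skew-symmetric and D is symmetric positive semidefinite. For such a matrix every eigenvalue satisfies Re(λ) ≤ 0, because the skew part contributes only imaginary components and the symmetric part of A is -D. The sketch below builds A from unconstrained parameters `b` and `c` (hypothetical names) and checks the bound numerically.

```python
import numpy as np

def sd_generator(b, c):
    """Build A = S - D from unconstrained square parameters b and c.

    S = b - b.T is skew-symmetric by construction.
    D = c @ c.T is symmetric positive semidefinite by construction.
    This is an assumed form, not the paper's confirmed parameterization.
    """
    s = b - b.T
    d = c @ c.T
    return s - d, s, d

rng = np.random.default_rng(1)
n = 8  # hypothetical head dimension
b = rng.standard_normal((n, n))
c = rng.standard_normal((n, n))
a, s, d = sd_generator(b, c)

eigvals = np.linalg.eigvals(a)
print("max Re(lambda):", eigvals.real.max())
```

The design point is that the constraint holds for any values of `b` and `c`, so no projection or normalization step is needed during training to keep the spectrum in the left half-plane.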
If this is right
- Linearizing the first seven layers of 125M S-D attention costs under 5 percent perplexity while standard attention collapses.
- Replacing the first four layers with ELU+1 linear attention reaches within 1.4 percent of baseline performance at full head dimension.
- Cascade-allocated architectures trade 47 to 65 percent fewer attention parameters for a 3.9 to 8.4 percent perplexity increase.
- The linearizable region widens with model depth.
- Effective rank of routing starts at 2 in the first layer and grows across scales from 7M to 355M parameters.
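For readers unfamiliar with ELU+1 linear attention, the sketch below shows the general technique as described by Katharopoulos et al. ("Transformers are RNNs"), not the paper's specific layers. The feature map φ(x) = elu(x) + 1 replaces the softmax, which lets causal attention be computed with a running state instead of an n × n score matrix. Shapes and the random inputs are hypothetical, and the quadratic reference function exists only to check the recurrent form.

```python
import numpy as np

def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(q, k, v):
    """Causal linear attention via a running state, linear in sequence length."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    n, d_v = v.shape
    state = np.zeros((q.shape[1], d_v))  # running sum of phi(k_j) v_j^T
    norm = np.zeros(q.shape[1])          # running sum of phi(k_j)
    out = np.empty((n, d_v))
    for i in range(n):
        state += np.outer(phi_k[i], v[i])
        norm += phi_k[i]
        out[i] = (phi_q[i] @ state) / (phi_q[i] @ norm)
    return out

def causal_quadratic_reference(q, k, v):
    """Same computation written as a masked n x n matrix, for checking only."""
    scores = np.tril(elu_plus_one(q) @ elu_plus_one(k).T)
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
matches = np.allclose(causal_linear_attention(q, k, v),
                      causal_quadratic_reference(q, k, v))
print("recurrent and quadratic forms agree:", matches)
```

Because the state has a fixed size, this form carries no per-token softmax, which is why a layer whose routing is low rank is a plausible candidate for this replacement.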
Where Pith is reading between the lines
- The cascade pattern suggests a natural depth hierarchy that could guide layer-wise allocation of attention complexity in new architectures.
- Similar routing-filtering decompositions might reveal simplification opportunities in attention variants used for vision or multimodal models.
- Testing whether the low-rank routing structure appears in models trained on non-language tasks would check if the cascade is domain-specific or general.
- The decomposition could be used to design hybrid models that apply full attention only after the linearizable prefix.
Load-bearing premise
The skew-symmetric and symmetric decomposition of the attention interaction matrix cleanly separates routing from filtering in a way that preserves model behavior and permits stable training of the S-D parameterization without layer normalization.
What would settle it
Linearizing the first seven layers of a 125M S-D attention model and observing a perplexity increase larger than 5 percent would falsify the claim that the spectral cascade identifies safe simplification regions.
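The 5 percent threshold can be restated in terms of training loss. Assuming perplexity is the exponential of the mean per-token negative log-likelihood in nats (the usual convention, not stated on this page), a relative perplexity increase depends only on the loss gap. The sketch below works through that arithmetic; the function names and the example loss values are illustrative.

```python
import math

def relative_ppl_increase(base_nll, new_nll):
    """Relative perplexity increase given mean per-token NLL in nats."""
    return math.exp(new_nll - base_nll) - 1.0

def exceeds_threshold(base_nll, new_nll, threshold=0.05):
    """True if the intervention costs more than `threshold` in perplexity."""
    return relative_ppl_increase(base_nll, new_nll) > threshold

# The loss gap that corresponds to exactly a 5% perplexity increase.
max_loss_gap = math.log(1.05)
print(f"5% perplexity increase = loss gap of {max_loss_gap:.4f} nats")

# Hypothetical losses, for illustration only.
print(exceeds_threshold(3.00, 3.04))
print(exceeds_threshold(3.00, 3.06))
```

Under that convention the falsification test amounts to checking whether linearization raises the mean loss by more than about 0.049 nats.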
Original abstract
The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(\lambda) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank $2$ at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs ${<}5\%$ perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within $1.4\%$ of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity ($47\%-65\%$ fewer attention parameters at $+3.9\%$ to $+8.4\%$ PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the attention interaction matrix QK^T decomposes into a skew-symmetric routing component (redistributing information across positions) and a symmetric filtering component (scaling mutual relevance). Empirical decomposition of 1776 heads across five pretrained transformers reveals routing at low effective rank, below the capacity of the weight kernel. The authors introduce S-D attention, a parameterization that disentangles routing from filtering by construction with Re(λ) ≤ 0 stability guarantees, enabling stable training without layer normalization. In this disentangled form, routing self-organizes into a spectral cascade with effective rank 2 at layer 1 that expands with depth across model scales (7M to 355M parameters). This cascade identifies linearizable regions: linearizing the first seven layers of 125M S-D attention incurs <5% perplexity cost (vs. collapse in standard attention), with further gains from ELU+1 linear attention in early layers and parameter-efficient cascade-allocated architectures.
Significance. If the routing-filtering decomposition and spectral cascade hold without being artifacts of the S-D parameterization, the work provides a principled, actionable basis for simplifying attention in early layers while preserving performance. This could inform efficient transformer design, such as hybrid linear attention architectures that trade parameters for modest perplexity increases. The empirical scale (multiple models, heads, and scales) and the predictive use of the cascade for linearization experiments are strengths, though the diagnostic value depends on verifying that S-D matches standard attention expressivity and dynamics.
major comments (2)
- [S-D attention definition and stability analysis] S-D attention parameterization (abstract and methods): The claim that the skew-symmetric/symmetric split 'disentangles routing from filtering by construction' with guaranteed stability and equivalent training dynamics requires explicit verification that the constrained parameterization attains the same expressivity as unconstrained attention. Without this, the reported effective rank 2 at layer 1 and spectral expansion could be induced by the Re(λ) ≤ 0 and antisymmetry constraints rather than emerging in standard attention, undermining both the pretrained decomposition results and the linearization predictions.
- [Linearization and cascade-allocated architecture experiments] Linearization experiments (abstract): The result that linearizing the first seven layers costs <5% perplexity in S-D attention but collapses in standard attention is load-bearing for the cascade's predictive value. This comparison needs controls confirming that the performance gap is due to the revealed routing structure rather than differences in training stability or normalization between the two attention forms.
minor comments (2)
- [Abstract and experimental setup] The abstract reports decomposition across 1776 heads but lacks details on error bars, variance across runs, or precise data exclusion rules for the pretrained models; these should be added for verifiability.
- [Introduction or methods] Notation for the skew-symmetric and symmetric components of QK^T should be introduced with explicit matrix equations early in the paper to clarify the decomposition.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation of major revision. We address each major comment below with clarifications drawn from the manuscript and indicate revisions where they strengthen the presentation without altering the core claims.
Point-by-point responses
- Referee: [S-D attention definition and stability analysis] S-D attention parameterization (abstract and methods): The claim that the skew-symmetric/symmetric split 'disentangles routing from filtering by construction' with guaranteed stability and equivalent training dynamics requires explicit verification that the constrained parameterization attains the same expressivity as unconstrained attention. Without this, the reported effective rank 2 at layer 1 and spectral expansion could be induced by the Re(λ) ≤ 0 and antisymmetry constraints rather than emerging in standard attention, undermining both the pretrained decomposition results and the linearization predictions.
  Authors: The decomposition of QK^T into skew-symmetric routing and symmetric filtering is performed directly on the interaction matrices extracted from 1776 heads in five standard pretrained transformers; it does not rely on the S-D parameterization at all. The low effective rank of routing is therefore an independent empirical observation. For the S-D experiments, the parameterization is introduced precisely to enforce disentanglement and stability (Re(λ) ≤ 0) while still permitting the model to reach competitive perplexity. In the revised manuscript we will add a dedicated subsection that (i) sketches how any attention matrix whose eigenvalues satisfy Re(λ) ≤ 0 can be represented in the S-D form and (ii) reports side-by-side training curves and final perplexity for S-D versus standard attention on the same data and scale, confirming that the spectral cascade is not an artifact of the constraints but emerges once routing and filtering are isolated.
  Revision: yes
- Referee: [Linearization and cascade-allocated architecture experiments] Linearization experiments (abstract): The result that linearizing the first seven layers costs <5% perplexity in S-D attention but collapses in standard attention is load-bearing for the cascade's predictive value. This comparison needs controls confirming that the performance gap is due to the revealed routing structure rather than differences in training stability or normalization between the two attention forms.
  Authors: We agree that isolating the contribution of the routing cascade requires matched controls. The current experiments already train both forms from scratch on identical data and report that standard attention collapses under early-layer linearization while S-D does not; however, to further rule out normalization or stability confounds we will add, in revision, (i) a set of standard-attention runs trained without layer normalization and (ii) S-D runs with an auxiliary normalization term restored. We will also include training-loss curves and eigenvalue histograms for both families so that readers can verify that the linearization gap tracks the presence of the low-rank routing cascade rather than differences in optimization dynamics.
  Revision: yes
Circularity Check
No significant circularity; empirical decomposition and diagnostic parameterization are self-contained
The paper decomposes QK^T in 1776 heads from five pretrained transformers to measure low-rank routing, then introduces S-D attention as a diagnostic split (skew-symmetric routing vs. symmetric filtering) that is applied to new models. The spectral cascade is reported as an observed training outcome in disentangled unnormalized S-D models across scales, with downstream predictions tested via perplexity on linearization interventions. No quoted step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the 'by construction' phrasing refers only to the definitional split itself, not to the rank or cascade measurements. External benchmarks (perplexity, parameter counts) remain independent of the internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The attention interaction matrix QK^T decomposes into a skew-symmetric routing component and a symmetric filtering component.
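As a worked form of this axiom, the split below is the standard symmetric/skew-symmetric decomposition of any square matrix. The symbols $M$, $M_{\mathrm{sym}}$, and $M_{\mathrm{skew}}$ are introduced here for illustration and are not taken from the paper; $M$ stands for $QK^{\top}$ in whichever square form the paper uses.

```latex
M = QK^{\top}, \qquad
M = M_{\mathrm{sym}} + M_{\mathrm{skew}}, \qquad
M_{\mathrm{sym}} = \tfrac{1}{2}\left(M + M^{\top}\right), \qquad
M_{\mathrm{skew}} = \tfrac{1}{2}\left(M - M^{\top}\right)

M_{\mathrm{sym}}^{\top} = M_{\mathrm{sym}}, \qquad
M_{\mathrm{skew}}^{\top} = -M_{\mathrm{skew}}, \qquad
x^{\top} M_{\mathrm{skew}}\, x = 0 \quad \text{for all real } x
```

The decomposition itself is an algebraic identity and always exists. What the axiom assumes is the interpretation: that the skew part corresponds to routing and the symmetric part to filtering.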
invented entities (1)
- S-D attention: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce S–D attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability (Re(λ)≤0) and trains stably without layer normalization."
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Routing organizes into a spectral cascade... layer 0 collapses to effective rank 2.00... terminal layer reaches rank 19–21"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Evaluation Sequences and Decomposition Protocol
The pretrained model analysis in Section 3 uses six text sequences spanning different domain...
- “The cat sat on the mat and then it slowly walked to the door.” (15 tokens)
- “Although the government proposed sweeping reforms to the healthcare system last year, the legislature has not yet passed any of the key provisions that were originally outlined in the draft bill submitted by the committee.” (38 tokens)
- “In fluid dynamics, the Navier–Stokes equations describe the motion of viscous fluid substances. These partial differential equations arise from applying Newton’s second law to fluid motion, together with the assumption that the stress in the fluid is the sum of a diffusing viscous term and a pressure term.” (49 tokens)
- “She told him that the book he had lent her, which she had finally finished reading over the weekend, was one of the most thought-provoking novels she had encountered in years.” (35 tokens)
- “The quick brown fox jumps over the lazy dog.” (10 tokens)
- “Scientists at CERN announced that the particle accelerator had produced results consistent with theoretical predictions made decades ago, confirming that the Standard Model remains robust despite numerous attempts to find physics beyond it.” (37 tokens)
All ρ and eigenvalue statistics are averaged across these six sequences per head. Token counts reflect GPT-2 BPE tokenization...