
arxiv: 2605.18826 · v1 · pith:RBR5UVUPnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

The Routing and Filtering Structure of Attention

Pith reviewed 2026-05-20 21:59 UTC · model grok-4.3

keywords: attention mechanism · transformer models · routing and filtering · spectral cascade · low-rank structure · linear attention · parameter efficiency · model simplification

The pith

Attention separates into low-rank routing that cascades spectrally with depth and filtering that scales relevance, enabling early-layer linearization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper decomposes the attention interaction matrix QK^T into a skew-symmetric routing component that redistributes information between positions and a symmetric filtering component that scales mutual relevance. Across 1776 heads in five pretrained transformers, routing operates at low effective rank, well below the capacity allocated by the weight kernel. The authors introduce S-D attention, a parameterization that disentangles the two by construction and trains stably without layer normalization. When isolated and unnormalized, routing self-organizes into a spectral cascade whose effective rank begins at 2 in the first layer and expands with depth, across models from 7M to 355M parameters. This structure identifies which layers tolerate simplification: linearizing the first seven layers of a 125M S-D model raises perplexity by less than 5 percent, while standard attention collapses under the same change.
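To make the split concrete, here is a minimal NumPy sketch of separating an interaction matrix into its skew-symmetric and symmetric parts and measuring an effective rank. The entropy-based effective rank (exponential of the Shannon entropy of the normalized singular values) is an assumption made for illustration; this page does not state which definition the paper uses. The random `Q` and `K`, and the names `split_interaction` and `effective_rank`, are likewise hypothetical.

```python
import numpy as np

def split_interaction(M):
    """Return (skew, sym) such that M = skew + sym."""
    skew = 0.5 * (M - M.T)   # candidate "routing" part
    sym = 0.5 * (M + M.T)    # candidate "filtering" part
    return skew, sym

def effective_rank(A, eps=1e-12):
    """Assumed definition: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    total = s.sum()
    if total < eps:
        return 0.0
    p = s / total
    p = p[p > eps]           # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

# Hypothetical queries and keys for one head: n positions, head dimension d.
rng = np.random.default_rng(0)
n, d = 16, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
M = Q @ K.T                  # n x n interaction matrix (pre-softmax scores)

routing, filtering = split_interaction(M)

# A single rotation plane u v^T - v u^T has two equal singular values,
# so its effective rank under this definition is 2.
u, v = np.eye(n)[0], np.eye(n)[1]
plane = np.outer(u, v) - np.outer(v, u)

print("effective rank of routing part:", effective_rank(routing))
print("effective rank of one rotation plane:", effective_rank(plane))
```

Under this definition a real skew-symmetric matrix has singular values that come in equal pairs, so a single rotation plane corresponds to effective rank 2. That is one way to read the first-layer value reported in the paper, though the page does not confirm this interpretation.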

Core claim

The attention interaction matrix QK^T contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce S-D attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability (Re(λ) ≤ 0) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank 2 at t

What carries the argument

S-D attention parameterization that separates skew-symmetric routing from symmetric filtering by construction while guaranteeing Re(λ) ≤ 0 for stability.
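The stability guarantee can be checked numerically. The sketch below assumes the form L = S − D with a diagonal D whose entries are d_i = softplus(·) + ε, as quoted in the appendix excerpt reproduced under Figure 6 on this page. How the paper produces S and the softplus pre-activations from the input is not shown here, so random values stand in for them; `sd_generator` and its arguments are hypothetical names.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def sd_generator(A, d_raw, eps):
    """Build L = S - D from an arbitrary square A and raw diagonal scores."""
    S = A - A.T                    # skew-symmetric by construction
    d = softplus(d_raw) + eps      # strictly positive diagonal entries
    return S - np.diag(d)

rng = np.random.default_rng(0)
n, eps = 8, 1e-3
A = rng.standard_normal((n, n))    # stand-in for the learned routing weights
d_raw = rng.standard_normal(n)     # stand-in for W_d . x_i + b_d

L = sd_generator(A, d_raw, eps)
max_real = np.linalg.eigvals(L).real.max()
print("largest real part of any eigenvalue:", max_real)
```

Whatever values `A` and `d_raw` take, the largest real part should stay at or below −ε, which is the property that plausibly lets the model train without layer normalization.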

If this is right

  • Linearizing the first seven layers of a 125M S-D attention model raises perplexity by under 5 percent, while standard attention collapses under the same change.
  • Replacing the first four layers with ELU+1 linear attention reaches within 1.4 percent of baseline performance at full head dimension.
  • Cascade-allocated architectures trade 47 to 65 percent fewer attention parameters for a 3.9 to 8.4 percent perplexity increase.
  • The linearizable region widens with model depth.
  • Effective rank of routing starts at 2 in the first layer and grows across scales from 7M to 355M parameters.
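For readers unfamiliar with the ELU+1 variant named above, the following is a minimal single-head sketch of causal linear attention with the feature map φ(x) = elu(x) + 1, in the style of Katharopoulos et al. (reference 16). It is not the paper's implementation: the shapes, the random inputs, and the function names are assumptions for illustration.

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1: x + 1 for x > 0, exp(x) otherwise; always positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V):
    """Causal attention with kernel phi(q) . phi(k), computed as a recurrence."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    n, d_v = V.shape
    state = np.zeros((Q.shape[1], d_v))   # running sum of phi(k_j) v_j^T
    norm = np.zeros(Q.shape[1])           # running sum of phi(k_j)
    out = np.empty((n, d_v))
    for i in range(n):
        state += np.outer(phi_k[i], V[i])
        norm += phi_k[i]
        out[i] = (phi_q[i] @ state) / (phi_q[i] @ norm)
    return out

rng = np.random.default_rng(0)
n, d = 12, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
print(out.shape)
```

The recurrence keeps a fixed-size state per head, so cost grows linearly with sequence length instead of quadratically. That is the saving at stake when early layers are swapped to this form.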

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cascade pattern suggests a natural depth hierarchy that could guide layer-wise allocation of attention complexity in new architectures.
  • Similar routing-filtering decompositions might reveal simplification opportunities in attention variants used for vision or multimodal models.
  • Testing whether the low-rank routing structure appears in models trained on non-language tasks would check if the cascade is domain-specific or general.
  • The decomposition could be used to design hybrid models that apply full attention only after the linearizable prefix.

Load-bearing premise

The skew-symmetric and symmetric decomposition of the attention interaction matrix cleanly separates routing from filtering in a way that preserves model behavior and permits stable training of the S-D parameterization without layer normalization.

What would settle it

Linearizing the first seven layers of a 125M S-D attention model and observing a perplexity increase larger than 5 percent would falsify the claim that the spectral cascade identifies safe simplification regions.
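The test reduces to one relative-change calculation. A minimal sketch follows, assuming "5 percent" means a relative increase over the unmodified model's perplexity. The perplexity values are hypothetical placeholders, not results from the paper.

```python
def relative_ppl_increase(baseline_ppl, modified_ppl):
    """Fractional change in perplexity relative to the baseline."""
    if baseline_ppl <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (modified_ppl - baseline_ppl) / baseline_ppl

def within_budget(baseline_ppl, modified_ppl, budget=0.05):
    """True if the relative increase stays under the stated budget."""
    return relative_ppl_increase(baseline_ppl, modified_ppl) < budget

# Hypothetical numbers for illustration only.
baseline, linearized = 20.0, 20.8
print(relative_ppl_increase(baseline, linearized))
print(within_budget(baseline, linearized))
```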

Figures

Figures reproduced from arXiv: 2605.18826 by Rehan Kapadia, Shafayeth Jamil.

Figure 1: The routing–filtering structure of pretrained models. (a) Mean…
Figure 2: The spectral cascade. (a) Spectral budget of…
Figure 3: Linearization cost follows the cascade. (a) Per-layer…
Figure 4: Spectrum-guided architecture. (a) Val perplexity vs. total parameters across four configura…
Figure 5: The spectral budget of attention in GPT-2 Large (WikiText-2 val, baseline PPL…
Figure 6: S–D attention at 125M parameters (OpenWebText). Training perplexity. S–D without LayerNorm trains stably over 28,000 steps with no divergence.

C The Diagonal Offset ε

The S–D parameterization L = S − D constrains Re(λ) ≤ 0 for any D ⪰ 0. We parameterize d_i = softplus(W_d · x_i + b_d) + ε with b_d a learnable per-head vector and ε a fixed offset. The role of ε is to keep eigenvalues away from the imaginary axis…
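The bound Re(λ) ≤ 0 follows from a short standard argument, sketched here for reference. It assumes S is real skew-symmetric and D is real symmetric positive semidefinite; the last line additionally assumes D is diagonal with entries d_i ≥ ε, as the parameterization above suggests.

```latex
% Let L x = \lambda x with x \in \mathbb{C}^n, x \neq 0, and L = S - D.
\begin{align*}
  x^{*} L x &= \lambda \, \lVert x \rVert^{2}, \\
  \overline{x^{*} S x} &= x^{*} S^{\top} x = -\,x^{*} S x
    \quad\Longrightarrow\quad \operatorname{Re}\!\left(x^{*} S x\right) = 0, \\
  \operatorname{Re}(\lambda)
    &= \frac{\operatorname{Re}\!\left(x^{*} S x\right) - x^{*} D x}{\lVert x \rVert^{2}}
     = -\,\frac{x^{*} D x}{\lVert x \rVert^{2}} \;\le\; 0, \\
  D = \operatorname{diag}(d_i),\; d_i \ge \varepsilon
    &\quad\Longrightarrow\quad
    x^{*} D x = \sum_i d_i \lvert x_i \rvert^{2} \ge \varepsilon \lVert x \rVert^{2}
    \quad\Longrightarrow\quad \operatorname{Re}(\lambda) \le -\varepsilon .
\end{align*}
```

The skew-symmetric part contributes only to the imaginary part of each eigenvalue, while D and the offset ε set how far the spectrum sits from the imaginary axis.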
Figure 7: Cascade emergence during 355M training. (a) Per-layer effective routing rank at log-spaced…
Original abstract

The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(\lambda) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank $2$ at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs ${<}5\%$ perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within $1.4\%$ of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity ($47\%-65\%$ fewer attention parameters at $+3.9\%$ to $+8.4\%$ PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the attention interaction matrix QK^T decomposes into a skew-symmetric routing component (redistributing information across positions) and a symmetric filtering component (scaling mutual relevance). Empirical decomposition of 1776 heads across five pretrained transformers reveals routing at low effective rank, below the capacity of the weight kernel. The authors introduce S-D attention, a parameterization that disentangles routing from filtering by construction with Re(λ) ≤ 0 stability guarantees, enabling stable training without layer normalization. In this disentangled form, routing self-organizes into a spectral cascade with effective rank 2 at layer 1 that expands with depth across model scales (7M to 355M parameters). This cascade identifies linearizable regions: linearizing the first seven layers of 125M S-D attention incurs <5% perplexity cost (vs. collapse in standard attention), with further gains from ELU+1 linear attention in early layers and parameter-efficient cascade-allocated architectures.

Significance. If the routing-filtering decomposition and spectral cascade hold without being artifacts of the S-D parameterization, the work provides a principled, actionable basis for simplifying attention in early layers while preserving performance. This could inform efficient transformer design, such as hybrid linear attention architectures that trade parameters for modest perplexity increases. The empirical scale (multiple models, heads, and scales) and the predictive use of the cascade for linearization experiments are strengths, though the diagnostic value depends on verifying that S-D matches standard attention expressivity and dynamics.

major comments (2)
  1. [S-D attention definition and stability analysis] S-D attention parameterization (abstract and methods): The claim that the skew-symmetric/symmetric split 'disentangles routing from filtering by construction' with guaranteed stability and equivalent training dynamics requires explicit verification that the constrained parameterization attains the same expressivity as unconstrained attention. Without this, the reported effective rank 2 at layer 1 and spectral expansion could be induced by the Re(λ) ≤ 0 and antisymmetry constraints rather than emerging in standard attention, undermining both the pretrained decomposition results and the linearization predictions.
  2. [Linearization and cascade-allocated architecture experiments] Linearization experiments (abstract): The result that linearizing the first seven layers costs <5% perplexity in S-D attention but collapses in standard attention is load-bearing for the cascade's predictive value. This comparison needs controls confirming that the performance gap is due to the revealed routing structure rather than differences in training stability or normalization between the two attention forms.
minor comments (2)
  1. [Abstract and experimental setup] The abstract reports decomposition across 1776 heads but lacks details on error bars, variance across runs, or precise data exclusion rules for the pretrained models; these should be added for verifiability.
  2. [Introduction or methods] Notation for the skew-symmetric and symmetric components of QK^T should be introduced with explicit matrix equations early in the paper to clarify the decomposition.
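For reference, the split requested in the second minor comment is the standard symmetric/skew-symmetric decomposition of a square matrix. The symbols M, M_skew and M_sym below are labels chosen here for illustration, not notation confirmed from the paper.

```latex
\begin{align*}
  M &= Q K^{\top}, \qquad M = M_{\mathrm{skew}} + M_{\mathrm{sym}}, \\
  M_{\mathrm{skew}} &= \tfrac{1}{2}\left(M - M^{\top}\right),
    \qquad M_{\mathrm{skew}}^{\top} = -M_{\mathrm{skew}}
    \quad \text{(routing)}, \\
  M_{\mathrm{sym}} &= \tfrac{1}{2}\left(M + M^{\top}\right),
    \qquad M_{\mathrm{sym}}^{\top} = M_{\mathrm{sym}}
    \quad \text{(filtering)}.
\end{align*}
```

The decomposition is unique: the two parts are orthogonal under the Frobenius inner product, so any square M has exactly one such pair.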

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and recommendation of major revision. We address each major comment below with clarifications drawn from the manuscript and indicate revisions where they strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: [S-D attention definition and stability analysis] S-D attention parameterization (abstract and methods): The claim that the skew-symmetric/symmetric split 'disentangles routing from filtering by construction' with guaranteed stability and equivalent training dynamics requires explicit verification that the constrained parameterization attains the same expressivity as unconstrained attention. Without this, the reported effective rank 2 at layer 1 and spectral expansion could be induced by the Re(λ) ≤ 0 and antisymmetry constraints rather than emerging in standard attention, undermining both the pretrained decomposition results and the linearization predictions.

    Authors: The decomposition of QK^T into skew-symmetric routing and symmetric filtering is performed directly on the interaction matrices extracted from 1776 heads in five standard pretrained transformers; it does not rely on the S-D parameterization at all. The low effective rank of routing is therefore an independent empirical observation. For the S-D experiments, the parameterization is introduced precisely to enforce disentanglement and stability (Re(λ) ≤ 0) while still permitting the model to reach competitive perplexity. In the revised manuscript we will add a dedicated subsection that (i) sketches how any attention matrix whose eigenvalues satisfy Re(λ) ≤ 0 can be represented in the S-D form and (ii) reports side-by-side training curves and final perplexity for S-D versus standard attention on the same data and scale, confirming that the spectral cascade is not an artifact of the constraints but emerges once routing and filtering are isolated. revision: yes

  2. Referee: [Linearization and cascade-allocated architecture experiments] Linearization experiments (abstract): The result that linearizing the first seven layers costs <5% perplexity in S-D attention but collapses in standard attention is load-bearing for the cascade's predictive value. This comparison needs controls confirming that the performance gap is due to the revealed routing structure rather than differences in training stability or normalization between the two attention forms.

    Authors: We agree that isolating the contribution of the routing cascade requires matched controls. The current experiments already train both forms from scratch on identical data and report that standard attention collapses under early-layer linearization while S-D does not; however, to further rule out normalization or stability confounds we will add, in revision, (i) a set of standard-attention runs trained without layer normalization and (ii) S-D runs with an auxiliary normalization term restored. We will also include training-loss curves and eigenvalue histograms for both families so that readers can verify that the linearization gap tracks the presence of the low-rank routing cascade rather than differences in optimization dynamics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical decomposition and diagnostic parameterization are self-contained

Full rationale

The paper decomposes QK^T in 1776 heads from five pretrained transformers to measure low-rank routing, then introduces S-D attention as a diagnostic split (skew-symmetric routing vs. symmetric filtering) that is applied to new models. The spectral cascade is reported as an observed training outcome in disentangled unnormalized S-D models across scales, with downstream predictions tested via perplexity on linearization interventions. No quoted step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the 'by construction' phrasing refers only to the definitional split itself, not to the rank or cascade measurements. External benchmarks (perplexity, parameter counts) remain independent of the internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; main structural addition is the S-D parameterization and the assumption that the skew-symmetric/symmetric split is meaningful for attention dynamics. No explicit free parameters or invented physical entities are described.

axioms (1)
  • [domain assumption] The attention interaction matrix QK^T decomposes into a skew-symmetric routing component and a symmetric filtering component.
    Foundational premise stated in the abstract that enables the entire analysis and S-D parameterization.
invented entities (1)
  • S-D attention (no independent evidence)
    purpose: Diagnostic parameterization that disentangles routing from filtering by construction while guaranteeing stability (Re(λ) ≤ 0).
    Newly introduced mechanism for the decomposition and experiments; no independent evidence outside the paper is provided in the abstract.




Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 5 internal anchors

  1. [1]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. “Attention is all you need,” in Advances in Neural Information Processing Systems 30, 2017

  2. [2]

    Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models,

    NVIDIA. “Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models,” arXiv preprint arXiv:2504.03624, 2025

  3. [3]

    Jamba: A Hybrid Transformer-Mamba Language Model

    O. Lieber, B. Lenz, H. Bata, et al. “Jamba: A hybrid transformer-Mamba language model,” arXiv preprint arXiv:2403.19887, 2024

  4. [4]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    S. De, S. L. Smith, A. Fernando, A. Botev, et al. “Griffin: Mixing gated linear recurrences with local attention for efficient language models,” arXiv preprint arXiv:2402.19427, 2024

  5. [5]

    Zamba: A compact 7B SSM hybrid model,

    P. Glorioso, Q. Anthony, Y. Tokpanov, et al. “Zamba: A compact 7B SSM hybrid model,” arXiv preprint arXiv:2405.16712, 2024

  6. [6]

    RecurrentGemma: Moving past transformers for efficient open language models,

    A. Botev, S. De, S. L. Smith, A. Fernando, et al. “RecurrentGemma: Moving past transformers for efficient open language models,” arXiv preprint arXiv:2404.07839, 2024

  7. [7]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao. “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023

  8. [8]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. “Language models are unsupervised multitask learners,” OpenAI blog, 2019

  9. [9]

    BERT: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019

  10. [10]

    Layer Normalization

    J. L. Ba, J. R. Kiros, and G. E. Hinton. “Layer normalization,” arXiv preprint, arXiv:1607.06450, 2016

  11. [11]

    Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks,

    S. Jamil, and R. Kapadia. “Interpretable Physics Extraction from Data for Linear Dynamical Systems using Lie Generator Networks,” arXiv preprint, arXiv:2603.27442, 2026

  12. [12]

    Lie Generator Networks for Nonlinear Partial Differential Equations,

    S. Jamil, and R. Kapadia. “Lie Generator Networks for Nonlinear Partial Differential Equations,” arXiv preprint, arXiv:2603.29264, 2026

  13. [13]

    OpenWebText corpus

    A. Gokaslan and V. Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  14. [14]

    Pointer sentinel mixture models,

    S. Merity, C. Xiong, J. Bradbury, and R. Socher. “Pointer sentinel mixture models,” in International Conference on Learning Representations (ICLR), 2017

  15. [15]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness,

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Advances in Neural Information Processing Systems 35, 2022

  16. [16]

    Transformers are RNNs: Fast autoregressive transformers with linear attention,

    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020

  17. [17]

    Efficiently modeling long sequences with structured state spaces,

    A. Gu, K. Goel, and C. Ré. “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations (ICLR), 2022

  18. [18]

    Gated linear attention transformers with hardware-efficient training,

    S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. “Gated linear attention transformers with hardware-efficient training,” in Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  19. [19]

    Rethinking attention with Performers,

    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, et al. “Rethinking attention with Performers,” in International Conference on Learning Representations (ICLR), 2021

  20. [20]

    RWKV: Reinventing RNNs for the transformer era,

    B. Peng, E. Alcaide, Q. Anthony, et al. “RWKV: Reinventing RNNs for the transformer era,” in Findings of the Association for Computational Linguistics: EMNLP, 2023

  21. [21]

    Retentive Network: A Successor to Transformer for Large Language Models

    Y. Sun, L. Dong, S. Huang, S. Ma, et al. “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023

  22. [22]

    Pythia: A suite for analyzing large language models across training and scaling,

    S. Biderman, H. Schoelkopf, Q. Anthony, et al. “Pythia: A suite for analyzing large language models across training and scaling,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

A Evaluation Sequences and Decomposition Protocol

The pretrained model analysis in Section 3 uses six text sequences spanning different domain...

  1. “The cat sat on the mat and then it slowly walked to the door.” (15 tokens)
  2. “Although the government proposed sweeping reforms to the healthcare system last year, the legislature has not yet passed any of the key provisions that were originally outlined in the draft bill submitted by the committee.” (38 tokens)
  3. “In fluid dynamics, the Navier–Stokes equations describe the motion of viscous fluid substances. These partial differential equations arise from applying Newton’s second law to fluid motion, together with the assumption that the stress in the fluid is the sum of a diffusing viscous term and a pressure term.” (49 tokens)
  4. “She told him that the book he had lent her, which she had finally finished reading over the weekend, was one of the most thought-provoking novels she had encountered in years.” (35 tokens)
  5. “The quick brown fox jumps over the lazy dog.” (10 tokens)
  6. “Scientists at CERN announced that the particle accelerator had produced results consistent with theoretical predictions made decades ago, confirming that the Standard Model remains robust despite numerous attempts to find physics beyond it.” (37 tokens)

All ρ and eigenvalue statistics are averaged across these six sequences per head. Token counts reflect GPT-2 BPE tokenization...