
arxiv: 2603.15031 · v1 · pith:D427NFWTnew · submitted 2026-03-16 · 💻 cs.CL

Attention Residuals

Pith reviewed 2026-05-21 06:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords attention residuals · residual connections · pre-norm · large language models · transformer architecture · gradient distribution · model scaling · block attention

The pith

Attention Residuals replace fixed residual sums with learned attention over prior layers to reduce PreNorm dilution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard residual connections in modern LLMs add every preceding layer's output with the same fixed unit weight, so hidden-state magnitudes grow with depth and each layer's contribution is progressively diluted. Attention Residuals address this by computing softmax attention over preceding layer outputs, so each layer assigns input-dependent weights to earlier representations instead of treating them uniformly. The reported result is more even output magnitudes and gradient flow across depth. A blocked variant attends only over block-level aggregates to keep memory costs manageable during large-scale training. When integrated into the 48B-parameter Kimi Linear model (3B activated parameters) and pre-trained on 1.4T tokens, the change improves results on every downstream task evaluated.

Core claim

Attention Residuals (AttnRes) replace the fixed unit-weight accumulation of residual connections with softmax attention computed over the outputs of preceding layers. This lets each layer selectively and dynamically weight earlier representations according to the current input. Block AttnRes partitions the stack into blocks and attends over block representations to control memory use, while cache-based pipeline communication and two-phase computation keep overhead low. The change produces more uniform hidden-state magnitudes and gradient distributions across depth, yielding measurable gains when the method is applied at scale.
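The contrast between unit-weight accumulation and softmax-weighted aggregation can be illustrated with a minimal pure-Python sketch. The page states only that AttnRes applies softmax attention over preceding layer outputs; the scoring rule used here (a scaled dot product between a hypothetical vector `query` and each earlier output) and all function names are illustrative assumptions, not the paper's formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual_sum(prev_outputs):
    """Standard residual stream: add every earlier output with unit weight."""
    dim = len(prev_outputs[0])
    return [sum(h[j] for h in prev_outputs) for j in range(dim)]

def attn_residual(prev_outputs, query):
    """Softmax-weighted mix of earlier outputs.

    ASSUMPTION: `query` is a hypothetical vector; in an input-dependent
    variant it would be derived from the current hidden state.
    """
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, h)) / math.sqrt(dim)
              for h in prev_outputs]
    weights = softmax(scores)
    mixed = [sum(w * h[j] for w, h in zip(weights, prev_outputs))
             for j in range(dim)]
    return mixed, weights

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

# Toy data (made up): 8 "layer outputs" that all point the same way.
outputs = [[1.0, 0.0, 0.0, 0.0] for _ in range(8)]
summed = residual_sum(outputs)
mixed, weights = attn_residual(outputs, query=[0.5, 0.0, 0.0, 0.0])
```

In this toy case the unit-weight sum has norm 8 while the softmax mix has norm 1, because softmax weights form a convex combination of the inputs. Real layer outputs are not identical, so the gap in practice depends on how aligned they are.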

What carries the argument

Attention Residuals (AttnRes), which computes softmax attention weights over preceding layer outputs to replace fixed residual additions.

If this is right

  • Hidden-state magnitudes remain more uniform across network depth instead of growing uncontrollably.
  • Gradient norms are distributed more evenly, reducing the dilution of early-layer signals.
  • Downstream task performance improves consistently when the replacement is applied at scale.
  • The gains hold across different model sizes according to scaling-law experiments.
  • Block AttnRes preserves most benefits while keeping memory and communication costs low enough for practical training.
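The first bullet can be motivated by a short bound. This is a sketch under assumptions that are not stated on this page: $h_i$ denotes the output of layer $i$, $\alpha_i$ the softmax weights, and for the growth estimate the $L$ layer outputs are taken to be mutually orthogonal with common norm $c$. The bound on the softmax mix needs no orthogonality assumption.

```latex
\begin{align*}
\text{Unit-weight sum:}\quad
  &\Bigl\|\sum_{i=1}^{L} h_i\Bigr\|_2
   = \sqrt{\sum_{i=1}^{L}\|h_i\|_2^2}
   = c\sqrt{L}
  \quad\text{(orthogonal } h_i,\ \|h_i\|_2 = c\text{)},\\
\text{Softmax mix:}\quad
  &\Bigl\|\sum_{i=1}^{L} \alpha_i h_i\Bigr\|_2
   \le \sum_{i=1}^{L}\alpha_i \|h_i\|_2
   \le \max_i \|h_i\|_2,
  \qquad \alpha_i \ge 0,\ \sum_{i=1}^{L}\alpha_i = 1.
\end{align*}
```

Under these assumptions the summed state grows like $\sqrt{L}$, whereas a convex combination can never exceed the largest individual output norm.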

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective-aggregation idea could be tested in non-transformer stacks where depth causes similar contribution decay.
  • Attention entropy over layers during training would indicate how selective the mechanism actually becomes in practice.
  • The method might reduce the need for extra normalization tricks when pushing transformer depth further.
  • Content-dependent depth weighting could interact usefully with mixture-of-experts routing that already selects across experts.
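The attention-entropy diagnostic mentioned in the list above could be computed along the following lines. This is a hypothetical helper, not something the paper or Pith provides; the function names and the example weight vectors are made up for illustration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over layers."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def normalized_entropy(weights):
    """Entropy divided by log(n): near 1 is uniform, 0 is one-hot."""
    n = len(weights)
    if n < 2:
        return 0.0
    return attention_entropy(weights) / math.log(n)

# Illustrative distributions over four preceding layers (made-up values).
uniform = [0.25, 0.25, 0.25, 0.25]   # no selectivity at all
collapsed = [0.0, 0.0, 0.0, 1.0]     # all weight on the most recent layer
```

A normalized entropy near 1 would mean the mechanism behaves like a plain average, and a value near 0 would mean it has collapsed onto a single layer. Values in between are where selective aggregation would show up.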

Load-bearing premise

The learned attention weights over preceding layers or blocks will remain stable and effective throughout large-scale pre-training without introducing new optimization difficulties.

What would settle it

Train two identical models to the same token count, one with standard PreNorm residuals and one with AttnRes; if the variance of layer-output magnitudes stays unchanged and downstream scores show no improvement, the central benefit is falsified.
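One way to phrase this check in code is sketched below. It assumes per-layer output vectors have already been collected from both runs; the RMS magnitude measure, the tolerance `tol`, and the function names are assumptions made for illustration, and the numbers are toy values rather than results from the paper.

```python
import math
import statistics

def rms(vec):
    """Root-mean-square magnitude of one layer output."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def magnitude_spread(layer_outputs):
    """Population variance of per-layer RMS magnitudes across depth."""
    return statistics.pvariance([rms(h) for h in layer_outputs])

def benefit_falsified(base_layers, attn_layers, base_score, attn_score, tol=0.0):
    """True when neither the magnitude spread nor the downstream score improves.

    ASSUMPTION: "unchanged" is read as "not smaller by more than tol".
    """
    spread_unchanged = (magnitude_spread(attn_layers)
                        >= magnitude_spread(base_layers) - tol)
    no_gain = attn_score <= base_score + tol
    return spread_unchanged and no_gain

# Toy layer outputs (made up): the baseline grows with depth, the other is flat.
baseline_layers = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
attnres_layers = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
```

With these toy inputs the flat run has zero spread, so the check does not report falsification when its score is also higher. Comparing the baseline against itself with equal scores does report it.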

Original abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
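The block idea in the abstract can be sketched as follows. The abstract says only that layers are partitioned into blocks and that attention runs over block-level representations; summing the layer outputs inside each block, the scaled dot-product scoring, and the names `block_representations` and `block_attn_residual` are assumptions made for this illustration. The cache-based pipeline communication and two-phase computation are not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_representations(layer_outputs, block_size):
    """Collapse each run of `block_size` layers into one vector.

    ASSUMPTION: the block representation is the sum of its layer outputs.
    """
    dim = len(layer_outputs[0])
    blocks = []
    for start in range(0, len(layer_outputs), block_size):
        chunk = layer_outputs[start:start + block_size]
        blocks.append([sum(h[j] for h in chunk) for j in range(dim)])
    return blocks

def block_attn_residual(layer_outputs, query, block_size):
    """Softmax attention over block vectors instead of every layer output."""
    blocks = block_representations(layer_outputs, block_size)
    dim = len(query)
    scores = [sum(q * b for q, b in zip(query, blk)) / math.sqrt(dim)
              for blk in blocks]
    weights = softmax(scores)
    mixed = [sum(w * blk[j] for w, blk in zip(weights, blocks))
             for j in range(dim)]
    return mixed, weights

# Toy data (made up): 12 layer outputs grouped into blocks of 4.
layers = [[float(i), 1.0] for i in range(12)]
mixed, weights = block_attn_residual(layers, query=[0.0, 1.0], block_size=4)
```

Here attention runs over 3 block vectors rather than 12 layer outputs, which is the kind of reduction in stored representations that the abstract credits for the lower memory footprint.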

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Attention Residuals (AttnRes), replacing the fixed unit-weight summation of standard PreNorm residuals with softmax attention over preceding layer outputs to enable learned, input-dependent aggregation that mitigates dilution of earlier contributions. A Block AttnRes variant is introduced to reduce memory overhead by attending over block representations, supported by cache-based pipeline communication and a two-phase strategy. Scaling-law experiments, ablations, and a 48B-parameter (3B activated) pre-training run on 1.4T tokens are reported to yield more uniform output magnitudes and gradient distributions across depth, with downstream gains on all evaluated tasks.

Significance. If the uniformity gains and downstream improvements prove robust and attributable to the content-dependent weighting rather than added parameters or compute, the approach could provide a practical architectural fix for a known limitation in deep transformer training dynamics, potentially improving capacity utilization in very deep models.

major comments (2)
  1. [Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.
  2. [Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.
minor comments (2)
  1. [Block AttnRes description] The description of the two-phase computation strategy and cache-based pipeline communication would benefit from a concise pseudocode or diagram to clarify implementation overhead.
  2. [Method section] Notation for block partitioning and attention over block representations should be made fully explicit, including how block-level outputs are cached and reused.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the changes made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.

    Authors: We agree that explicit effect sizes and matched-parameter baseline comparisons would improve clarity. In the revised manuscript we have added a table of relative improvements (with effect sizes) across the scaling-law models and confirmed that all PreNorm baselines use identical parameter counts and training budgets. For statistical significance, smaller-scale experiments include multiple random seeds with consistent trends; the 48B run is reported as a single large-scale experiment due to compute limits, but the accompanying uniformity and gradient analyses provide corroborating evidence that the gains are not artifacts of a single run.

    Revision: yes

  2. Referee: [Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.

    Authors: This observation is correct and directly relevant to the core claim. We have added the requested quantitative checks in the revised version: attention entropy statistics across layers, histograms of attention weights on early versus recent blocks, and layer-wise contribution magnitude plots measured during the 1.4T-token pre-training. These analyses show non-collapsed distributions with meaningful weight on earlier blocks, supporting that AttnRes mitigates dilution rather than reducing to a standard residual.

    Revision: yes

Circularity Check

0 steps flagged

No derivation chain; claims rest on direct pre-training and ablations

Full rationale

The paper introduces AttnRes as an architectural replacement for fixed residual summation, then validates it via scaling-law experiments and 1.4T-token pre-training on the Kimi Linear model. No equations or first-principles steps are presented that reduce to fitted parameters or self-citations by construction. The uniformity and performance claims are measured outcomes, not predictions forced by the input design. A single self-reference to the authors' prior Kimi Linear work appears but is not load-bearing for the AttnRes contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the standard transformer assumptions of PreNorm and additive residuals; the attention weights themselves are learned parameters rather than hand-chosen free parameters. No new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: Standard transformer architecture with PreNorm and residual connections is the baseline.
    The paper treats these as given in modern LLMs and measures improvement relative to them.

pith-pipeline@v0.9.0 · 5894 in / 1241 out tokens · 46644 ms · 2026-05-21T06:33:10.060969+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

    cs.CL 2026-05 conditional novelty 7.0

    Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.

  2. Delta Attention Residuals

    cs.LG 2026-05 unverdicted novelty 7.0

    Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals...

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

    cs.LG 2026-05 unverdicted novelty 7.0

    Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

  5. NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

    cs.CV 2026-05 unverdicted novelty 7.0

    NavOne enables one-step global navigation planning on top-down maps using a unified multi-modal framework, achieving state-of-the-art results and up to 80x speedup on the new R2R-TopDown dataset.

  6. NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

    cs.CV 2026-05 unverdicted novelty 7.0

    NavOne reformulates vision-language navigation as single-step global path planning on top-down maps, delivering state-of-the-art results and 8x-80x speedups over prior map-based and egocentric baselines.

  7. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

  8. Transformers with Selective Access to Early Representations

    cs.LG 2026-05 unverdicted novelty 7.0

    SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.

  9. Gradient Boosting within a Single Attention Layer

    cs.LG 2026-04 conditional novelty 7.0

    Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...

  10. XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation

    cs.CV 2026-03 unverdicted novelty 7.0

    XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...

  11. Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

    cs.LG 2026-05 conditional novelty 6.0

    Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.

  12. Rethinking Cross-Layer Information Routing in Diffusion Transformers

    cs.CV 2026-05 conditional novelty 6.0

    DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

  13. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.

  14. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...

  15. Attention Drift: What Autoregressive Speculative Decoding Models Learn

    cs.LG 2026-05 unverdicted novelty 6.0

    Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improv...

  16. RigidFormer: Learning Rigid Dynamics using Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

  17. L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

    cs.CV 2026-05 unverdicted novelty 6.0

    A new method accumulates historical pose features across layers in a Transformer network to reach state-of-the-art 3D human pose estimation accuracy.

  18. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  19. NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

    cs.CV 2026-05 unverdicted novelty 6.0

    NavOne enables one-step global path planning for vision-language navigation on top-down maps via a unified neural framework, achieving SOTA among map-based methods with 8x and 80x speedups on the new R2R-TopDown dataset.

  20. When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...

  21. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  22. L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

    cs.CV 2026-05 unverdicted novelty 5.0

    L2A achieves state-of-the-art 3D human pose estimation by maintaining consistent feature spaces across layers and adaptively aggregating historical pose representations to reuse early-layer spatial and motion cues.

  23. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 5.0

    Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

  24. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  25. DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

    cs.CL 2026-04 unverdicted novelty 5.0

    DALM is a proposed language model architecture that enforces algebraic constraints via a three-phase process over domain lattices to prevent cross-domain knowledge contamination during generation.

  26. Attention Sinks and Outliers in Attention Residuals

    cs.LG 2026-05 unverdicted novelty 4.0

    OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.

  27. BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry

    quant-ph 2026-05 unverdicted novelty 4.0

    BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming ...

  28. A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma

    cs.AI 2026-05 unverdicted novelty 3.0

    AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 21 Pith papers · 31 internal anchors

  1. [1]

    Jacob Austin et al.Program Synthesis with Large Language Models. 2021. arXiv: 2108.07732 [cs.PL].URL: https://arxiv.org/abs/2108.07732

  2. [2]

    Thomas Bachlechner et al.ReZero is All You Need: Fast Convergence at Large Depth. 2020. arXiv:2003.04887 [cs.LG].URL:https://arxiv.org/abs/2003.04887

  3. [3]

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.Neural Machine Translation by Jointly Learning to Align and Translate. 2016. arXiv:1409.0473 [cs.CL].URL:https://arxiv.org/abs/1409.0473

  4. [4]

    Chen Chen and Lai Wei.Post-LayerNorm Is Back: Stable, ExpressivE, and Deep. 2026. arXiv: 2601.19895 [cs.LG].URL:https://arxiv.org/abs/2601.19895

  5. [5]

    Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv: 2107.03374 [cs.LG]. URL:https://arxiv.org/abs/2107.03374

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark et al. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In: arXiv:1803.05457v1(2018)

  7. [7]

    Karl Cobbe et al.Training Verifiers to Solve Math Word Problems. 2021. arXiv:2110.14168 [cs.LG].URL: https://arxiv.org/abs/2110.14168

  8. [8]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality”. In:CoRRabs/2405.21060 (2024).DOI: 10.48550/ARXIV.2405.21060. arXiv: 2405.21060.URL:https://doi.org/10.48550/arXiv.2405.21060

  9. [9]

    DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025. arXiv: 2412.19437 [cs.CL].URL: https://arxiv. org/abs/2412.19437

  10. [10]

    Yanwen Fang et al.Cross-Layer Retrospective Retrieving via Layer Attention. 2023. arXiv: 2302 . 03985 [cs.CV].URL:https://arxiv.org/abs/2302.03985

  11. [11]

    Andrey Gromov et al.The Unreasonable Ineffectiveness of the Deeper Layers. 2025. arXiv: 2403 . 17887 [cs.CL].URL:https://arxiv.org/abs/2403.17887

  12. [12]

    Kaiming He et al.Deep Residual Learning for Image Recognition. 2015. arXiv: 1512.03385 [cs.CV].URL: https://arxiv.org/abs/1512.03385

  13. [13]

    Dan Hendrycks et al.Measuring Massive Multitask Language Understanding. 2021. arXiv: 2009 . 03300 [cs.CY].URL:https://arxiv.org/abs/2009.03300

  14. [14]

    Dan Hendrycks et al.Measuring Mathematical Problem Solving With the MATH Dataset. 2021. arXiv: 2103. 03874 [cs.LG].URL:https://arxiv.org/abs/2103.03874

  15. [15]

    Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv:2203.15556 [cs.CL]. URL:https://arxiv.org/abs/2203.15556

  16. [16]

    Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024. arXiv:2404.06395 [cs.CL].URL:https://arxiv.org/abs/2404.06395

  17. [17]

    Gao Huang et al.Densely Connected Convolutional Networks. 2018. arXiv: 1608 . 06993 [cs.CV].URL: https://arxiv.org/abs/1608.06993

  18. [18]

    GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

    Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”. In: Advances in NeurIPS. 2019

  19. [19]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang et al. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models”. In: Advances in NeurIPS36 (2023), pp. 62991–63010

  20. [20]

    Adaptive Mixtures of Local Experts

    Robert A. Jacobs et al. “Adaptive Mixtures of Local Experts”. In:Neural Computation3.1 (1991), pp. 79–87. DOI:10.1162/neco.1991.3.1.79

  21. [21]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi et al. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension”. In:arXiv preprint arXiv:1705.03551(2017)

  22. [22]

    Jared Kaplan et al.Scaling Laws for Neural Language Models. 2020. arXiv: 2001 . 08361 [cs.LG].URL: https://arxiv.org/abs/2001.08361

  23. [23]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

    Angelos Katharopoulos et al. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”. In:Proceedings of ICML. Ed. by Hal Daumé III and Aarti Singh. PMLR, 2020, pp. 5156–5165.URL: https: //proceedings.mlr.press/v119/katharopoulos20a.html

  24. [24]

    Jonas Knupp et al.Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves. 2026. arXiv:2601.21582 [cs.AI].URL:https://arxiv.org/abs/2601.21582

  25. [25]

    Aitor Lewkowycz et al.Solving Quantitative Reasoning Problems with Language Models. 2022. arXiv: 2206. 14858 [cs.CL].URL:https://arxiv.org/abs/2206.14858. 17 Attention ResidualsTECHNICALREPORT

  26. [26]

    CMMLU: Measuring massive multitask language understanding in Chinese

    Haonan Li et al. “CMMLU: Measuring massive multitask language understanding in Chinese”. In:Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 11260–11285.DOI: 10 . 18653 / v1 / 2024 . findings - acl . 671.UR...

  27. [27]

    Tianyu Li et al.SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm. 2026. arXiv: 2602.08064 [cs.LG].URL:https://arxiv.org/abs/2602.08064

  28. [28]

    Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG] .URL: https: //arxiv.org/abs/2502.16982

  29. [29]

    Brian Mak and Jeffrey Flanigan.Residual Matrix Transformers: Scaling the Size of the Residual Stream. 2025. arXiv:2506.22696 [cs.LG].URL:https://arxiv.org/abs/2506.22696

  30. [30]

    Gaurav Menghani, Ravi Kumar, and Sanjiv Kumar.LAuReL: Learned Augmented Residual Layer. 2025. arXiv: 2411.07501 [cs.LG].URL:https://arxiv.org/abs/2411.07501

  31. [31]

    Maxim Milakov and Natalia Gimelshein.Online normalizer calculation for softmax. 2018. arXiv: 1805.02867 [cs.PF].URL:https://arxiv.org/abs/1805.02867

  32. [32]

    Metalearned Neural Memory

    Tsendsuren Munkhdalai et al. “Metalearned Neural Memory”. In:ArXivabs/1907.09720 (2019).URL: https: //api.semanticscholar.org/CorpusID:198179407

  33. [33]

    Deepak Narayanan et al.Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

  34. [34]

    arXiv:2104.04473 [cs.CL].URL:https://arxiv.org/abs/2104.04473

  35. [35]

    Transformers without Tears: Improving the Normalization of Self- Attention

    Toan Q. Nguyen and Julian Salazar. “Transformers without Tears: Improving the Normalization of Self- Attention”. In:Proceedings of IWSLT. Ed. by Jan Niehues et al. 2019.URL: https : / / aclanthology . org/2019.iwslt-1.17/

  36. [36]

    OpenAI et al.GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL].URL: https://arxiv.org/abs/ 2303.08774

  37. [37]

    Matteo Pagliardini et al.DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging. 2024. arXiv:2402.02622 [cs.CL].URL:https://arxiv.org/abs/2402.02622

  38. [38]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng et al. “Yarn: Efficient context window extension of large language models”. In:arXiv preprint arXiv:2309.00071(2023)

  39. [39]

    Deep Contextualized Word Representations

    Matthew E. Peters et al. “Deep Contextualized Word Representations”. In:Proceedings of NAACL. 2018, pp. 2227–2237.URL:https://aclanthology.org/N18-1202/

  40. [40]

    Reiner Pope et al.Efficiently Scaling Transformer Inference. 2022. arXiv:2211.05102 [cs.LG]

  41. [41]

    Zhen Qin et al.HGRN2: Gated Linear RNNs with State Expansion. 2024. arXiv:2404.07904 [cs.CL]

  42. [42]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein et al. “Gpqa: A graduate-level google-proof q&a benchmark”. In:First Conference on Language Modeling. 2024

  43. [43]

    Linear Transformers Are Secretly Fast Weight Program- mers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. “Linear Transformers Are Secretly Fast Weight Program- mers”. In:Proceedings of ICML. Ed. by Marina Meila and Tong Zhang. PMLR, 2021, pp. 9355–9366.URL: https://proceedings.mlr.press/v139/schlag21a.html

  44. [44]

    Learning to control fast-weight memories: An alternative to dynamic recurrent networks

    Jürgen Schmidhuber. “Learning to control fast-weight memories: An alternative to dynamic recurrent networks”. In:Neural Computation4.1 (1992), pp. 131–139

  45. [45]

    Freda Shi et al.Language Models are Multilingual Chain-of-Thought Reasoners. 2022. arXiv: 2210.03057 [cs.CL].URL:https://arxiv.org/abs/2210.03057

  46. [46]

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber.Highway Networks. 2015. arXiv: 1505.00387 [cs.LG].URL:https://arxiv.org/abs/1505.00387

  47. [47]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun et al. “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”. In:ArXivabs/2407.04620 (2024).URL:https://api.semanticscholar.org/CorpusID:271039606

  48. [48]

    Yutao Sun et al.Retentive Network: A Successor to Transformer for Large Language Models. 2023. arXiv: 2307.08621 [cs.CL]

  49. [49]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun et al. “Challenging big-bench tasks and whether chain-of-thought can solve them”. In:arXiv preprint arXiv:2210.09261(2022)

  50. [50]

    Shawn Tan et al. “Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study”. In: Proceedings of ICLR. 2025

  51. [51]

    Hugo Touvron et al. Going deeper with Image Transformers. 2021. arXiv: 2103.17239 [cs.CV]. URL: https://arxiv.org/abs/2103.17239

  52. [52]

    Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv: 2302.13971 [cs.CL].

  53. [53]

    Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  54. [54]

    Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  55. [55]

    Hongyu Wang et al. DeepNet: Scaling Transformers to 1,000 Layers. 2022. arXiv: 2203.00555 [cs.CL]. URL: https://arxiv.org/abs/2203.00555

  56. [56]

    Yubo Wang et al. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark”. In: Advances in NeurIPS 37 (2024), pp. 95266–95290

  57. [57]

    Da Xiao et al. “MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections”. In: Proceedings of ICML. 2025

  58. [58]

    Guangxuan Xiao et al. “Efficient Streaming Language Models with Attention Sinks”. In: arXiv preprint arXiv:2309.17453 (2023)

  59. [59]

    Tian Xie. Your DeepSeek mHC Might Not Need the “m”. Zhihu blog post. 2026. URL: https://zhuanlan.zhihu.com/p/2010852389670908320

  60. [60]

    Zhenda Xie et al. mHC: Manifold-Constrained Hyper-Connections. 2026. arXiv: 2512.24880 [cs.CL]. URL: https://arxiv.org/abs/2512.24880

  61. [61]

    Ruibin Xiong et al. On Layer Normalization in the Transformer Architecture. 2020. arXiv: 2002.04745 [cs.LG]. URL: https://arxiv.org/abs/2002.04745

  62. [62]

    Bowen Yang et al. Rope to Nope and Back Again: A New Hybrid Attention Strategy. 2025. arXiv: 2501.18795 [cs.CL]. URL: https://arxiv.org/abs/2501.18795

  63. [63]

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. “Gated Delta Networks: Improving Mamba2 with Delta Rule”. In: Proceedings of ICLR. 2025. URL: https://openreview.net/forum?id=r8H7xhYPwz

  64. [64]

    Songlin Yang et al. “Gated Linear Attention Transformers with Hardware-Efficient Training”. In: Proceedings of ICML. PMLR, 2024

  65. [65]

    Yongyi Yang and Jianyang Gao. mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations. 2026. arXiv: 2601.05732 [cs.LG]. URL: https://arxiv.org/abs/2601.05732

  66. [66]

    Rowan Zellers et al. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

  67. [67]

    Biao Zhang and Rico Sennrich. “Root mean square layer normalization”. In: Advances in NeurIPS 32 (2019)

  68. [68]

    Yifan Zhang et al. Deep Delta Learning. 2026. arXiv: 2601.00417 [cs.LG]. URL: https://arxiv.org/abs/2601.00417

  69. [69]

    Yilang Zhang et al. ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling. 2026. arXiv: 2602.09009 [cs.LG]. URL: https://arxiv.org/abs/2602.09009

  70. [70]

    Yu Zhang et al. Kimi Linear: An Expressive, Efficient Attention Architecture. 2025. arXiv: 2510.26692 [cs.CL]

  71. [71]

    Shu Zhong et al. Understanding Transformer from the Perspective of Associative Memory. 2025. arXiv: 2505.19488 [cs.LG]. URL: https://arxiv.org/abs/2505.19488

  72. [72]

    Zhanchao Zhou et al. “Value Residual Learning”. In: Proceedings of ACL. Ed. by Wanxiang Che et al. Vienna, Austria, 2025, pp. 28341–28356. URL: https://aclanthology.org/2025.acl-long.1375/

  73. [73]

    Defa Zhu et al. Hyper-Connections. 2025. arXiv: 2409.19606 [cs.LG]. URL: https://arxiv.org/abs/2409.19606

  74. [74]

    Zhijian Zhuo et al. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. 2025. arXiv: 2503.04598 [cs.CL]. URL: https://arxiv.org/abs/2503.04598

A Contributions

The authors are listed in order of the significance of their contributions, with those in project leadership roles appearing last.

Guangyu Chen∗, Yu Zhang∗, Jianlin Su∗, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yuti...