Attention Residuals
Pith reviewed 2026-05-21 06:33 UTC · model grok-4.3
The pith
Attention Residuals replace fixed residual sums with learned attention over prior layers to reduce PreNorm dilution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention Residuals (AttnRes) replace the fixed unit-weight accumulation of residual connections with softmax attention computed over the outputs of preceding layers. This lets each layer selectively and dynamically weight earlier representations according to the current input. Block AttnRes partitions the stack into blocks and attends over block representations to control memory use, while cache-based pipeline communication and two-phase computation keep overhead low. The change produces more uniform hidden-state magnitudes and gradient distributions across depth, yielding measurable gains when the method is applied at scale.
What carries the argument
Attention Residuals (AttnRes), which computes softmax attention weights over preceding layer outputs to replace fixed residual additions.
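To make the contrast with a fixed residual sum concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the `query` vector, the use of layer outputs as both keys and values, and the scaled dot-product scoring are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(layer_outputs):
    # Fixed unit-weight accumulation: every earlier output gets weight 1.
    dim = len(layer_outputs[0])
    return [sum(out[d] for out in layer_outputs) for d in range(dim)]

def attn_residual(query, layer_outputs):
    # Softmax-weighted aggregation over preceding layer outputs.
    # Assumption: outputs act as both keys and values; `query` is hypothetical.
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, out)) / math.sqrt(dim)
        for out in layer_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * out[d] for w, out in zip(weights, layer_outputs))
        for d in range(dim)
    ]
    return mixed, weights

# Toy example: three preceding layer outputs of width 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
mixed, weights = attn_residual(query, outputs)
```

Because the weights sum to one, the aggregated vector is a convex combination of the earlier outputs, whereas `standard_residual` grows with the number of layers summed.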
If this is right
- Hidden-state magnitudes remain more uniform across network depth instead of growing uncontrollably.
- Gradient norms are distributed more evenly, reducing the dilution of early-layer signals.
- Downstream task performance improves consistently when the replacement is applied at scale.
- The gains hold across different model sizes according to scaling-law experiments.
- Block AttnRes preserves most benefits while keeping memory and communication costs low enough for practical training.
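The memory argument for Block AttnRes can be illustrated with a small sketch. The summary does not say how a block representation is formed, so summing the layer outputs inside each block is an assumption here, as are the function names and the `query` vector.

```python
import math

def partition_into_blocks(layer_outputs, block_size):
    # Split the list of layer outputs into consecutive blocks.
    return [
        layer_outputs[i:i + block_size]
        for i in range(0, len(layer_outputs), block_size)
    ]

def block_representation(block):
    # Assumption: a block is summarised by the sum of its layer outputs.
    dim = len(block[0])
    return [sum(out[d] for out in block) for d in range(dim)]

def block_attn_residual(query, layer_outputs, block_size):
    # Attend over block-level representations instead of every layer output.
    reps = [
        block_representation(b)
        for b in partition_into_blocks(layer_outputs, block_size)
    ]
    dim = len(query)
    scores = [
        sum(q * r for q, r in zip(query, rep)) / math.sqrt(dim) for rep in reps
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [
        sum(w * rep[d] for w, rep in zip(weights, reps)) for d in range(dim)
    ]
    return mixed, weights

# Eight layer outputs grouped into blocks of four: two stored vectors, not eight.
layer_outputs = [[float(i), 1.0] for i in range(8)]
block_mixed, block_weights = block_attn_residual([0.0, 1.0], layer_outputs, 4)
```

With `block_size` layers per block, the number of vectors that must be kept and communicated drops from the layer count to roughly the layer count divided by `block_size`.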
Where Pith is reading between the lines
- The same selective-aggregation idea could be tested in non-transformer stacks where depth causes similar contribution decay.
- Attention entropy over layers during training would indicate how selective the mechanism actually becomes in practice.
- The method might reduce the need for extra normalization tricks when pushing transformer depth further.
- Content-dependent depth weighting could interact usefully with mixture-of-experts routing that already selects across experts.
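The entropy check suggested above is easy to state in code. This is a sketch only: the function names are hypothetical, and normalising by the log of the number of sources is one convenient convention, not something the paper specifies.

```python
import math

def attention_entropy(weights):
    # Shannon entropy (in nats) of one attention distribution over layers.
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def normalized_entropy(weights):
    # Entropy scaled to [0, 1]: 1 means uniform, 0 means fully peaked.
    n = len(weights)
    if n <= 1:
        return 0.0
    return attention_entropy(weights) / math.log(n)

uniform_score = normalized_entropy([0.25, 0.25, 0.25, 0.25])
peaked_score = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```

Values near 1 throughout training would suggest the mechanism behaves like a uniform average, while values near 0 would indicate that each layer attends to essentially one source.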
Load-bearing premise
The learned attention weights over preceding layers or blocks will remain stable and effective throughout large-scale pre-training without introducing new optimization difficulties.
What would settle it
Train two identical models to the same token count, one with standard PreNorm residuals and one with AttnRes; if the variance of layer-output magnitudes stays unchanged and downstream scores show no improvement, the central benefit is falsified.
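The magnitude half of this test could be measured along the following lines. The choice of RMS as the magnitude statistic and the helper names are assumptions; the summary only refers to "the variance of layer-output magnitudes".

```python
import math
from statistics import pvariance

def rms(vec):
    # Root-mean-square magnitude of one hidden-state vector.
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def depth_magnitude_variance(hidden_states):
    # Variance, across depth, of the per-layer RMS magnitudes.
    return pvariance([rms(h) for h in hidden_states])

def prenorm_stream(layer_outputs):
    # Toy residual stream: running unit-weight sum of layer outputs.
    stream = []
    acc = [0.0] * len(layer_outputs[0])
    for out in layer_outputs:
        acc = [a + o for a, o in zip(acc, out)]
        stream.append(acc)
    return stream

# Toy data: four identical layer outputs.
toy_outputs = [[1.0, 1.0] for _ in range(4)]
growing = depth_magnitude_variance(prenorm_stream(toy_outputs))
flat = depth_magnitude_variance(toy_outputs)
```

In this toy case the unit-weight stream has RMS magnitudes 1, 2, 3, 4 across depth, so its variance is nonzero, while a stream whose magnitude stays constant has variance zero. A real comparison would compute the same statistic on hidden states from the two trained models.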
Original abstract
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
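The dilution argument in the abstract can be written out as a short worked equation. The notation is assumed for illustration and may differ from the paper's: h_l is the hidden state after layer l, f_l is the layer function, v_i are the preceding outputs being aggregated, and s_{l,i} are unnormalised attention scores.

```latex
% Standard PreNorm residual stream: every layer output enters with weight 1.
h_L = h_0 + \sum_{l=1}^{L} f_l\big(\mathrm{Norm}(h_{l-1})\big)

% Assumed AttnRes-style aggregation over preceding outputs v_0, \dots, v_{l-1}:
\tilde{h}_l = \sum_{i=0}^{l-1} \alpha_{l,i}\, v_i,
\qquad
\alpha_{l,i} = \frac{\exp(s_{l,i})}{\sum_{j=0}^{l-1} \exp(s_{l,j})},
\qquad
\sum_{i=0}^{l-1} \alpha_{l,i} = 1

% Convexity bounds the aggregate by the largest input:
\|\tilde{h}_l\| \le \max_{0 \le i < l} \|v_i\|
```

In the first line the sum has L terms, so the stream's magnitude can grow with depth and any single term becomes a shrinking fraction of the total. In the second form the weights are nonnegative and sum to one, so the aggregate cannot exceed the largest individual input in norm.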
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Attention Residuals (AttnRes), replacing the fixed unit-weight summation of standard PreNorm residuals with softmax attention over preceding layer outputs to enable learned, input-dependent aggregation that mitigates dilution of earlier contributions. A Block AttnRes variant is introduced to reduce memory overhead by attending over block representations, supported by cache-based pipeline communication and a two-phase strategy. Scaling-law experiments, ablations, and a 48B-parameter (3B activated) pre-training run on 1.4T tokens are reported to yield more uniform output magnitudes and gradient distributions across depth, with downstream gains on all evaluated tasks.
Significance. If the uniformity gains and downstream improvements prove robust and attributable to the content-dependent weighting rather than added parameters or compute, the approach could provide a practical architectural fix for a known limitation in deep transformer training dynamics, potentially improving capacity utilization in very deep models.
major comments (2)
- [Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.
- [Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.
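The collapse scenario raised in the second major comment could be screened with a simple check on logged attention weights. This is a sketch under assumptions: the weights are ordered from earliest to most recent source, and the `threshold` of 0.95 is an arbitrary illustrative cutoff, not a value from the paper.

```python
def recent_mass(weights, recent_k=1):
    # Fraction of attention mass on the `recent_k` most recent sources.
    # Assumption: `weights` is ordered from earliest to most recent.
    if recent_k < 1:
        raise ValueError("recent_k must be at least 1")
    return sum(weights[-recent_k:]) / sum(weights)

def looks_collapsed(weights, recent_k=1, threshold=0.95):
    # Flag a distribution that puts nearly all mass on recent sources.
    # The threshold is a hypothetical cutoff chosen for illustration.
    return recent_mass(weights, recent_k) >= threshold

collapsed = looks_collapsed([0.01, 0.01, 0.98])
spread = looks_collapsed([0.3, 0.3, 0.4])
```

Tracking this fraction per layer over training would show whether weight on earlier blocks persists or decays toward the most recent block.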
minor comments (2)
- [Block AttnRes description] The description of the two-phase computation strategy and cache-based pipeline communication would benefit from a concise pseudocode or diagram to clarify implementation overhead.
- [Method section] Notation for block partitioning and attention over block representations should be made fully explicit, including how block-level outputs are cached and reused.
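The summary does not spell out what the two phases are. One plausible reading, offered purely as an assumption, is that attention over cached earlier representations and attention over local ones are computed separately and then merged exactly using running softmax statistics. The sketch below shows only that merge step; all function names are hypothetical.

```python
import math

def partial_attention(scores, values):
    # Return (max score, normalizer, unnormalized weighted sum) for one group.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(e * v[d] for e, v in zip(exps, values)) for d in range(dim)]
    return m, sum(exps), acc

def merge_partials(a, b):
    # Combine two partial results into the exact softmax average over both.
    (ma, za, acc_a), (mb, zb, acc_b) = a, b
    m = max(ma, mb)
    ca, cb = math.exp(ma - m), math.exp(mb - m)
    z = ca * za + cb * zb
    return [(ca * x + cb * y) / z for x, y in zip(acc_a, acc_b)]

def full_attention(scores, values):
    # Reference: softmax-weighted average computed in a single pass.
    _, z, acc = partial_attention(scores, values)
    return [x / z for x in acc]

# Hypothetical split: two cached sources and one local source.
cached_scores, cached_values = [0.5, -1.0], [[1.0, 0.0], [0.0, 1.0]]
local_scores, local_values = [2.0], [[1.0, 1.0]]
merged = merge_partials(
    partial_attention(cached_scores, cached_values),
    partial_attention(local_scores, local_values),
)
reference = full_attention(
    cached_scores + local_scores, cached_values + local_values
)
```

The merged result matches the single-pass reference, which is what makes it possible to compute the cached part ahead of time without changing the output.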
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the changes made to strengthen the manuscript.
point-by-point responses
-
Referee: [Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.
Authors: We agree that explicit effect sizes and matched-parameter baseline comparisons would improve clarity. In the revised manuscript we have added a table of relative improvements (with effect sizes) across the scaling-law models and confirmed that all PreNorm baselines use identical parameter counts and training budgets. For statistical significance, smaller-scale experiments include multiple random seeds with consistent trends; the 48B run is reported as a single large-scale experiment due to compute limits, but the accompanying uniformity and gradient analyses provide corroborating evidence that the gains are not artifacts of a single run.
Revision: yes
-
Referee: [Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.
Authors: This observation is correct and directly relevant to the core claim. We have added the requested quantitative checks in the revised version: attention entropy statistics across layers, histograms of attention weights on early versus recent blocks, and layer-wise contribution magnitude plots measured during the 1.4T-token pre-training. These analyses show non-collapsed distributions with meaningful weight on earlier blocks, supporting that AttnRes mitigates dilution rather than reducing to a standard residual.
Revision: yes
Circularity Check
No derivation chain; claims rest on direct pre-training and ablations
full rationale
The paper introduces AttnRes as an architectural replacement for fixed residual summation, then validates it via scaling-law experiments and 1.4T-token pre-training on the Kimi Linear model. No equations or first-principles steps are presented that reduce to fitted parameters or self-citations by construction. The uniformity and performance claims are measured outcomes, not predictions forced by the input design. A single self-reference to the authors' prior Kimi Linear work appears but is not load-bearing for the AttnRes contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard transformer architecture with PreNorm and residual connections is the baseline.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.LawOfExistence · defect_zero_iff_one · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
-
Delta Attention Residuals
Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals...
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
NavOne enables one-step global navigation planning on top-down maps using a unified multi-modal framework, achieving state-of-the-art results and up to 80x speedup on the new R2R-TopDown dataset.
-
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
NavOne reformulates vision-language navigation as single-step global path planning on top-down maps, delivering state-of-the-art results and 8x-80x speedups over prior map-based and egocentric baselines.
-
Transformers with Selective Access to Early Representations
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
-
Transformers with Selective Access to Early Representations
SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
-
XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...
-
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
-
Rethinking Cross-Layer Information Routing in Diffusion Transformers
DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
-
Attention Drift: What Autoregressive Speculative Decoding Models Learn
Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improv...
-
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
-
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
A new method accumulates historical pose features across layers in a Transformer network to reach state-of-the-art 3D human pose estimation accuracy.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
NavOne enables one-step global path planning for vision-language navigation on top-down maps via a unified neural framework, achieving SOTA among map-based methods with 8x and 80x speedups on the new R2R-TopDown dataset.
-
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
-
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
L2A achieves state-of-the-art 3D human pose estimation by maintaining consistent feature spaces across layers and adaptively aggregating historical pose representations to reuse early-layer spatial and motion cues.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
-
Hyperloop Transformers
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
-
DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
DALM is a proposed language model architecture that enforces algebraic constraints via a three-phase process over domain lattices to prevent cross-domain knowledge contamination during generation.
-
Attention Sinks and Outliers in Attention Residuals
OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.
-
BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry
BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming ...
-
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.
Contributions
The authors are listed in order of the significance of their contributions, with those in project leadership roles appearing last. Guangyu Chen∗, Yu Zhang∗, Jianlin Su∗, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yuti...