Attention Residuals

Bohong Yin; Chao Hong; Enzhe Lu; Fanqing Meng; Guanduo Chen; Guokun Lai; Haiqing Guo; Haoyu Lu; Jianlin Su; Jinguo Zhu

arxiv: 2603.15031 · v1 · pith:D427NFWTnew · submitted 2026-03-16 · 💻 cs.CL

Kimi Team: Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou

Pith reviewed 2026-05-21 06:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords attention residuals · residual connections · pre-norm · large language models · transformer architecture · gradient distribution · model scaling · block attention

The pith

Attention Residuals replace fixed residual sums with learned attention over prior layers to reduce PreNorm dilution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard PreNorm residual connections in modern LLMs add every preceding layer output with a fixed weight of one, so hidden states grow with depth and each layer's contribution is progressively diluted. Attention Residuals address this by computing softmax attention over preceding layer outputs, so each layer assigns learned, input-dependent weights to earlier representations instead of treating them uniformly. The reported result is more uniform output magnitudes and gradient distribution across depth. A blocked variant attends only over block-level representations to keep memory and communication costs manageable during large-scale training. When integrated into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-trained on 1.4T tokens, the change improves performance on every downstream task evaluated.

Core claim

Attention Residuals (AttnRes) replace the fixed unit-weight accumulation of residual connections with softmax attention computed over the outputs of preceding layers. This lets each layer selectively and dynamically weight earlier representations according to the current input. Block AttnRes partitions the stack into blocks and attends over block representations to control memory use, while cache-based pipeline communication and two-phase computation keep overhead low. The change produces more uniform hidden-state magnitudes and gradient distributions across depth, yielding measurable gains when the method is applied at scale.
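The contrast in the core claim can be made concrete with a minimal sketch. It is not the paper's implementation: the scoring rule (a dot product against a hypothetical per-layer `query` vector) and the function names are assumptions chosen for illustration, since the summary above does not say how the attention scores are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(layer_outputs):
    """Fixed unit-weight accumulation: every earlier output gets weight 1."""
    dim = len(layer_outputs[0])
    return [sum(out[d] for out in layer_outputs) for d in range(dim)]


def attn_residual(layer_outputs, query):
    """Softmax-weighted aggregation over earlier layer outputs.

    Assumption: scores are dot products between `query` (a hypothetical
    learned vector) and each earlier output, so the weights depend on the
    current input through the outputs themselves.
    """
    scores = [sum(q * x for q, x in zip(query, out)) for out in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, layer_outputs))
             for d in range(dim)]
    return mixed, weights


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three earlier layer outputs
query = [0.5, -0.5]                             # hypothetical learned query
print(standard_residual(outputs))               # [2.0, 2.0]
mixed, weights = attn_residual(outputs, query)
print(weights)                                  # sums to 1, largest on outputs[0]
```

Because the softmax weights form a convex combination, the aggregated vector in this sketch cannot exceed the largest input norm, whereas the unit-weight sum keeps growing as layers are added.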

What carries the argument

Attention Residuals (AttnRes), which computes softmax attention weights over preceding layer outputs to replace fixed residual additions.

If this is right

Hidden-state magnitudes remain more uniform across network depth instead of growing uncontrollably.
Gradient norms are distributed more evenly, reducing the dilution of early-layer signals.
Downstream task performance improves consistently when the replacement is applied at scale.
The gains hold across different model sizes according to scaling-law experiments.
Block AttnRes preserves most benefits while keeping memory and communication costs low enough for practical training.
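The memory argument behind the blocked variant can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: each block is summarized here by the sum of its layers' outputs, and `block_size` and `query` are hypothetical names. The point is only that attention runs over one vector per block rather than one per layer.

```python
import math


def block_representations(layer_outputs, block_size):
    """Collapse consecutive layers into one vector per block.

    Assumption: a block is summarized by summing its layers' outputs.
    """
    dim = len(layer_outputs[0])
    blocks = []
    for start in range(0, len(layer_outputs), block_size):
        chunk = layer_outputs[start:start + block_size]
        blocks.append([sum(out[d] for out in chunk) for d in range(dim)])
    return blocks


def attend_over_blocks(blocks, query):
    """Softmax attention over block vectors (dot-product scores assumed)."""
    scores = [sum(q * b for q, b in zip(query, blk)) for blk in blocks]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(blocks[0])
    mixed = [sum(w * blk[d] for w, blk in zip(weights, blocks))
             for d in range(dim)]
    return mixed, weights


layer_outputs = [[float(i), 1.0] for i in range(8)]  # 8 toy layer outputs
blocks = block_representations(layer_outputs, block_size=4)
print(len(layer_outputs), "->", len(blocks))         # 8 -> 2
mixed, weights = attend_over_blocks(blocks, query=[0.1, 0.0])
```

With L layers and blocks of size B, the number of vectors kept for attention drops from L to roughly L / B, which is the source of the memory saving the summary describes.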

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-aggregation idea could be tested in non-transformer stacks where depth causes similar contribution decay.
Attention entropy over layers during training would indicate how selective the mechanism actually becomes in practice.
The method might reduce the need for extra normalization tricks when pushing transformer depth further.
Content-dependent depth weighting could interact usefully with mixture-of-experts routing that already selects across experts.
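The entropy diagnostic suggested in the list above could be computed along these lines. This is a sketch of a standard Shannon-entropy measure, not something taken from the paper; the function names and the normalization by log(n) are illustrative choices.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over depth."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def normalized_entropy(weights):
    """Entropy scaled to [0, 1]: near 1 is uniform, 0 is fully collapsed."""
    n = len(weights)
    if n < 2:
        return 0.0
    return attention_entropy(weights) / math.log(n)


uniform = [0.25, 0.25, 0.25, 0.25]   # weight spread evenly over 4 sources
collapsed = [0.0, 0.0, 0.0, 1.0]     # all weight on the most recent source
print(normalized_entropy(uniform))    # approximately 1.0
print(normalized_entropy(collapsed))  # zero
```

Tracking this value per layer over training steps would show whether the mechanism stays selective, drifts toward uniform averaging, or collapses onto a single source.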

Load-bearing premise

The learned attention weights over preceding layers or blocks will remain stable and effective throughout large-scale pre-training without introducing new optimization difficulties.

What would settle it

Train two identical models to the same token count, one with standard PreNorm residuals and one with AttnRes; if the variance of layer-output magnitudes stays unchanged and downstream scores show no improvement, the central benefit is falsified.
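One way to make the magnitude half of this test measurable is sketched below. It assumes that "magnitude" means the L2 norm of each layer's output and that spread is summarized by the population variance across layers; the page does not state the paper's exact metric, so both choices and the function names are assumptions.

```python
import math
import statistics


def l2_norm(vec):
    """Euclidean norm of one layer-output vector."""
    return math.sqrt(sum(x * x for x in vec))


def magnitude_variance(layer_outputs):
    """Population variance of per-layer output norms across depth.

    Assumption: lower variance is read as more uniform magnitudes.
    """
    norms = [l2_norm(out) for out in layer_outputs]
    return statistics.pvariance(norms)


uniform = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]  # every layer has norm 5
growing = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # norms 1, 2, 3
print(magnitude_variance(uniform))               # 0.0
print(magnitude_variance(growing))               # greater than 0
```

Comparing this number between the baseline and the AttnRes run, at matched token counts, gives a single scalar for that half of the comparison; the downstream scores cover the other half.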

Original abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AttnRes replaces fixed residuals with learned attention over layers to fix dilution, shows uniformity and task gains in a 48B model, but lacks direct checks on whether attention actually stays distributed.

Colleague,

The key takeaway is that Attention Residuals (AttnRes) swaps the usual fixed residual addition for a softmax attention over earlier layer outputs, aiming to prevent the dilution of contributions as depth increases in PreNorm setups. They also offer a blocked version to keep it practical.

What the paper does well is motivate the change from the accumulation problem and then implement it efficiently with Block AttnRes, using caches and phased computation to minimize extra cost. The scaling law tests across different sizes back up that the gains are consistent, and the ablations check the content-dependent aspect. Most importantly, they actually train a 48B-parameter model on 1.4T tokens using this in their Kimi Linear setup, reporting more even output magnitudes, better gradient distribution, and improvements on downstream tasks. This is a logical step from existing residual and attention ideas, and showing it at that scale gives it some weight. The uniformity metrics they track address a concrete pathology in deep transformers.

Where it could be stronger is in the details around the evidence. The abstract mentions the gains but without specific numbers on baselines or controls, it's tough to gauge how substantial the effect is. The concern that the attention might collapse to recent layers, turning it into something close to standard residuals with added overhead, isn't explicitly ruled out by checks like attention entropy or layer contribution analysis. If those are in the full paper, great; if not, it leaves room for doubt on whether the mechanism works as intended or if the benefits come from elsewhere.

This kind of paper is for people in the LLM architecture and training dynamics area. A reader interested in scaling fixes or alternative residual designs would get value from the proposal and the large run. It has enough substance to warrant a serious referee, as the central problem is real and the experiment is at a relevant size. I'd recommend sending it out for peer review with requests for more mechanistic validation.

Cheers,

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Attention Residuals (AttnRes), replacing the fixed unit-weight summation of standard PreNorm residuals with softmax attention over preceding layer outputs to enable learned, input-dependent aggregation that mitigates dilution of earlier contributions. A Block AttnRes variant is introduced to reduce memory overhead by attending over block representations, supported by cache-based pipeline communication and a two-phase strategy. Scaling-law experiments, ablations, and a 48B-parameter (3B activated) pre-training run on 1.4T tokens are reported to yield more uniform output magnitudes and gradient distributions across depth, with downstream gains on all evaluated tasks.

Significance. If the uniformity gains and downstream improvements prove robust and attributable to the content-dependent weighting rather than added parameters or compute, the approach could provide a practical architectural fix for a known limitation in deep transformer training dynamics, potentially improving capacity utilization in very deep models.

major comments (2)

[Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.
[Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.
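The collapse concern in the second major comment can be turned into a simple number: the share of attention weight that lands on the most recent sources. The sketch below is illustrative only; `recent_mass` and the choice of `k` are hypothetical and do not come from the manuscript.

```python
def recent_mass(weights, k):
    """Fraction of total attention weight on the k most recent sources.

    Assumption: `weights` is ordered from earliest to most recent, so the
    last k entries are the most recent layers or blocks.
    """
    if not 0 < k <= len(weights):
        raise ValueError("k must be between 1 and len(weights)")
    return sum(weights[-k:]) / sum(weights)


spread = [0.25, 0.25, 0.25, 0.25]   # weight shared across all sources
collapsed = [0.0, 0.0, 0.0, 1.0]    # everything on the latest source
print(recent_mass(spread, 1))        # 0.25
print(recent_mass(collapsed, 1))     # 1.0
```

A value close to 1 for small `k`, sustained across layers and training steps, would indicate the collapse the comment describes; values well below 1 would indicate that earlier blocks still receive meaningful weight.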

minor comments (2)

[Block AttnRes description] The description of the two-phase computation strategy and cache-based pipeline communication would benefit from a concise pseudocode or diagram to clarify implementation overhead.
[Method section] Notation for block partitioning and attention over block representations should be made fully explicit, including how block-level outputs are cached and reused.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the changes made to strengthen the manuscript.

Referee: [Abstract and scaling-law experiments] Abstract and scaling-law section: the claim of consistent improvement across model sizes and better downstream performance lacks any reported effect sizes, baseline comparisons (e.g., standard PreNorm with matched parameter count), or statistical significance tests, leaving the magnitude and reliability of the gains unclear.

Authors: We agree that explicit effect sizes and matched-parameter baseline comparisons would improve clarity. In the revised manuscript we have added a table of relative improvements (with effect sizes) across the scaling-law models and confirmed that all PreNorm baselines use identical parameter counts and training budgets. For statistical significance, smaller-scale experiments include multiple random seeds with consistent trends; the 48B run is reported as a single large-scale experiment due to compute limits, but the accompanying uniformity and gradient analyses provide corroborating evidence that the gains are not artifacts of a single run. revision: yes
Referee: [Ablations and pre-training results] Ablations and uniformity results: while ablations are stated to validate content-dependent selection, no quantitative checks are provided on attention entropy, weight distributions over early vs. recent blocks, or layer-wise contribution magnitudes during the 1.4T-token run. This directly bears on the central claim, as collapse to recent blocks would reduce AttnRes to a standard residual plus overhead without dilution mitigation.

Authors: This observation is correct and directly relevant to the core claim. We have added the requested quantitative checks in the revised version: attention entropy statistics across layers, histograms of attention weights on early versus recent blocks, and layer-wise contribution magnitude plots measured during the 1.4T-token pre-training. These analyses show non-collapsed distributions with meaningful weight on earlier blocks, supporting that AttnRes mitigates dilution rather than reducing to a standard residual. revision: yes

Circularity Check

0 steps flagged

No derivation chain; claims rest on direct pre-training and ablations

The paper introduces AttnRes as an architectural replacement for fixed residual summation, then validates it via scaling-law experiments and 1.4T-token pre-training on the Kimi Linear model. No equations or first-principles steps are presented that reduce to fitted parameters or self-citations by construction. The uniformity and performance claims are measured outcomes, not predictions forced by the input design. A single self-reference to the authors' prior Kimi Linear work appears but is not load-bearing for the AttnRes contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the standard transformer assumptions of PreNorm and additive residuals; the attention weights themselves are learned parameters rather than hand-chosen free parameters. No new physical or mathematical entities are postulated.

axioms (1)

domain assumption: Standard transformer architecture with PreNorm and residual connections is the baseline.
The paper treats these as given in modern LLMs and measures improvement relative to them.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
cs.CL 2026-05 conditional novelty 7.0

Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.
Delta Attention Residuals
cs.LG 2026-05 unverdicted novelty 7.0

Delta Attention Residuals attend over per-sublayer deltas instead of cumulative hidden states, producing higher-contrast attention weights and 1.7-8.2% validation perplexity gains over standard and attention residuals...
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
cs.LG 2026-05 unverdicted novelty 7.0

Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
cs.CV 2026-05 unverdicted novelty 7.0

NavOne enables one-step global navigation planning on top-down maps using a unified multi-modal framework, achieving state-of-the-art results and up to 80x speedup on the new R2R-TopDown dataset.
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
cs.CV 2026-05 unverdicted novelty 7.0

NavOne reformulates vision-language navigation as single-step global path planning on top-down maps, delivering state-of-the-art results and 8x-80x speedups over prior map-based and egocentric baselines.
Transformers with Selective Access to Early Representations
cs.LG 2026-05 unverdicted novelty 7.0

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Transformers with Selective Access to Early Representations
cs.LG 2026-05 unverdicted novelty 7.0

SATFormer uses a learned context-dependent gate for selective access to early-layer value representations in Transformers, improving loss and accuracy over static residual baselines.
Gradient Boosting within a Single Attention Layer
cs.LG 2026-04 conditional novelty 7.0

Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
cs.CV 2026-03 unverdicted novelty 7.0

XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
cs.LG 2026-05 conditional novelty 6.0

Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
Rethinking Cross-Layer Information Routing in Diffusion Transformers
cs.CV 2026-05 conditional novelty 6.0

DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...
Attention Drift: What Autoregressive Speculative Decoding Models Learn
cs.LG 2026-05 unverdicted novelty 6.0

Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improv...
RigidFormer: Learning Rigid Dynamics using Transformers
cs.CV 2026-05 unverdicted novelty 6.0

RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
cs.CV 2026-05 unverdicted novelty 6.0

A new method accumulates historical pose features across layers in a Transformer network to reach state-of-the-art 3D human pose estimation accuracy.
Cubit: Token Mixer with Kernel Ridge Regression
cs.LG 2026-05 unverdicted novelty 6.0

Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
cs.CV 2026-05 unverdicted novelty 6.0

NavOne enables one-step global path planning for vision-language navigation on top-down maps via a unified neural framework, achieving SOTA among map-based methods with 8x and 80x speedups on the new R2R-TopDown dataset.
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
cs.LG 2026-05 unverdicted novelty 6.0

A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 5.0

GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
cs.CV 2026-05 unverdicted novelty 5.0

L2A achieves state-of-the-art 3D human pose estimation by maintaining consistent feature spaces across layers and adaptively aggregating historical pose representations to reuse early-layer spatial and motion cues.
Cubit: Token Mixer with Kernel Ridge Regression
cs.LG 2026-05 unverdicted novelty 5.0

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
Hyperloop Transformers
cs.LG 2026-04 unverdicted novelty 5.0

Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
cs.CL 2026-04 unverdicted novelty 5.0

DALM is a proposed language model architecture that enforces algebraic constraints via a three-phase process over domain lattices to prevent cross-domain knowledge contamination during generation.
Attention Sinks and Outliers in Attention Residuals
cs.LG 2026-05 unverdicted novelty 4.0

OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.
BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry
quant-ph 2026-05 unverdicted novelty 4.0

BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming ...
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
cs.AI 2026-05 unverdicted novelty 3.0

AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 21 Pith papers · 31 internal anchors

[1]

Jacob Austin et al.Program Synthesis with Large Language Models. 2021. arXiv: 2108.07732 [cs.PL].URL: https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Thomas Bachlechner et al.ReZero is All You Need: Fast Convergence at Large Depth. 2020. arXiv:2003.04887 [cs.LG].URL:https://arxiv.org/abs/2003.04887

work page arXiv 2020
[3]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.Neural Machine Translation by Jointly Learning to Align and Translate. 2016. arXiv:1409.0473 [cs.CL].URL:https://arxiv.org/abs/1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

Chen Chen and Lai Wei.Post-LayerNorm Is Back: Stable, ExpressivE, and Deep. 2026. arXiv: 2601.19895 [cs.LG].URL:https://arxiv.org/abs/2601.19895

work page arXiv 2026
[5]

Mark Chen et al.Evaluating Large Language Models Trained on Code. 2021. arXiv: 2107.03374 [cs.LG]. URL:https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark et al. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In: arXiv:1803.05457v1(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Karl Cobbe et al.Training Verifiers to Solve Math Word Problems. 2021. arXiv:2110.14168 [cs.LG].URL: https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality”. In:CoRRabs/2405.21060 (2024).DOI: 10.48550/ARXIV.2405.21060. arXiv: 2405.21060.URL:https://doi.org/10.48550/arXiv.2405.21060

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.21060 2024
[9]

DeepSeek-AI et al.DeepSeek-V3 Technical Report. 2025. arXiv: 2412.19437 [cs.CL].URL: https://arxiv. org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Yanwen Fang et al.Cross-Layer Retrospective Retrieving via Layer Attention. 2023. arXiv: 2302 . 03985 [cs.CV].URL:https://arxiv.org/abs/2302.03985

work page arXiv 2023
[11]

Andrey Gromov et al.The Unreasonable Ineffectiveness of the Deeper Layers. 2025. arXiv: 2403 . 17887 [cs.CL].URL:https://arxiv.org/abs/2403.17887

work page arXiv 2025
[12]

Kaiming He et al.Deep Residual Learning for Image Recognition. 2015. arXiv: 1512.03385 [cs.CV].URL: https://arxiv.org/abs/1512.03385

work page internal anchor Pith review Pith/arXiv arXiv 2015
[13]

Dan Hendrycks et al.Measuring Massive Multitask Language Understanding. 2021. arXiv: 2009 . 03300 [cs.CY].URL:https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Dan Hendrycks et al.Measuring Mathematical Problem Solving With the MATH Dataset. 2021. arXiv: 2103. 03874 [cs.LG].URL:https://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Jordan Hoffmann et al.Training Compute-Optimal Large Language Models. 2022. arXiv:2203.15556 [cs.CL]. URL:https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Shengding Hu et al.MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024. arXiv:2404.06395 [cs.CL].URL:https://arxiv.org/abs/2404.06395

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Gao Huang et al.Densely Connected Convolutional Networks. 2018. arXiv: 1608 . 06993 [cs.CV].URL: https://arxiv.org/abs/1608.06993

work page internal anchor Pith review Pith/arXiv arXiv 2018
[18]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”. In: Advances in NeurIPS. 2019

work page 2019
[19]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang et al. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models”. In: Advances in NeurIPS36 (2023), pp. 62991–63010

work page 2023
[20]

Adaptive Mixtures of Local Experts

Robert A. Jacobs et al. “Adaptive Mixtures of Local Experts”. In:Neural Computation3.1 (1991), pp. 79–87. DOI:10.1162/neco.1991.3.1.79

work page doi:10.1162/neco.1991.3.1.79 1991
[21]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi et al. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension”. In:arXiv preprint arXiv:1705.03551(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Jared Kaplan et al.Scaling Laws for Neural Language Models. 2020. arXiv: 2001 . 08361 [cs.LG].URL: https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[23]

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Angelos Katharopoulos et al. “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”. In:Proceedings of ICML. Ed. by Hal Daumé III and Aarti Singh. PMLR, 2020, pp. 5156–5165.URL: https: //proceedings.mlr.press/v119/katharopoulos20a.html

work page 2020
[24]

Jonas Knupp et al.Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves. 2026. arXiv:2601.21582 [cs.AI].URL:https://arxiv.org/abs/2601.21582

work page arXiv 2026
[25]

Aitor Lewkowycz et al.Solving Quantitative Reasoning Problems with Language Models. 2022. arXiv: 2206. 14858 [cs.CL].URL:https://arxiv.org/abs/2206.14858. 17 Attention ResidualsTECHNICALREPORT

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li et al. “CMMLU: Measuring massive multitask language understanding in Chinese”. In:Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 11260–11285.DOI: 10 . 18653 / v1 / 2024 . findings - acl . 671.UR...

work page 2024
[27]

Tianyu Li et al.SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm. 2026. arXiv: 2602.08064 [cs.LG].URL:https://arxiv.org/abs/2602.08064

work page internal anchor Pith review arXiv 2026
[28]

Jingyuan Liu et al.Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG] .URL: https: //arxiv.org/abs/2502.16982

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Brian Mak and Jeffrey Flanigan.Residual Matrix Transformers: Scaling the Size of the Residual Stream. 2025. arXiv:2506.22696 [cs.LG].URL:https://arxiv.org/abs/2506.22696

work page arXiv 2025
[30]

Gaurav Menghani, Ravi Kumar, and Sanjiv Kumar.LAuReL: Learned Augmented Residual Layer. 2025. arXiv: 2411.07501 [cs.LG].URL:https://arxiv.org/abs/2411.07501

work page arXiv 2025
[31]

Maxim Milakov and Natalia Gimelshein.Online normalizer calculation for softmax. 2018. arXiv: 1805.02867 [cs.PF].URL:https://arxiv.org/abs/1805.02867

[32] Tsendsuren Munkhdalai et al. “Metalearned Neural Memory”. In: ArXiv abs/1907.09720 (2019). URL: https://api.semanticscholar.org/CorpusID:198179407

[33] Deepak Narayanan et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473 [cs.CL]. URL: https://arxiv.org/abs/2104.04473

[35] Toan Q. Nguyen and Julian Salazar. “Transformers without Tears: Improving the Normalization of Self-Attention”. In: Proceedings of IWSLT. Ed. by Jan Niehues et al. 2019. URL: https://aclanthology.org/2019.iwslt-1.17/

[36] OpenAI et al. GPT-4 Technical Report. 2024. arXiv:2303.08774 [cs.CL]. URL: https://arxiv.org/abs/2303.08774

[37] Matteo Pagliardini et al. DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging. 2024. arXiv:2402.02622 [cs.CL]. URL: https://arxiv.org/abs/2402.02622

[38] Bowen Peng et al. “YaRN: Efficient context window extension of large language models”. In: arXiv preprint arXiv:2309.00071 (2023)

[39] Matthew E. Peters et al. “Deep Contextualized Word Representations”. In: Proceedings of NAACL. 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202/

[40] Reiner Pope et al. Efficiently Scaling Transformer Inference. 2022. arXiv:2211.05102 [cs.LG]

[41] Zhen Qin et al. HGRN2: Gated Linear RNNs with State Expansion. 2024. arXiv:2404.07904 [cs.CL]

[42] David Rein et al. “GPQA: A graduate-level Google-proof Q&A benchmark”. In: First Conference on Language Modeling. 2024

[43] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. “Linear Transformers Are Secretly Fast Weight Programmers”. In: Proceedings of ICML. Ed. by Marina Meila and Tong Zhang. PMLR, 2021, pp. 9355–9366. URL: https://proceedings.mlr.press/v139/schlag21a.html

[44] Jürgen Schmidhuber. “Learning to control fast-weight memories: An alternative to dynamic recurrent networks”. In: Neural Computation 4.1 (1992), pp. 131–139

[45] Freda Shi et al. Language Models are Multilingual Chain-of-Thought Reasoners. 2022. arXiv:2210.03057 [cs.CL]. URL: https://arxiv.org/abs/2210.03057

[46] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway Networks. 2015. arXiv:1505.00387 [cs.LG]. URL: https://arxiv.org/abs/1505.00387

[47] Yu Sun et al. “Learning to (Learn at Test Time): RNNs with Expressive Hidden States”. In: ArXiv abs/2407.04620 (2024). URL: https://api.semanticscholar.org/CorpusID:271039606

[48] Yutao Sun et al. Retentive Network: A Successor to Transformer for Large Language Models. 2023. arXiv:2307.08621 [cs.CL]

[49] Mirac Suzgun et al. “Challenging BIG-Bench tasks and whether chain-of-thought can solve them”. In: arXiv preprint arXiv:2210.09261 (2022)

[50] Shawn Tan et al. “Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study”. In: Proceedings of ICLR. 2025

[51] Hugo Touvron et al. Going deeper with Image Transformers. 2021. arXiv:2103.17239 [cs.CV]. URL: https://arxiv.org/abs/2103.17239

[52] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971 [cs.CL].

[53] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[54] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in NeurIPS. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[55] Hongyu Wang et al. DeepNet: Scaling Transformers to 1,000 Layers. 2022. arXiv:2203.00555 [cs.CL]. URL: https://arxiv.org/abs/2203.00555

[56] Yubo Wang et al. “MMLU-Pro: A more robust and challenging multi-task language understanding benchmark”. In: Advances in NeurIPS 37 (2024), pp. 95266–95290

[57] Da Xiao et al. “MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections”. In: Proceedings of ICML. 2025

[58] Guangxuan Xiao et al. “Efficient streaming language models with attention sinks”. In: arXiv preprint arXiv:2309.17453 (2023)

[59] Tian Xie. Your DeepSeek mHC Might Not Need the “m”. Zhihu blog post. 2026. URL: https://zhuanlan.zhihu.com/p/2010852389670908320

[60] Zhenda Xie et al. mHC: Manifold-Constrained Hyper-Connections. 2026. arXiv:2512.24880 [cs.CL]. URL: https://arxiv.org/abs/2512.24880

[61] Ruibin Xiong et al. On Layer Normalization in the Transformer Architecture. 2020. arXiv:2002.04745 [cs.LG]. URL: https://arxiv.org/abs/2002.04745

[62] Bowen Yang et al. Rope to Nope and Back Again: A New Hybrid Attention Strategy. 2025. arXiv:2501.18795 [cs.CL]. URL: https://arxiv.org/abs/2501.18795

[63] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. “Gated Delta Networks: Improving Mamba2 with Delta Rule”. In: Proceedings of ICLR. 2025. URL: https://openreview.net/forum?id=r8H7xhYPwz

[64] Songlin Yang et al. “Gated Linear Attention Transformers with Hardware-Efficient Training”. In: Proceedings of ICML. PMLR, 2024

[65] Yongyi Yang and Jianyang Gao. mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations. 2026. arXiv:2601.05732 [cs.LG]. URL: https://arxiv.org/abs/2601.05732

[66] Rowan Zellers et al. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

[67] Biao Zhang and Rico Sennrich. “Root mean square layer normalization”. In: Advances in NeurIPS 32 (2019)

[68] Yifan Zhang et al. Deep Delta Learning. 2026. arXiv:2601.00417 [cs.LG]. URL: https://arxiv.org/abs/2601.00417

[69] Yilang Zhang et al. ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling. 2026. arXiv:2602.09009 [cs.LG]. URL: https://arxiv.org/abs/2602.09009

[70] Yu Zhang et al. Kimi Linear: An Expressive, Efficient Attention Architecture. 2025. arXiv:2510.26692 [cs.CL]

[71] Shu Zhong et al. Understanding Transformer from the Perspective of Associative Memory. 2025. arXiv:2505.19488 [cs.LG]. URL: https://arxiv.org/abs/2505.19488

[72] Zhanchao Zhou et al. “Value Residual Learning”. In: Proceedings of ACL. Ed. by Wanxiang Che et al. Vienna, Austria, 2025, pp. 28341–28356. URL: https://aclanthology.org/2025.acl-long.1375/

[73] Defa Zhu et al. Hyper-Connections. 2025. arXiv:2409.19606 [cs.LG]. URL: https://arxiv.org/abs/2409.19606

[74] Zhijian Zhuo et al. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. arXiv:2503.04598 [cs.CL]. URL: https://arxiv.org/abs/2503.04598

A Contributions

The authors are listed in order of the significance of their contributions, with those in project leadership roles appearing last.

Guangyu Chen∗ Yu Zhang∗ Jianlin Su∗ Weixin Xu Siyuan Pan Yaoyu Wang Yucheng Wang Guanduo Chen Bohong Yin Yuti...
