Attention Sinks and Outliers in Attention Residuals

arxiv: 2605.17887 · v1 · pith:JJEO6DL3new · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3

keywords: attention sinks, activation outliers, AttnResidual, quantization robustness, null signaling, inter-layer routing, Softmax1, transformer stability

The pith

AttnResidual architectures intensify attention sinks and outliers through dual normalization, which OASIS counters via Softmax1 null spaces and inter-layer signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding depth-wise normalization in attention residuals improves routing flexibility but worsens sink-dominated attention and activation outliers, which in turn degrades inference stability and quantization performance. OASIS counters this by creating a null space with Softmax1 and routing token-level null evidence across layers through an inter-layer signal. If correct, this reduces the maximum infinity norm of activations and kurtosis while preserving model capacity, leading to substantially lower perplexity after quantization and higher accuracy on downstream tasks. A reader would care because reliable low-precision inference is essential for deploying large models on hardware with limited precision support.
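
To make the link between activation outliers and quantization error concrete, here is a minimal sketch. It assumes symmetric per-tensor round-to-nearest quantization, which this page does not confirm as the paper's scheme; the function names and the toy values are hypothetical.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor fake quantization (an assumed scheme, for illustration)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole tensor
    if scale == 0.0:
        return list(values)
    # Round to the integer grid, clamp, and map back to floating point.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def error_on_prefix(values, n, bits=8):
    """Mean absolute quantization error over the first n entries only."""
    deq = quantize_dequantize(values, bits)
    return sum(abs(values[i] - deq[i]) for i in range(n)) / n


# Hypothetical activations: the same eight values with and without one outlier.
smooth = [0.11, -0.42, 0.27, 0.93, -0.68, 0.05, -0.31, 0.74]
spiky = smooth + [60.0]

err_smooth = error_on_prefix(smooth, len(smooth), bits=8)
err_spiky = error_on_prefix(spiky, len(smooth), bits=8)
err_smooth_4bit = error_on_prefix(smooth, len(smooth), bits=4)

print("8-bit error without outlier:", err_smooth)
print("8-bit error with outlier:   ", err_spiky)
print("4-bit error without outlier:", err_smooth_4bit)
```

A single large value stretches the shared scale, so the ordinary entries land on a much coarser grid. Under this assumed scheme, that is why a lower maximum infinity norm would be expected to help at low bit widths.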

Core claim

The dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness; introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal reduces sink-dominated routing and improves structural robustness.

What carries the argument

The inter-layer null signal that couples token-level null evidence from a Softmax1-based null space to depth routing.
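
As a rough illustration of what a Softmax1-style null space can mean, the sketch below uses the commonly cited form exp(s_i) / (1 + sum_j exp(s_j)), which adds an implicit zero logit to the denominator. Whether OASIS uses exactly this form, and how the leftover mass feeds the inter-layer signal, is not stated on this page; `softmax1` and `null_mass` are illustrative names.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(scores):
    """Softmax with an implicit extra zero logit (assumed form of Softmax1).

    Returns the attention weights and the leftover "null" mass, so that
    sum(weights) + null_mass == 1.
    """
    m = max(0.0, max(scores))        # include the implicit 0 logit for stability
    exps = [math.exp(s - m) for s in scores]
    null = math.exp(-m)              # contribution of the implicit zero logit
    total = null + sum(exps)
    return [e / total for e in exps], null / total


# Hypothetical scores where no key is a good match for the query.
scores = [-4.0, -5.0, -6.0]
weights, null_mass = softmax1(scores)

print("softmax weights: ", softmax(scores))
print("softmax1 weights:", weights)
print("null mass:       ", null_mass)
```

With standard softmax, a query that matches nothing must still spend its full unit of attention somewhere, which is one common account of why a sink token emerges. In this sketch the unmatched mass goes to the null slot instead.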

If this is right

Lower maximum infinity norm and average kurtosis across attention layers.
Reduced perplexity degradation under W8A8 quantization.
Higher GSM8K Pass@1 accuracy under W4A4 quantization.
Consistent gains in attention sink metrics and post-quantization performance on real-world datasets.
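
The two outlier metrics named above can be sketched as follows. This assumes kurtosis means the non-excess fourth standardized moment and that the infinity norm is the largest absolute activation; the paper's exact definitions and aggregation across layers are not given on this page, and the sample values are hypothetical.

```python
def infinity_norm(values):
    """Largest absolute entry of an activation vector."""
    return max(abs(v) for v in values)


def kurtosis(values):
    """Fourth standardized moment (non-excess); about 3 for Gaussian data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / (m2 ** 2)


# Hypothetical hidden states: a flat vector and one with a single outlier channel.
smooth = [-1.0, -0.5, 0.0, 0.5, 1.0, -1.0, -0.5, 0.0, 0.5, 1.0]
spiky = smooth[:-1] + [20.0]

for name, vec in (("smooth", smooth), ("spiky", spiky)):
    print(name, "inf-norm:", infinity_norm(vec), "kurtosis:", kurtosis(vec))
```

Both numbers rise sharply when one channel dominates, which is the pattern the paper associates with quantization brittleness.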

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same null-signaling idea could be tested on other residual architectures that add extra normalization channels.
If the inter-layer signal preserves capacity, it might support training deeper attention stacks without proportional growth in outlier severity.
A direct test would measure whether removing the inter-layer component alone restores the original sink levels while keeping other OASIS parts fixed.

Load-bearing premise

The dual-normalization design of AttnResidual is the primary driver of intensified sink formation and quantization brittleness, and coupling token-level null evidence to depth routing via inter-layer signals will reduce sinks without introducing new instabilities or capacity loss.

What would settle it

Applying the inter-layer null signal produces no measurable drop in maximum infinity norm or kurtosis and no reduction in W8A8 perplexity relative to the five baselines on the three evaluated datasets.

Figures

Figures reproduced from arXiv: 2605.17887 by Binghui Wang, Chenghao Qiu, Chenwei Xu, Eric Hanchen Jiang, Haoran Dai, Haotian Zhang, Haozheng Luo, Jingyuan Huang, Shaoyang Zhang, Xi Chen, Yan Chen, Yijiang Li, Zhenyu Pan.

**Figure 2.** Attention sink visualization. We visualize token-level attention maps for the (A) vanilla Transformer and (B) AttnResidual variant at layers 0, 9, and 14. The results show that the <|begin_of_text|> token acts as a persistent attention sink, and that its concentration becomes progressively stronger with depth, particularly under the dual-normalization design of AttnResidual.

**Figure 3.** Outlier amplification in AttnResidual. We visualize the hidden-state kurtosis (left) and infinity norm (right) across layers for the original Transformer and the AttnResidual variant. The results show that AttnResidual produces substantially larger kurtosis and activation magnitudes throughout the network, indicating that dual Softmax normalization amplifies outlier channels relative to the single-normaliz…

**Figure 4.** Attention sink mitigation in representative attention maps. We visualize representative head0 token-level attention maps on a short causal prompt for trained attention-residual variants. Strong vertical concentration on the leading <|begin_of_text|> token indicates an attention sink. Relative to standard variants, null-aware routing weakens both the dominant first-token sink and secondary sink bands, yiel…

Original abstract

We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes OASIS, an outlier- and sink-aware technique for AttnResidual architectures that introduces a Softmax1-based null space and couples token-level null evidence to depth routing via an inter-layer null signal. It claims that the dual-normalization design of AttnResidual intensifies attention sinks, activation outliers, and quantization brittleness, and reports experimental reductions of 9.26% in maximum infinity norm, 2.60% in average kurtosis, 75.85% in perplexity under W8A8, and a 12.42% improvement in GSM8K Pass@1 under W4A4 relative to five baselines across three datasets.

Significance. If the theoretical isolation of dual-normalization effects and the experimental gains hold under rigorous controls, this could provide a targeted mechanism for mitigating sink-dominated routing in residual attention layers, with direct implications for post-training quantization stability in large language models.

major comments (3)

Abstract: the theoretical demonstration that dual-normalization intensifies sink formation supplies no equation, bounding argument, or isolation step showing how the added depth-wise normalization channel specifically increases max infinity norm or kurtosis beyond what single-norm residuals already produce; this mechanism is load-bearing for the motivation of the Softmax1 null-space fix.
Experimental section: the reported 9.26% reduction in maximum infinity norm and 75.85% perplexity drop under W8A8 are stated without naming the five baselines, dataset splits, number of runs, or controls for confounding factors such as layer depth or quantization parameter choices, undermining attribution to the inter-layer null signal.
Theoretical analysis: the central assumption that AttnResidual's added depth-wise normalization channel (rather than its interaction with the original layer-norm or residual scaling) is the primary driver of worsened sinks is not isolated; without this separation the proposed null-space and inter-layer signaling may target the wrong mechanism.

minor comments (2)

Abstract: the five baselines are referenced but not identified; listing them (and their relation to prior sink-mitigation work) in the introduction would improve readability.
Notation: the term 'null space' in the Softmax1 construction should be defined explicitly with respect to attention score distributions to prevent confusion with standard attention nulling techniques.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We have prepared point-by-point responses to the major comments and will revise the manuscript accordingly to improve clarity and rigor.

Referee: Abstract: the theoretical demonstration that dual-normalization intensifies sink formation supplies no equation, bounding argument, or isolation step showing how the added depth-wise normalization channel specifically increases max infinity norm or kurtosis beyond what single-norm residuals already produce; this mechanism is load-bearing for the motivation of the Softmax1 null-space fix.

Authors: We agree that the abstract would benefit from including the key theoretical element. In the revised manuscript, we will modify the abstract to briefly mention the bounding argument that demonstrates how the dual-normalization increases the max infinity norm and kurtosis, with a reference to the detailed derivation in the theoretical analysis section. This will better motivate the Softmax1 null-space fix.

Revision: yes

Referee: Experimental section: the reported 9.26% reduction in maximum infinity norm and 75.85% perplexity drop under W8A8 are stated without naming the five baselines, dataset splits, number of runs, or controls for confounding factors such as layer depth or quantization parameter choices, undermining attribution to the inter-layer null signal.

Authors: We appreciate the need for more experimental details. We will revise the experimental section to name the five baselines explicitly, describe the dataset splits, report the number of runs (with variance), and detail the controls for layer depth and quantization parameters. This will allow better attribution of the improvements to the inter-layer null signal.

Revision: yes

Referee: Theoretical analysis: the central assumption that AttnResidual's added depth-wise normalization channel (rather than its interaction with the original layer-norm or residual scaling) is the primary driver of worsened sinks is not isolated; without this separation the proposed null-space and inter-layer signaling may target the wrong mechanism.

Authors: We acknowledge that the isolation of the depth-wise normalization effect could be strengthened. In the revision, we will add further analysis or experiments to separate the contribution of the added normalization channel from interactions with layer-norm and residual scaling. This may involve additional ablation studies to confirm the primary driver.

Revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical claim and empirical results remain independent of inputs by construction.

The abstract asserts a theoretical demonstration that dual-normalization intensifies sinks and brittleness, yet supplies no equations, fitted parameters, or self-citations that would reduce this claim to a renaming or tautological restatement of the method itself. Reported improvements (e.g., 9.26% infinity-norm reduction) are presented as experimental observations on external datasets rather than predictions forced by post-hoc fitting or inter-layer signaling definitions. No load-bearing self-citation chains, ansatz smuggling, or uniqueness theorems imported from prior author work appear in the given text; the derivation chain therefore stays self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the unshown theoretical link between dual normalization and sink formation plus the assumption that the proposed null-space mechanism transfers cleanly across layers; no free parameters or invented physical entities are named in the abstract.

axioms (1)

Domain assumption: Dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness.
Stated as a theoretical result in the abstract without derivation provided.

invented entities (1)

OASIS technique with Softmax1-based null space and inter-layer null signal (no independent evidence)
Purpose: To reduce sink-dominated routing and improve structural robustness.
New method introduced in the abstract; no independent falsifiable evidence outside the reported experiments is described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: dual-normalization design of AttnResidual intensifies sink formation... Softmax1-based null space and inter-layer null signaling

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Theorem 5.2 (Softmax1 reduces structural pressure from outliers, sinks, and collapse)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 9 internal anchors

[1] Anand Anand, Umberto Cappellazzo, Stavros Petridis, and Maja Pantic. Mitigating attention sinks and massive activations in audio-visual speech recognition with LLMs. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 17942–17946. IEEE, 2026.

[2] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, 2019.

[3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[4] Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, and Dan Alistarh. GSQ: Highly-accurate low-precision scalar quantization for LLMs via Gumbel-Softmax sampling. arXiv preprint arXiv:2604.18556.

[5] GitHub repository. Unofficial PyTorch implementation of Attention Residuals. Accessed: 2026-04-08. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[6] Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, and Warren J. Gross. Innerq: Hardware-aware tuning-free quantization of KV cache for large language models. arXiv preprint arXiv:2602.23200.

[7] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7057–7075, 2020.

[8] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4365–4374, 2019.

[9] Yuval Ran-Milo. Attention sinks are provably necessary in softmax transformers: Evidence from trigger-conditional tasks. arXiv preprint arXiv:2603.11487.

[10] David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.

[11] Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, and Jaeho Lee. Prefixing attention sinks can mitigate activation outliers for large language model quantization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[12] Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, et al. Attention residuals. arXiv preprint arXiv:2603.15031.

[13] Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, and Yiren Zhao. On the existence and behavior of secondary attention sinks. In ICLR 2026 Workshop on Unifying Concept Representation Learning, 2026.

[14] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[15] Zayd M. K. Zuhri, Erland Hilman Fuadi, and Alham Fikri Aji. Softpick: No attention sink, no massive activations with rectified softmax. arXiv preprint arXiv:2504.20966.
