A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation
Pith reviewed 2026-05-10 11:02 UTC · model grok-4.3
The pith
Attention sinks in GPT-2 arise from the interaction of a learned query bias, first-layer MLP on positional encodings, and structure in the key projection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the attention sink arises from the interaction among a learned query bias, the first-layer MLP transformation of the positional encoding, and structure in the key projection. Each of these components is individually dispensable: architectures that omit any one of them still robustly exhibit sinks. The finding is validated through causal interventions across natural-language, mathematical, and code inputs, and it indicates that attention sinks may arise through distinct circuits in different architectures.
What carries the argument
The three-component circuit consisting of learned query bias interacting with first-layer MLP-transformed positional encodings and key-projection structure.
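One way to see why a learned query bias can produce a position-specific effect is to expand the standard single-head attention score with biases. This is textbook algebra rather than the paper's own derivation, and the symbols ($x_i$ for the attention input at position $i$, $W_Q, W_K$ for the projections, $b_Q, b_K$ for their biases, $d$ for the head dimension) are notation assumed here for illustration:

```latex
\begin{aligned}
s_{ij} &= \frac{(W_Q x_i + b_Q)^\top (W_K x_j + b_K)}{\sqrt{d}} \\
       &= \frac{1}{\sqrt{d}} \Big[ (W_Q x_i)^\top W_K x_j
          \;+\; b_Q^\top W_K x_j
          \;+\; (W_Q x_i + b_Q)^\top b_K \Big]
\end{aligned}
```

The last term does not depend on the key index $j$, so it cancels in the softmax over keys. The middle term $b_Q^\top W_K x_j$ does not depend on the query index $i$: it acts as a per-key offset shared by every query. If that offset is much larger at $j = 0$ than elsewhere, for example because of what earlier layers write into $x_0$, every query is pushed toward the first position, which is consistent with the three-way interaction described above.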
If this is right
- Removing the learned query bias eliminates the sink while preserving other model capabilities.
- Altering the first-layer MLP to avoid transforming positional encodings prevents sink formation.
- Randomizing or restructuring the key projection stops the sink from appearing.
- Sinks can be mitigated by targeting any one of the three components rather than the entire attention mechanism.
- Different transformer variants may require different interventions because sinks can arise through separate circuits.
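The first item in the list above can be illustrated with a hand-built toy, not with GPT-2 weights. The sketch below assumes a single attention head in which the key at position 0 has a large component along one dimension and the query bias points along that same dimension; the vectors and the helper `attention_to_first` are hypothetical and chosen only to make the effect visible.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_to_first(queries, keys, q_bias):
    """Mean causal-attention weight on key 0 over query positions 1..T-1."""
    d = len(q_bias)
    total = 0.0
    for i in range(1, len(queries)):
        # Add the (toy) learned query bias to the content query.
        q = [x + b for x, b in zip(queries[i], q_bias)]
        # Causal mask: query i only scores keys 0..i.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total += exps[0] / sum(exps)
    return total / (len(queries) - 1)

# Toy assumption: dimension 0 is a "position-0" key direction.
keys = [
    [4.0, 0.0, 0.0, 0.0],  # position 0
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
queries = [[0.0, 0.5, 0.5, 0.5]] * 4  # content queries ignore dimension 0
q_bias = [3.0, 0.0, 0.0, 0.0]         # bias aligned with the position-0 key

with_bias = attention_to_first(queries, keys, q_bias)
without_bias = attention_to_first(queries, keys, [0.0] * 4)  # bias zeroed
print(f"with bias: {with_bias:.3f}, bias zeroed: {without_bias:.3f}")
```

In this toy, zeroing the bias removes the shared offset toward position 0, so the attention on the first key falls back to roughly what content alone would give. Whether the same holds in a trained model is an empirical question that the paper's interventions address.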
Where Pith is reading between the lines
- The same mechanistic lens could be applied to larger models to check whether the same three components still dominate or whether new circuits appear at scale.
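- Architectures that avoid absolute positional embeddings or learned query biases by design are unlikely to sidestep sinks on that basis alone, since the abstract reports that architectures omitting each component still exhibit sinks; they would need their own circuit-level analysis.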
- The dispensability result suggests that sink mitigation could be incorporated early in model design rather than applied only after training.
- Similar circuit-level analysis might reveal why sinks appear or disappear when training data or objectives change.
Load-bearing premise
The causal interventions cleanly isolate the effects of the query bias, positional MLP, and key projection without unintended side effects on other computations.
What would settle it
A GPT-2 variant in which the query bias is ablated, the first-layer MLP no longer transforms positional encodings, and key-projection structure is randomized, yet the model still shows strong attention to the first position on the same inputs.
Original abstract
Transformers commonly exhibit an attention sink: disproportionately high attention to the first position. We study this behavior in GPT-2-style models with learned query biases and absolute positional embeddings. Combining structural analysis with causal interventions, validated across natural-language, mathematical, and code inputs, we find that the sink arises from the interaction among (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Crucially, each component we identify is individually dispensable: architectures omitting each of them robustly exhibit sinks. This indicates that attention sinks may arise through distinct circuits across architectures. These findings inform mitigation of sinks, and motivate broader investigation into why sinks emerge.
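The "disproportionately high attention to the first position" in the abstract can be made measurable as the average attention weight placed on key position 0. The following is a minimal sketch, assuming causal attention and raw score matrices given as nested lists; the names `causal_softmax`, `sink_mass`, and `uniform_baseline` are illustrative and do not come from the paper.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax under a causal mask: query i sees keys 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def sink_mass(weights):
    """Mean attention on key 0, averaged over query positions 1..T-1.

    Query 0 is skipped because under a causal mask it can only attend to itself.
    """
    rows = weights[1:]
    return sum(r[0] for r in rows) / len(rows)

def uniform_baseline(T):
    """Sink mass of perfectly uniform causal attention, for comparison."""
    return sum(1.0 / (i + 1) for i in range(1, T)) / (T - 1)

# Two synthetic score matrices: no preference vs. a strong pull to key 0.
flat = [[0.0] * 4 for _ in range(4)]
sinky = [[8.0, 0.0, 0.0, 0.0] for _ in range(4)]

flat_mass = sink_mass(causal_softmax(flat))
sinky_mass = sink_mass(causal_softmax(sinky))
print(flat_mass, uniform_baseline(4), sinky_mass)
```

Comparing against the uniform baseline matters because causal masking alone gives early positions more attention than late ones; a sink is attention to position 0 well above that baseline.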
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that attention sinks in GPT-2-style models with learned query biases and absolute positional embeddings arise from the interaction of three components: (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Through structural analysis combined with causal interventions, validated on natural-language, mathematical, and code inputs, the authors show that each component is individually dispensable—models omitting any one still exhibit robust sinks—indicating that sinks can emerge via distinct circuits across architectures. The work aims to inform mitigation strategies.
Significance. If the causal account holds, this manuscript offers a concrete mechanistic explanation for a widespread transformer phenomenon, advancing beyond purely observational studies. The cross-domain validation and the explicit demonstration of component dispensability are strengths, as they suggest multiple possible mechanisms and reduce reliance on any single fitted parameter. The empirical focus on interventions rather than post-hoc fitting provides a falsifiable basis for the claims and directly supports practical mitigation efforts.
major comments (1)
- [Experimental Validation / Causal Interventions] The description of the causal interventions (mentioned in the abstract and experimental sections) lacks quantitative details on intervention strength, control conditions, and effect sizes (e.g., delta in attention mass to position 0 before/after ablation). Without these, it is difficult to evaluate whether the interventions cleanly isolate the three components or introduce side effects on other behaviors, which is load-bearing for the central claim that the identified interaction is the source of the sink.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We appreciate the positive assessment of the work's significance and the specific feedback on experimental details, which has prompted us to strengthen the manuscript. Below we address the major comment point-by-point and describe the revisions made.
- Referee: [Experimental Validation / Causal Interventions] The description of the causal interventions (mentioned in the abstract and experimental sections) lacks quantitative details on intervention strength, control conditions, and effect sizes (e.g., delta in attention mass to position 0 before/after ablation). Without these, it is difficult to evaluate whether the interventions cleanly isolate the three components or introduce side effects on other behaviors, which is load-bearing for the central claim that the identified interaction is the source of the sink.
Authors: We agree that the original manuscript would benefit from more explicit quantitative reporting on the interventions to allow readers to assess their precision and specificity. In the revised version we have added a dedicated subsection (Section 4.2) and Appendix C that now specify: (i) intervention strengths, including the exact operations performed (zeroing the learned query bias vector, scaling the first-layer MLP output on positional encodings by a factor of zero, and targeted modifications to the key projection matrix); (ii) control conditions, consisting of matched-magnitude interventions applied to later-layer components and to randomly selected attention heads; and (iii) effect sizes, reported as the mean and standard deviation of the change in attention mass allocated to position 0, computed over 500 examples per domain (natural language, mathematics, code) before versus after each intervention, together with the corresponding change in model perplexity. These additions confirm that the interventions produce large, consistent reductions in sink behavior while leaving overall model performance and non-sink attention patterns largely intact. We believe the expanded reporting directly supports the claim that the identified components are causally implicated.
Revision: yes
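The effect-size reporting described in the response (mean and standard deviation of the per-example change in attention mass on position 0) could be computed along the following lines. This is a sketch under stated assumptions: `effect_size` is a hypothetical helper, the measurements are paired per example, and the numbers are placeholders for illustration only, not results from the paper.

```python
import statistics

def effect_size(before, after):
    """Summarize the per-example change (after - before) in attention mass
    on position 0 as (mean, sample standard deviation)."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired per example")
    deltas = [a - b for b, a in zip(before, after)]
    return statistics.mean(deltas), statistics.stdev(deltas)

# Placeholder values for illustration only; NOT measurements from the paper.
before = [0.62, 0.58, 0.71, 0.66]  # sink mass per example, unmodified model
after = [0.05, 0.07, 0.04, 0.08]   # same examples after an intervention

mean_delta, std_delta = effect_size(before, after)
print(f"delta sink mass: {mean_delta:.4f} +/- {std_delta:.4f}")
```

Running the same function on a control intervention (for example, one applied to a randomly selected head) gives the comparison the referee asks for: a targeted intervention should show a large negative mean change, while the control should stay near zero.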
Circularity Check
No significant circularity detected
The paper's account of attention sinks rests on structural analysis of GPT-2 components combined with causal interventions on each identified factor (query bias, first-layer MLP on positional encodings, key projection structure), together with the observation that architectures omitting any one factor still exhibit sinks. The interventions are reported to hold across natural-language, mathematical, and code inputs, providing independent empirical support rather than a derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No equations or uniqueness theorems are invoked that collapse the central claim to its own assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Causal interventions on model components accurately isolate their contribution to attention patterns without side effects.