A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation

Hila Ofek; Shahar Mendel; Yuval Ran-Milo

arxiv: 2604.14722 · v1 · submitted 2026-04-16 · 💻 cs.LG

Pith reviewed 2026-05-10 11:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords attention sink · transformer · mechanistic interpretability · GPT-2 · query bias · positional embedding · key projection · circuit analysis

The pith

Attention sinks in GPT-2 arise from the interaction of a learned query bias, the first-layer MLP's transformation of positional encodings, and structure in the key projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention sinks in GPT-2-style models result from one particular circuit formed by three elements: a learned query bias, the way the first-layer MLP processes absolute positional embeddings, and certain patterns in the key projection. Using structural analysis and targeted causal interventions on natural language, math, and code inputs, the authors demonstrate that disrupting any one of these three components in the trained model significantly diminishes the sink. A sympathetic reader would care because attention sinks can distort model outputs and make internal computations harder to interpret, so identifying a specific cause points to precise fixes instead of general workarounds. If the account holds, the sink in GPT-2 is the product of an identifiable circuit, but not of any single indispensable ingredient: the abstract reports that architectures omitting each component still robustly exhibit sinks, so mitigation has to target the circuit a given model actually uses.

Core claim

The central claim is that the attention sink arises from the interaction among a learned query bias, the first-layer MLP transformation of the positional encoding, and structure in the key projection. Each of these components is individually dispensable: architectures that omit any one of them still robustly exhibit sinks. The finding is validated through interventions across multiple input domains and indicates that attention sinks can emerge via distinct circuits in different architectures.
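
To see why a query bias can contribute a shift that does not depend on the query token, it can help to expand the pre-softmax logit for one head. The expansion below is a standard identity for attention with biased query and key projections, written with row-vector inputs. The symbols $W_{Q,h}$ and $b_{K,h}$, the $1/\sqrt{d_h}$ scaling, and the omission of the layer index and of layer normalization are assumptions made for illustration, not the paper's exact notation.

```latex
\begin{align*}
s_{i,j}
  &= \frac{1}{\sqrt{d_h}}\,
     \bigl(x_i W_{Q,h} + b_{Q,h}\bigr)\bigl(x_j W_{k,h} + b_{K,h}\bigr)^{\top} \\
  &= \frac{1}{\sqrt{d_h}}\Bigl(
       \underbrace{x_i W_{Q,h} W_{k,h}^{\top} x_j^{\top}}_{\text{content term}}
     + \underbrace{b_{Q,h} W_{k,h}^{\top} x_j^{\top}}_{\Delta_{j,h}\ \text{(identical for every query } i\text{)}}
     + \underbrace{\bigl(x_i W_{Q,h} + b_{Q,h}\bigr)\, b_{K,h}^{\top}}_{\text{constant in } j}
     \Bigr)
\end{align*}
```

The last term is the same for every key position $j$, so the softmax over $j$ cancels it. Only the content term and $\Delta_{j,h}$ shape the attention weights, and if $\Delta_{1,h}$ is much larger than $\Delta_{j,h}$ for $j > 1$, every query is pushed toward the first position regardless of its content.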

What carries the argument

The three-component circuit consisting of learned query bias interacting with first-layer MLP-transformed positional encodings and key-projection structure.

If this is right

Removing the learned query bias from the trained model significantly diminishes the sink while preserving other model capabilities.
Altering the first-layer MLP so that it no longer transforms positional encodings significantly diminishes the sink in the trained model.
Randomizing or restructuring the key projection significantly diminishes the sink.
In GPT-2-style models, sinks can be mitigated by targeting any one of the three components rather than the entire attention mechanism.
Different transformer variants may require different interventions because sinks can arise through separate circuits.
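
As a rough illustration of the first item above, the sketch below zeroes a query bias in a hand-built single-head causal attention layer and compares the average attention mass on the first key position. Everything in it is an assumption for illustration: the toy matrices `X`, `W_q`, `W_k` and `b_q`, the helper names, and the choice to give the first token one very large coordinate. None of it is taken from GPT-2 weights or from the paper's intervention code.

```python
import math

def row_times_matrix(x, W):
    """Row vector x (length n) times matrix W (n x m) -> row vector (length m)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def causal_attention(X, W_q, b_q, W_k):
    """Attention weights of one head; query i only sees keys j <= i."""
    d_head = len(b_q)
    queries = [[q + b for q, b in zip(row_times_matrix(x, W_q), b_q)] for x in X]
    keys = [row_times_matrix(x, W_k) for x in X]
    weights = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_head)
                  for j in range(i + 1)]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(X) - i - 1))
    return weights

def first_position_mass(weights):
    """Mean attention on key 0, skipping query 0 (which can only see itself)."""
    return sum(row[0] for row in weights[1:]) / (len(weights) - 1)

# Toy inputs (hypothetical): token 0 carries one very large coordinate.
X = [[0.1, 0.2, 4.0], [0.3, -0.1, 0.1], [-0.2, 0.4, 0.0], [0.5, 0.1, -0.1]]
W_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_k = [[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]  # routes the large coordinate to head dim 0
b_q = [3.0, 0.0]                            # query bias aligned with that head dim

with_bias = first_position_mass(causal_attention(X, W_q, b_q, W_k))
without_bias = first_position_mass(causal_attention(X, W_q, [0.0, 0.0], W_k))
print(f"first-position mass with bias: {with_bias:.3f}, bias zeroed: {without_bias:.3f}")
```

In this toy the mass on the first key drops once the bias is zeroed, but it need not vanish, because the content term can still favour the token with the large coordinate. A real check would run the same comparison on the trained model's own weights.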

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mechanistic lens could be applied to larger models to check whether the same three components still dominate or whether new circuits appear at scale.
Architectures that avoid absolute positional embeddings or learned query biases by design would sidestep this particular circuit, though the abstract indicates that such architectures can still develop sinks through other circuits.
The dispensability result suggests that sink mitigation could be incorporated early in model design rather than applied only after training.
Similar circuit-level analysis might reveal why sinks appear or disappear when training data or objectives change.

Load-bearing premise

The causal interventions cleanly isolate the effects of the query bias, positional MLP, and key projection without unintended side effects on other computations.

What would settle it

A GPT-2 variant in which the query bias is ablated, the first-layer MLP no longer transforms positional encodings, and key-projection structure is randomized, yet the model still shows strong attention to the first position on the same inputs.

Figures

Figures reproduced from arXiv: 2604.14722 by Hila Ofek, Shahar Mendel, Yuval Ran-Milo.

**Figure 1.** The source-agnostic shift $\Delta^{(l)}_{j,h}$ at position 1 is systematically larger than at all other positions. Two overlapping density histograms of the source-agnostic shift $\Delta^{(l)}_{j,h} = b^{(l)}_{Q,h} W^{(l)\top}_{k,h} x^{(l)\top}_{j}$ (normalized so that $\min_j \Delta^{(l)}_{j,h} = 0$; see section 3.2.2), pooled across all heads $h$ and layers $l \in [4, 11]$ (see section 3.2.1). The distribution of scores for the first position ($j = 1$, red) is clearly s…

**Figure 3.** EPE closely tracks the net positional signal at every position. Per-position cosine similarity between $\mathrm{EPE}_i$ and the net added positional signal $N_i$ (see section 3.2.3). The shaded band spans the 10th–90th percentile across our dataset; the line marks the median. The similarity exceeds 0.82 at every position and reaches > 0.99 at position 1, confirming that $\mathrm{EPE}_1$ (the key driver of the sink) is an accurate …

**Figure 4.** $\mathrm{EPE}_1$ is large exactly where the bias projection is large. Two overlapping density histograms of $|\gamma^{(l)}_{h}[d]|$, across all heads $h$ and layers $l \in [4, 11]$: massive-activation coordinates of $\mathrm{EPE}_1$ (red) versus all other coordinates (blue). The x-axis is truncated for clarity; see full histogram in Appendix A.3.

**Figure 5.** Disrupting any component of the $b_Q$–$\mathrm{EPE}_1$–$W_k$ pathway significantly diminishes the sink. Head-averaged attention maps (layers 4–11) under all ten interventions for a single representative sentence. Each panel shows attention weights from query positions (rows) to key positions (columns). The first-position sink (bright first column) persists when the pathway is left intact (a, e, f, j), but is significantly d…

**Figure 6.** Full (non-truncated) histogram of the source-agnostic shift $\Delta^{(l)}_{j,h}$. Same data as fig. 1 with no axis restriction. Scores are normalized per (sentence, layer, head) slice so the minimum is zero. The first-position distribution (red) extends to a long right tail that lies well beyond the range of all other positions (blue).

**Figure 8.** Full (non-truncated) histogram of $|\gamma^{(l)}_{h}[d]|$. Same data as fig. 4 with no axis restriction. The massive-activation coordinates of $\mathrm{EPE}_1$ (red) extend to a long right tail that lies well beyond the range of all other coordinates (blue). … select coordinates whose absolute values exceed the mean absolute value by at least three standard deviations; in our model this criterion selects indices 138, 378 and 447…
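
The selection rule quoted in the Figure 8 excerpt (absolute value at least three standard deviations above the mean absolute value) can be sketched in a few lines. This is only a reading of that one sentence: the function name, the use of the population standard deviation of the absolute values, and the example vector are assumptions, and the excerpt does not say which estimator the authors used.

```python
import statistics

def massive_activation_indices(vec, n_std=3.0):
    """Indices whose |value| is at least mean(|v|) + n_std * std(|v|)."""
    abs_vals = [abs(v) for v in vec]
    # Assumption: population std of the absolute values (the source does not specify).
    threshold = statistics.fmean(abs_vals) + n_std * statistics.pstdev(abs_vals)
    return [i for i, a in enumerate(abs_vals) if a >= threshold]

# Hypothetical 100-dimensional vector: small values plus two planted outliers.
vec = [0.1 if i % 2 == 0 else -0.1 for i in range(100)]
vec[7], vec[42] = 50.0, -60.0
selected = massive_activation_indices(vec)
print(selected)
```

With very short vectors a three-sigma rule can select nothing even when one coordinate dominates, because a single outlier also inflates the standard deviation; the criterion is better suited to vectors as wide as a model's hidden dimension.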

Original abstract

Transformers commonly exhibit an attention sink: disproportionately high attention to the first position. We study this behavior in GPT-2-style models with learned query biases and absolute positional embeddings. Combining structural analysis with causal interventions, validated across natural-language, mathematical, and code inputs, we find that the sink arises from the interaction among (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Crucially, each component we identify is individually dispensable: architectures omitting each of them robustly exhibit sinks. This indicates that attention sinks may arise through distinct circuits across architectures. These findings inform mitigation of sinks, and motivate broader investigation into why sinks emerge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper traces attention sinks in GPT-2 to a three-part circuit but shows the sink survives removal of any single component.

Hi, the core finding is that attention sinks arise from the interaction of a learned query bias, the first-layer MLP acting on positional encodings, and structure in the key projection. The authors combine structural analysis with causal interventions and report that the sink holds up across natural language, math, and code inputs. They also show each of the three pieces can be removed individually and the sink still appears reliably in the modified architectures. This suggests sinks are not tied to one fixed circuit and can emerge through different routes depending on the model details.

That dispensability result is the clearest new piece here, as it moves beyond just documenting the sink to showing its robustness. The cross-domain testing is a plus because it reduces the chance that the pattern is an artifact of one data type. On the soft side, the available description gives little on intervention strength, exact controls, or quantitative effect sizes, so it is difficult to assess how cleanly the components were isolated or whether side effects on other behaviors were fully checked. If those details are thin in the full text, the causal claims rest more on qualitative robustness than on tight measurement.

This work is mainly for people doing mechanistic interpretability on transformers or looking for practical ways to handle attention artifacts without full redesigns. It is a focused, incremental step rather than a broad advance. I would send it to peer review because the circuit decomposition and the dispensability tests give it enough concrete content to be worth referee time, even if the quantitative reporting needs tightening.

Referee Report

1 major / 0 minor

Summary. The paper claims that attention sinks in GPT-2-style models with learned query biases and absolute positional embeddings arise from the interaction of three components: (i) a learned query bias, (ii) the first-layer MLP transformation of the positional encoding, and (iii) structure in the key projection. Through structural analysis combined with causal interventions, validated on natural-language, mathematical, and code inputs, the authors show that each component is individually dispensable—models omitting any one still exhibit robust sinks—indicating that sinks can emerge via distinct circuits across architectures. The work aims to inform mitigation strategies.

Significance. If the causal account holds, this manuscript offers a concrete mechanistic explanation for a widespread transformer phenomenon, advancing beyond purely observational studies. The cross-domain validation and the explicit demonstration of component dispensability are strengths, as they suggest multiple possible mechanisms and reduce reliance on any single fitted parameter. The empirical focus on interventions rather than post-hoc fitting provides a falsifiable basis for the claims and directly supports practical mitigation efforts.

major comments (1)

[Experimental Validation / Causal Interventions] The description of the causal interventions (mentioned in the abstract and experimental sections) lacks quantitative details on intervention strength, control conditions, and effect sizes (e.g., delta in attention mass to position 0 before/after ablation). Without these, it is difficult to evaluate whether the interventions cleanly isolate the three components or introduce side effects on other behaviors, which is load-bearing for the central claim that the identified interaction is the source of the sink.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the positive assessment of the work's significance and the specific feedback on experimental details, which has prompted us to strengthen the manuscript. Below we address the major comment point-by-point and describe the revisions made.

Referee: [Experimental Validation / Causal Interventions] The description of the causal interventions (mentioned in the abstract and experimental sections) lacks quantitative details on intervention strength, control conditions, and effect sizes (e.g., delta in attention mass to position 0 before/after ablation). Without these, it is difficult to evaluate whether the interventions cleanly isolate the three components or introduce side effects on other behaviors, which is load-bearing for the central claim that the identified interaction is the source of the sink.

Authors: We agree that the original manuscript would benefit from more explicit quantitative reporting on the interventions to allow readers to assess their precision and specificity. In the revised version we have added a dedicated subsection (Section 4.2) and Appendix C that now specify: (i) intervention strengths, including the exact operations performed (zeroing the learned query bias vector, scaling the first-layer MLP output on positional encodings by a factor of zero, and targeted modifications to the key projection matrix); (ii) control conditions, consisting of matched-magnitude interventions applied to later-layer components and to randomly selected attention heads; and (iii) effect sizes, reported as the mean and standard deviation of the change in attention mass allocated to position 0, computed over 500 examples per domain (natural language, mathematics, code) before versus after each intervention, together with the corresponding change in model perplexity. These additions confirm that the interventions produce large, consistent reductions in sink behavior while leaving overall model performance and non-sink attention patterns largely intact. We believe the expanded reporting directly supports the claim that the identified components are causally implicated.

Revision: yes
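
The effect-size statistic described in item (iii) of the response, the mean and standard deviation of the per-example change in first-position attention mass, could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name, the use of the sample standard deviation, and the four before/after values are placeholders invented to exercise the code, not numbers from the paper or the rebuttal.

```python
import statistics

def sink_effect_size(mass_before, mass_after):
    """Mean and sample std of the per-example change in first-position attention mass."""
    if len(mass_before) != len(mass_after):
        raise ValueError("need one before/after pair per example")
    deltas = [after - before for before, after in zip(mass_before, mass_after)]
    return statistics.fmean(deltas), statistics.stdev(deltas)

# Placeholder numbers, invented purely to exercise the function (not results).
before = [0.80, 0.70, 0.90, 0.60]
after = [0.20, 0.30, 0.10, 0.20]
mean_delta, std_delta = sink_effect_size(before, after)
print(f"mean change: {mean_delta:.3f}, std: {std_delta:.3f}")
```

Calling the function once per domain (natural language, mathematics, code) and once per control intervention would give the kind of table the referee asks for; pairing each example with itself keeps between-example variation out of the estimate.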

Circularity Check

0 steps flagged

No significant circularity detected

The paper's account of attention sinks rests on structural analysis of GPT-2 components combined with causal interventions on each identified factor (query bias, first-layer MLP on positional encodings, key projection structure), alongside the observation that each factor is individually dispensable across architectures. These interventions are reported to hold across natural language, math, and code inputs, providing independent empirical support rather than any derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No equations or uniqueness theorems are invoked that collapse the central claim to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis assumes standard mechanistic interpretability premises that causal interventions reveal true mechanisms and that the tested inputs are representative.

axioms (1)

Domain assumption: causal interventions on model components accurately isolate their contribution to attention patterns without side effects. Invoked when claiming dispensability after ablations.
