pith. machine review for the scientific record.

arxiv: 2605.06611 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords attention sink · self-attention · feed-forward network · variance discrepancy · super neurons · dimension disparity · large language models · RMSNorm
0 comments

The pith

Attention sinks in LLMs arise because channel-sparse down-projections in feed-forward layers create a dimension disparity that forces the first token to serve as a structural anchor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces the attention sink phenomenon, where initial tokens monopolize attention scores, to a variance discrepancy created by value aggregation inside self-attention. This discrepancy is sharply amplified by super neurons in the feed-forward layers. Their channel-sparse down-projections produce a dimension disparity specifically for the first-token representation, which the model resolves by forming attention sinks. Two controlled interventions confirm the causal chain by moving sinks to arbitrary positions, and a head-wise RMSNorm modification is shown to restore statistical parity and speed convergence.
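To see the first link in this chain concretely, here is a minimal NumPy sketch (not taken from the paper; the i.i.d. toy value matrix and near-uniform attention scores are assumptions) of how causal softmax value aggregation alone yields a position-dependent variance profile: the first token aggregates only itself and keeps roughly unit variance, while token t averages t+1 values and its variance shrinks accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 64                       # sequence length, head dimension
V = rng.normal(size=(T, d))         # i.i.d. value vectors with unit variance

# Causal attention with roughly uniform scores: token t averages ~t+1 values.
scores = rng.normal(scale=0.1, size=(T, T))
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

out = A @ V                         # value aggregation
var_per_token = out.var(axis=-1)    # dimension-wise variance of each token's output
print(np.round(var_per_token[:6], 3))
# Token 0 keeps ~unit variance; later tokens decay roughly as 1/(t+1),
# which is the variance discrepancy the paper starts from.
```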

Core claim

Value aggregation in self-attention induces a systematic variance discrepancy that super neurons in FFN layers amplify through channel-sparse down-projections; these projections create a dimension disparity in the first-token representation, which necessitates attention sinks as a structural anchor.

What carries the argument

The channel-sparse down-projections of super neurons, which trigger the dimension disparity in the first-token representation and thereby require attention sinks as a structural anchor.
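A toy illustration of that carrier, not drawn from the paper's code: when the output weights of a single strongly activated hidden neuron are channel-sparse, its activation is funneled into a handful of output dimensions, inflating the max/mean "dominance ratio" (the exact definition used in Figure 11 is assumed here). All sizes and the neuron index below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model = 8192, 2048

# Toy down-projection: one hidden neuron whose output weights are channel-sparse.
W_down = rng.normal(scale=0.02, size=(d_ff, d_model))
super_idx = 7890                     # illustrative, echoing the neuron in Figure 9
W_down[super_idx] = 0.0
W_down[super_idx, :4] = 5.0          # mass concentrated in 4 output channels

h = rng.normal(scale=0.02, size=d_ff)
h[super_idx] = 50.0                  # the first token massively activates this neuron

y = h @ W_down                       # FFN output for the first token
dominance_ratio = np.abs(y).max() / np.abs(y).mean()
print(f"dominance ratio (max/mean): {dominance_ratio:.1f}")
# A few outlier dimensions dominate the representation: the dimension disparity
# that the paper argues forces the attention sink.
```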

If this is right

  • Attention masks that isolate the aggregation effect can move sinks to any chosen position.
  • Amplifying variance at targeted positions can induce sinks at those positions.
  • Head-wise RMSNorm applied during pre-training restores parity across positions and accelerates convergence (a minimal sketch of the normalization follows this list).
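The third bullet is the paper's proposed fix; here is a minimal NumPy sketch of what head-wise RMSNorm could look like, applied to each head's value-aggregation output before the output projection. The tensor layout is assumed, and any learnable per-head gain the paper may include is omitted.

```python
import numpy as np

def headwise_rmsnorm(z, eps=1e-6):
    """RMS-normalize each head's value-aggregation output independently per position.

    z: attention outputs of shape (batch, heads, seq, head_dim), taken right after
    the softmax-weighted value aggregation and before the output projection W_O.
    """
    rms = np.sqrt((z ** 2).mean(axis=-1, keepdims=True) + eps)
    return z / rms

rng = np.random.default_rng(0)
z = rng.normal(size=(1, 8, 16, 64))
z[:, :, 0, :] *= 10.0                      # a high-variance first-token output
z_norm = headwise_rmsnorm(z)
print(z.std(axis=-1)[0, 0, :3])            # first position is a clear outlier
print(z_norm.std(axis=-1)[0, 0, :3])       # parity restored across positions
```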

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that avoid channel sparsity in down-projections might eliminate the need for sinks without losing representational power.
  • The same variance-discrepancy analysis could be applied to other positional biases observed in transformer training.
  • Early detection of super-neuron activation patterns could predict sink locations before full training completes.

Load-bearing premise

Super neurons and the dimension disparity they produce are the primary cause of attention sinks rather than secondary effects of training or architecture.

What would settle it

Remove or densify the down-projections of the identified super neurons while keeping all other weights and training dynamics fixed, then measure whether attention sinks disappear or shift.
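One way such an intervention could be wired up, sketched here rather than taken from the authors' protocol: edit only the down-projection column belonging to an identified super neuron and leave every other weight fixed. The module layout assumed below (a Llama-style `down_proj` whose weight has shape hidden_size × intermediate_size) is common but not guaranteed; the layer and neuron indices in the usage comment are illustrative, and "densify" is one possible reading of removing channel sparsity.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def edit_super_neuron(down_proj: nn.Linear, neuron_idx: int, mode: str = "remove"):
    """Edit one FFN hidden neuron's down-projection column in place.

    Assumes down_proj maps intermediate -> hidden, so its weight has shape
    (hidden_size, intermediate_size) and column `neuron_idx` carries the neuron's
    entire contribution to the residual stream.
    """
    col = down_proj.weight[:, neuron_idx]
    if mode == "remove":
        col.zero_()                              # delete the neuron's output pathway
    elif mode == "densify":
        # Spread the same l2 mass uniformly over output dimensions, destroying the
        # channel sparsity while preserving overall magnitude.
        col.fill_(col.norm() / col.numel() ** 0.5)
    return down_proj

# Hypothetical usage on a loaded Llama-style checkpoint (module path is an assumption):
# edit_super_neuron(model.model.layers[1].mlp.down_proj, neuron_idx=7890, mode="densify")
# ...then re-measure the average attention received by the first token.
```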

Figures

Figures reproduced from arXiv: 2605.06611 by Jiacheng Sun, Kaiqi Jiang, Siquan Li, Tianyang Hu.

Figure 1
Figure 1: Schematic overview of the attention sink mechanism. Value aggregation causes dimension-wise variance decay for subsequent tokens, while the first token acts as a high-variance outlier. This discrepancy is preserved by output projections, activating super neurons in FFNs. Subsequently, the channel-sparse down-projections induce dimension disparity, resulting in the attention sink.
Figure 2
Figure 2: Mitigation of attention sinks. Comparison of the averaged attention to the first token across layers. The baseline (red) exhibits an attention sink from the 5th layer, whereas both Sigmoid attention (green) and head-wise RMSNorm (blue) successfully suppress this artifact.
Figure 3
Figure 3: Layer-wise evolution of attention sink and representation norms. The attention score of the first token (left axis, blue) and its input representation l2-norm (right axis, red) are plotted for Llama-2. The synchronized spike indicates that the arrival of a high-norm representation triggers the attention sink.
Figure 5
Figure 5: Causal validation via mask intervention. Average attention score received by each token on Llama-2. The red line shows the result of blocking aggregation at index 10, which turns that token into a new attention sink.
Figure 6
Figure 6: Inducing attention sinks via variance amplification. A factor λ amplifies the variance of an arbitrary token (index 10); increasing λ directly increases the attention score received by that token. A control experiment shows that merely scaling the representation norm fails to induce such a sink.
Figure 7
Figure 7: Structural alignment and outlier status preservation in WO (Layer 1). (Left) Distribution of rank correlations between WO neuron weights and Token 0 input variance; the positive shift (mean = 0.32) indicates structural alignment. (Right) Even after passing through WO, the first token maintains significantly higher variance than subsequent tokens.
Figure 9
Figure 9: Selective activation of super neuron 7890. The left axis shows the cosine similarity with the neuron's gate weight vector (Wgate); the right axis shows the raw activation via its up-projection (Wup). The first token uniquely achieves both high alignment and massive activation, whereas subsequent tokens are effectively suppressed.
Figure 8
Figure 8: Structural identification of super neurons in the Layer 1 FFN. The l2 norms of the weight vectors for each hidden neuron in Wgate (left) and Wup (right). A distinct subset of neurons exhibits significantly larger norms.
Figure 11
Figure 11: Dimension disparity analysis. The layer-wise dominance ratio (max/mean) of the first token on WikiText-2. The sharp rise in early layers suggests that the representation is dominated by a few massive outlier dimensions.
Figure 10
Figure 10: Sparse channeling in the down-projection. The weight distribution corresponding to the super neuron in Wdown is heavy-tailed; massive activation is channeled solely into specific outlier dimensions.
Figure 12
Figure 12: Head-wise analysis in layer 2. The x-axis represents head indices. The bars (left axis) show the structural alignment between the sink key and the query matrix's principal direction; the red line (right axis) shows the ratio of positive attention scores. High alignment correlates with high positivity.
Figure 13
Figure 13: Head imbalance and signal magnitude. Attention heads are sorted by entropy. Low-entropy heads produce high-variance outputs, while high-entropy heads produce low-variance outputs.
Figure 14
Figure 14: Layer-wise trajectory of the dominance ratio. The baseline (red) exhibits a sharp escalation starting from early layers, indicating that the first token's representation is effectively hijacked by a single outlier dimension. Both Sigmoid (green) and head-wise RMSNorm (blue) maintain a consistently low dominance ratio.
Figure 15
Figure 15: Layer-wise effective rank of the hidden states. The baseline shows a distinct drop in effective rank, indicating manifold collapse caused by outlier dimensions. The proposed method maintains a higher effective rank, preserving representational capacity.
Figure 16
Figure 16: Validation loss.
Figure 18
Figure 18: Layer-wise evolution of attention sink and representation norms for Llama-3. The attention score of the first token (left axis, blue) and its input representation l2-norm (right axis, red). The synchronized spike indicates that the arrival of a high-norm representation triggers the attention sink.
Figure 19
Figure 19: Inducing attention sinks via mask intervention on Llama-3. A mask applied to an arbitrary intermediate token prevents it from aggregating values (blocking its attention to prior tokens); the intervention effectively induces an attention sink on the targeted non-aggregating token.
Figure 20
Figure 20: Inducing attention sinks via variance amplification. A factor λ amplifies the variance of an arbitrary token (index 10); increasing λ directly increases the attention score received by that token.
Figure 21
Figure 21: Effective rank dynamics in open-source models. The effective rank of hidden states across layers for Llama-2-7B (blue) and Llama-3-8B (orange). Both models exhibit a characteristic precipitous drop in rank within the first few layers (shallow-layer collapse), mirroring the baseline behavior.
Figure 23
Figure 23: Comparison of dimension-wise variance between the baseline and gated models in Layer 1. Gating eliminates the severe variance discrepancy present in the baseline.
read the original abstract

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a mechanistic explanation for the attention sink phenomenon in LLMs. It traces the root to variance discrepancy induced by value aggregation in self-attention, which is amplified by super neurons in FFN layers via channel-sparse down-projections leading to dimension disparity in the first-token representation. This disparity necessitates attention sinks as a structural anchor. The authors validate the causal chain with two interventions—attention mask modifications to isolate aggregation effects and targeted variance amplification—that replicate sinks at arbitrary positions. They also propose head-wise RMSNorm as an architectural change to stabilize value aggregation and accelerate convergence during pre-training.

Significance. If the proposed causal mechanism holds, this work offers a significant advance in understanding a common but poorly explained behavior in transformer-based LLMs. The controlled interventions provide direct support by demonstrating inducibility of sinks at new positions, and the head-wise RMSNorm modification demonstrates a practical application that improves training dynamics. This could lead to better architectural designs and training strategies for large models.

major comments (2)
  1. [Validation experiments (abstract and §4)] The two interventions are presented as validating the dimension-disparity-to-sink causal chain, but attention mask modifications can alter global attention entropy and position-wise value mixing independently of the first-token dimension disparity. The paper should provide quantitative comparisons of attention statistics before and after the mask change to confirm isolation (a sketch of such a comparison appears after these comments).
  2. [Validation experiments (abstract and §4)] Similarly, the variance amplification intervention may affect super-neuron activations or downstream normalization in ways unrelated to the original mechanism. Additional ablations or controls are needed to show that the replicated sinks are driven specifically by the induced dimension disparity rather than side effects.
minor comments (2)
  1. The definitions of 'super neurons' and 'head-wise RMSNorm' should be made more precise, with equations in the main text, for reproducibility.
  2. The manuscript would benefit from including full experimental details, such as model sizes, datasets, and exact quantitative results for the interventions, which are currently summarized at a high level.
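The quantitative comparison the first major comment asks for is easy to state precisely. A minimal sketch, assuming per-head attention matrices are available from a forward pass; the function name and shapes are illustrative:

```python
import numpy as np

def attention_statistics(A):
    """Per-position statistics from an attention tensor of shape (heads, seq, seq).

    Returns, averaged over heads, the entropy of each query's attention distribution
    and the total attention mass received by each key position -- the two quantities
    a pre/post mask-intervention comparison would need to report separately.
    """
    eps = 1e-12
    entropy = -(A * np.log(A + eps)).sum(axis=-1)    # (heads, seq): per-query entropy
    received = A.sum(axis=-2)                        # (heads, seq): mass per key position
    return entropy.mean(axis=0), received.mean(axis=0)

# A genuine isolation of the aggregation effect should leave the entropy profile
# nearly unchanged while the received-mass profile shifts toward the induced sink.
```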

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments on the validation experiments are well-taken and point to opportunities for strengthening the isolation of causal effects. We address each major comment below and have revised the paper to include the requested quantitative analyses and controls.

read point-by-point responses
  1. Referee: [Validation experiments (abstract and §4)] The two interventions are presented as validating the dimension-disparity-to-sink causal chain, but attention mask modifications can alter global attention entropy and position-wise value mixing independently of the first-token dimension disparity. The paper should provide quantitative comparisons of attention statistics before and after the mask change to confirm isolation.

    Authors: We agree that explicit quantification is necessary to rule out confounding changes in attention entropy or value mixing. Our mask modifications were constructed to selectively block non-first-token aggregation while leaving the first-token representation and its dimension disparity intact. In the revised manuscript we now include direct pre/post comparisons of global attention entropy, per-position value mixing distributions, and variance statistics (new Figure 4 and accompanying table in §4). These metrics show that entropy shifts are small and uncorrelated with the replicated sink positions, which instead track the preserved dimension disparity. This addition confirms the intended isolation without altering the original experimental design. revision: yes

  2. Referee: [Validation experiments (abstract and §4)] Similarly, the variance amplification intervention may affect super-neuron activations or downstream normalization in ways unrelated to the original mechanism. Additional ablations or controls are needed to show that the replicated sinks are driven specifically by the induced dimension disparity rather than side effects.

    Authors: We acknowledge the possibility of unintended effects on super-neuron firing or normalization. The variance amplification was applied only to targeted token hidden states chosen to mimic the observed dimension disparity. In the revised §4 we now report additional ablation controls that (i) monitor super-neuron activation histograms before and during amplification and (ii) track downstream LayerNorm/RMSNorm statistics. When these side-effect variables are held constant across conditions, the attention-sink replication remains tied to the induced dimension disparity and disappears when the disparity is removed. These controls are presented alongside the original intervention results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper traces attention sinks to value aggregation in self-attention (inducing variance discrepancy), amplified by observed super-neuron activation in FFN down-projections that create first-token dimension disparity. This is presented as a mechanistic explanation rather than a definitional equivalence. Validation relies on two independent interventions (attention-mask modifications and targeted variance amplification) that replicate sinks at arbitrary positions, plus a proposed architectural change (head-wise RMSNorm). No load-bearing step reduces by construction to fitted parameters, self-citations, or renamed inputs; the central claim rests on empirical tracing and controlled experiments that are falsifiable outside the paper's own equations. This is the expected non-finding for a paper whose core argument is experimentally grounded rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard transformer mechanics plus empirical observations of super neurons and dimension disparity within the studied models; no free parameters are explicitly fitted to produce the explanation, and the interventions serve as external checks.

axioms (1)
  • standard math Standard mathematical properties of self-attention value aggregation and FFN down-projections in transformers
    The variance-discrepancy argument begins from the known aggregation step in attention and the channel-sparse nature of FFN projections.
invented entities (2)
  • super neurons no independent evidence
    purpose: Amplify variance discrepancy between first-token and other-token representations
    Observed activation pattern within FFN layers that the paper links to the sink mechanism
  • head-wise RMSNorm no independent evidence
    purpose: Stabilize value aggregation outputs across positions during pre-training
    New architectural modification proposed as a proof-of-concept fix

pith-pipeline@v0.9.0 · 5518 in / 1376 out tokens · 65896 ms · 2026-05-08T12:07:05.922352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1]

    Layer Normalization

    Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

  2. [2]

    Why Do LLMs Attend to the First Token?

    Barbero, F., Arroyo, A., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., and Pascanu, R. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732, 2025.

  3. [3]

    Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

    Bondarenko, Y., Nagel, M., and Blankevoort, T. Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems, 36:75067–75096, 2023.

  4. [4]

    Spectral Filters, Dark Signals, and Attention Sinks

    Cancedda, N. Spectral filters, dark signals, and attention sinks. arXiv preprint arXiv:2402.09221, 2024.

  5. [5]

    Conversational Agents in Therapeutic Interventions for Neurodevelopmental Disorders: A Survey

    Catania, F., Spitale, M., and Garzotto, F. Conversational agents in therapeutic interventions for neurodevelopmental disorders: a survey. ACM Computing Surveys, 55(10):1–34, 2023.

  6. [6]

    Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

    Chen, Y. and Yao, Q. Attention sinks induce gradient sinks. arXiv preprint arXiv:2603.17771, 2026.

  7. [7]

    What Does BERT Look At? An Analysis of BERT's Attention

    Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.

  8. [8]

    OpenWebText Corpus

    Gokaslan, A., Cohen, V., Pavlick, E., and Tellex, S. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

  9. [9]

    When Attention Sink Emerges in Language Models: An Empirical View

    Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024.

  10. [10]

    LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

    Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-Infinite: Zero-shot extreme length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.

  11. [11]

    Deep Residual Learning for Image Recognition

    He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

  12. [12]

    A New Measure of Rank Correlation

    Kendall, M. G. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938.

  13. [13]

    Biometrika 30(1-2), 81–93 (1938)

    Kendall, M. G. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938. doi: 10.1093/biomet/30.1-2.81. URL https://doi.org/10.1093/biomet/30.1-2.81.

  14. [14]

    Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences

    Li, S., Tong, Y., Wang, H., and Hu, T. Transformers are born biased: Structural inductive biases at random initialization and their practical consequences. arXiv preprint arXiv:2602.05927, 2026.

  15. [15]

    IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

    Liu, R., Bai, H., Lin, H., Li, Y., Gao, H., Xu, Z., Hou, L., Yao, J., and Yuan, C. IntactKV: Improving large language model quantization by keeping pivot tokens intact. arXiv preprint arXiv:2403.01241, 2024.

  16. [16]

    nGPT: Normalized Transformer with Representation Learning on the Hypersphere

    Loshchilov, I., Hsieh, C.-P., Sun, S., and Ginsburg, B. nGPT: Normalized transformer with representation learning on the hypersphere. arXiv preprint arXiv:2410.01131, 2024.

  17. [17]

    The Devil in Linear Transformer

    Qin, Z., Han, X., Sun, W., Li, D., Kong, L., Barnes, N., and Zhong, Y. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022.

  18. [18]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025.

  19. [19]

    Theory, Analysis, and Best Practices for Sigmoid Self-Attention

    Ramapuram, J., Danieli, F., Dhekane, E., Weers, F., Busbridge, D., Ablin, P., Likhomanenko, T., Digani, J., Gu, Z., Shidani, A., et al. Theory, analysis, and best practices for sigmoid self-attention. arXiv preprint arXiv:2409.04431, 2024.

  20. [20]

    The Effective Rank: A Measure of Effective Dimensionality

    Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pp. 606–610, 2007.

  21. [21]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

  22. [22]

    Prefixing Attention Sinks Can Mitigate Activation Outliers for Large Language Model Quantization

    Son, S., Park, W., Han, W., Kim, K., and Lee, J. Prefixing attention sinks can mitigate activation outliers for large language model quantization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2242–2252, 2024.

  23. [23]

    SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

    Tong, Y., Wang, H., Li, S., Kawaguchi, K., and Hu, T. SeedPrints: Fingerprints can even tell which seed your large language model was trained from. arXiv preprint arXiv:2509.26404, 2025.

  24. [24]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  25. [25]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  26. [26]

    Analyzing the Structure of Attention in a Transformer Language Model

    Vig, J. and Belinkov, Y. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284, 2019.

  27. [27]

    LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

    Wan, Z., Wu, Z., Liu, C., Huang, J., Zhu, Z., Jin, P., Wang, L., and Yuan, L. LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139, 2024.

  28. [28]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

  29. [29]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024.

  30. [30]

    Unveiling and Controlling Anomalous Attention Distribution in Transformers

    Yan, R., Du, X., Deng, H., Zheng, L., Sun, Q., Hu, J., Shao, Y., Jiang, P., Jiang, J., and Zhao, L. Unveiling and controlling anomalous attention distribution in transformers. arXiv preprint arXiv:2407.01601, 2024.

  31. [31]

    Interpreting the Repeated Token Phenomenon in Large Language Models

    Yona, I., Shumailov, I., Hayes, J., Barbero, F., and Gandelsman, Y. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908, 2025.