hub Canonical reference

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du · 2024 · cs.CL · arXiv 2410.10781

Canonical reference. 89% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9

citation-polarity summary

background 8 support 1

representative citing papers

Rover: Context-aware Conflict Resolution with LLM

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

Rover uses a new Multi-layer Code Property Graph and clustering to supply LLMs with dependency-aware contexts, outperforming standalone LLMs, MergeGen, and WizardMerge on similarity to ground-truth conflict resolutions.

A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

A Mechanistic Analysis of Looped Reasoning Language Models

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

Registers Matter for Pixel-Space Diffusion Transformers

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

Conditional Memory Enhanced Item Representation for Generative Recommendation

cs.IR · 2026-05-12 · unverdicted · novelty 6.0

ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

cs.CL · 2026-04-07 · conditional · novelty 6.0

Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

cs.IR · 2026-04-07 · unverdicted · novelty 6.0

STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation

cs.RO · 2026-02-26 · unverdicted · novelty 6.0

InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

cs.CV · 2026-02-08 · unverdicted · novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

cs.CL · 2025-05-10 · conditional · novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

cs.CL · 2024-10-17 · unverdicted · novelty 6.0

LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.

ASAP: Attention Sink Anchored Pruning

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.

Exploring Motion-Language Alignment for Text-driven Motion Generation

cs.CV · 2026-04-03 · unverdicted · novelty 5.0

MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

cs.LG · 2026-02-04 · unverdicted · novelty 5.0

BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01 · unverdicted · novelty 5.0

Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

cs.CL · 2025-11-26 · unverdicted · novelty 5.0

Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

citing papers explorer

Showing 23 of 23 citing papers.

Rover: Context-aware Conflict Resolution with LLM cs.SE · 2026-05-17 · unverdicted · none · ref 20 · internal anchor
Rover uses a new Multi-layer Code Property Graph and clustering to supply LLMs with dependency-aware contexts, outperforming standalone LLMs, MergeGen, and WizardMerge on similarity to ground-truth conflict resolutions.
A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models cs.CL · 2026-05-08 · conditional · none · ref 8 · 2 links · internal anchor
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning cs.LG · 2026-05-06 · unverdicted · none · ref 26 · internal anchor
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training cs.LG · 2026-04-23 · unverdicted · none · ref 27 · internal anchor
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
A Mechanistic Analysis of Looped Reasoning Language Models cs.LG · 2026-04-13 · unverdicted · none · ref 13 · internal anchor
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
Registers Matter for Pixel-Space Diffusion Transformers cs.CV · 2026-05-15 · unverdicted · none · ref 31 · internal anchor
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction cs.AI · 2026-05-13 · unverdicted · none · ref 34 · internal anchor
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Conditional Memory Enhanced Item Representation for Generative Recommendation cs.IR · 2026-05-12 · unverdicted · none · ref 13 · internal anchor
ComeIR introduces dual-level Engram memory and memory-restoring prediction to reconstruct SID-token embeddings and restore token granularity in generative recommendation.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity cs.LG · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 21 · internal anchor
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation cs.IR · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation cs.RO · 2026-02-26 · unverdicted · none · ref 45 · internal anchor
InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 24 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 32 · internal anchor
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free cs.CL · 2025-05-10 · conditional · none · ref 12 · internal anchor
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation cs.CL · 2024-10-17 · unverdicted · none · ref 17 · internal anchor
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
ASAP: Attention Sink Anchored Pruning cs.LG · 2026-05-21 · unverdicted · none · ref 13 · internal anchor
ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.
Exploring Motion-Language Alignment for Text-driven Motion Generation cs.CV · 2026-04-03 · unverdicted · none · ref 15 · internal anchor
MLA-Gen advances text-driven motion synthesis by aligning global motion patterns with fine-grained text semantics and mitigating attention sink effects via new masking techniques.
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models cs.LG · 2026-02-04 · unverdicted · none · ref 8 · internal anchor
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse cs.CL · 2026-02-01 · unverdicted · none · ref 15 · internal anchor
Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.
MiMo-V2-Flash Technical Report cs.CL · 2026-01-06 · unverdicted · none · ref 21 · internal anchor
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining cs.CL · 2025-11-26 · unverdicted · none · ref 10 · internal anchor
Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

When Attention Sink Emerges in Language Models: An Empirical View

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer