Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

· 2024 · arXiv 2410.13835

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

When Attention Sink Emerges in Language Models: An Empirical View

cs.CL · 2024-10-14 · accept · novelty 6.0

Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

cs.LG · 2026-04-16 · unverdicted · novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01

citing papers explorer

Showing 5 of 5 citing papers.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 163
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG · 2026-05-19 · unverdicted · none · ref 14
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
When Attention Sink Emerges in Language Models: An Empirical View cs.CL · 2024-10-14 · accept · none · ref 19
Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning cs.LG · 2026-04-16 · unverdicted · none · ref 11
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse cs.CL · 2026-02-01 · unreviewed · ref 16

Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer