Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I Jordan, Song Mei · 2024 · arXiv 2410.13835

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 unclear 1

representative citing papers

Transformers Provably Learn to Internalize Chain-of-Thought

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

When Attention Sink Emerges in Language Models: An Empirical View

cs.CL · 2024-10-14 · accept · novelty 6.0

Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

cs.LG · 2026-04-16 · unverdicted · novelty 5.0

LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01

citing papers explorer

Showing 1 of 1 citing paper after filters.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse cs.CL · 2026-02-01 · unreviewed · ref 16

Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer