Massive activations first appear in a single ME Layer due to RMSNorm and FFN and remain invariant thereafter; a simple softening method raises LLM performance while reducing attention sinks.
What are you sinking? a geometric approach on attention sink
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (8)
roles: background (1)
polarities: background (1)
representative citing papers
ESM2 predicts N-terminal methionine via retrieval of a positional prior from the BOS token through distributed attention circuits rather than direct recognition, revealed by a norm-direction decomposition of rotary attention scores.
Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.
LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.
Massive spikes in LLMs are identified as rigid vector biases preserved in rotational stability zones; INSERTQUANT clamps them with template vectors to achieve spike-free PTQ with SOTA parity on LLMs and generalization to ViTs.
ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.
citing papers explorer
- Retrieval and competition: how a protein foundation model starts a protein
  ESM2 predicts N-terminal methionine via retrieval of a positional prior from the BOS token through distributed attention circuits rather than direct recognition, revealed by a norm-direction decomposition of rotary attention scores.
- A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation
  Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.
- Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation
  LARS constrains activation subspaces to decouple memory use from sequence length, cutting GPU memory by 33.5% and CPU memory by 52% versus LoRA while keeping accuracy comparable.
- Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization
  Massive spikes in LLMs are identified as rigid vector biases preserved in rotational stability zones; INSERTQUANT clamps them with template vectors to achieve spike-free PTQ with SOTA parity on LLMs and generalization to ViTs.
- ASAP: Attention Sink Anchored Pruning
  ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
- SinkTrack: Attention Sink based Context Anchoring for Large Language Models
  SinkTrack anchors LLMs to initial context by modifying the attention sink token with injected features, yielding gains on textual and multimodal tasks.