Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 3representative citing papers
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
citing papers explorer
-
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation
Attention sinks in GPT-2 arise from the interaction of learned query bias, first-layer MLP on positional encodings, and key projection structure, with each component individually dispensable.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.