Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.