Limitations of Normalization in Attention Mechanism
This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
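The two effects named in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's code or experimental setup: the function names (`softmax`, `max_weight`, `softmax_jacobian_norm`), the fixed score gap, and the toy score vectors are assumptions chosen for illustration only. The sketch shows that, when scores are bounded, the largest softmax weight shrinks as the number of tokens grows, and that the Jacobian of softmax with respect to the scores scales with 1/T, so near-tied scores become very sensitive at low temperature while well-separated scores saturate.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight(n, gap=2.0, temperature=1.0):
    """Largest weight when one token's score exceeds the other n - 1 by `gap`.

    Assumption for illustration: a single informative token and a fixed,
    bounded score gap that does not grow with n.
    """
    scores = [gap] + [0.0] * (n - 1)
    return max(softmax(scores, temperature))


def softmax_jacobian_norm(scores, temperature=1.0):
    """Frobenius norm of d softmax(s / T) / d s = (diag(p) - p p^T) / T."""
    p = softmax(scores, temperature)
    total = 0.0
    for i, pi in enumerate(p):
        for j, pj in enumerate(p):
            entry = ((pi if i == j else 0.0) - pi * pj) / temperature
            total += entry * entry
    return math.sqrt(total)


if __name__ == "__main__":
    # Dilution: with a bounded gap, the top weight decays as n grows.
    for n in (10, 100, 1000):
        print(n, max_weight(n))

    # Sensitivity: tied scores vs. well-separated scores at two temperatures.
    for t in (1.0, 0.1):
        print(t, softmax_jacobian_norm([0.0, 0.0], t),
              softmax_jacobian_norm([5.0, 0.0], t))
```

For two tied scores the norm is exactly 0.5 / T, so lowering the temperature from 1.0 to 0.1 multiplies the sensitivity by ten, whereas a well-separated pair at the same low temperature yields a gradient that is nearly zero. The paper's own bounds and GPT-2 measurements are more general than this toy setting.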
Forward citations
Cited by 5 Pith papers
- RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects
RopeDreamer uses quaternionic kinematic chains in a recurrent state space model with a dual decoder to cut open-loop prediction error by 40.52% over 50 steps on simulated DLO trajectories while preserving physical con...
- Evaluating Memory Capability in Continuous Lifelog Scenario
Sophisticated memory systems do not beat a basic RAG baseline in continuous lifelog scenarios, showing that high-fidelity context preservation matters more than complex compression.
- SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.
- Sessa: Selective State Space Attention
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
- You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection
Gated-CNN applies independent 1D convolutions and sigmoid gating to IMU streams from smartwatches, achieving 90-93% F1 on five datasets and 97% F1 with zero missed falls in real-time Pixel Watch testing, outperforming...