Attention is all you need
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
citing papers explorer
- Massive Activations in Large Language Models
  Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
- Priming: Hybrid State Space Models From Pre-trained Transformers
  Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.
- EventFlow: Forecasting Temporal Point Processes with Flow Matching
  EventFlow applies flow matching to learn joint distributions over event times for temporal point processes, reporting 20-53% lower forecast error than autoregressive baselines on standard TPP benchmarks with fewer sampling calls.