pith. sign in

arxiv: 2606.02288 · v1 · pith:D2V54BGJnew · submitted 2026-06-01 · 💻 cs.LG

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

Pith reviewed 2026-06-28 15:15 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation spikesLLM quantizationattention sinkvector biasesrotational stabilitypost-training quantizationRotary Positional Embedding
0
0 comments X

The pith

Massive activation spikes in LLMs are the scalar form of rigid structural vector biases in spike-carrying tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that massive activation spikes are not random but the scalar intermediates of fixed vector biases in particular tokens. These tokens normalize to constant vectors that trigger attention sink and value-state drain. The model maintains these biases in zones of rotational stability using low-frequency bands and coherent channel pairs to counter positional embedding effects. This understanding allows a quantization method that clamps spikes and substitutes pre-computed template vectors to keep the biases intact. The result is spike-free activations that support high-fidelity low-bit quantization for language models and vision transformers alike.

Core claim

Spike-carrying tokens converge to constant vectors after normalization, with W_K contrastively amplifying the vector, W_Q aligning semantic tokens toward it, and W_V projecting it into the spectral null-space. The model preserves these structural biases against RoPE perturbations by localizing them in zones of rotational stability utilizing low-frequency bands and coherent channel pairs. This leads to INSERTQUANT, a PTQ framework that clamps spikes and restores their function via pre-computed template vectors, rendering activations strictly spike-free.

What carries the argument

zones of rotational stability that localize structural vector biases using low-frequency bands and coherent channel pairs to preserve them against Rotary Positional Embedding perturbations

If this is right

  • Clamping spikes and using template vectors maintains model function without new errors from the substitution.
  • Activations become strictly spike-free, enabling robust low-bit quantization.
  • The method achieves parity with state-of-the-art per-tensor quantization methods on LLMs.
  • It generalizes beyond text to other modalities such as vision transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar bias mechanisms may exist in other transformer architectures beyond the tested LLMs.
  • Understanding these stability zones could inform better positional embedding designs.
  • Template restoration might be adapted for dynamic input distributions by learning templates on the fly.

Load-bearing premise

Pre-computed template vectors can fully restore the function of the original spike-carrying tokens without introducing new errors in downstream layers or across different input distributions.

What would settle it

Measuring whether model performance drops when template vectors are used on a new input distribution or in a different model variant where the spike mechanism differs.

Figures

Figures reproduced from arXiv: 2606.02288 by Chung Peng Lee, Naveen Verma, Yung-Chin Chen, Ze-Wei Liou.

Figure 1
Figure 1. Figure 1: Visualizing the Bias Vector Hypothesis. Bottom: RMSNorm standardizes extreme input spikes (e.g., “\n”) into a stable direction. Top: While semantic tokens vary significantly to encode information (high variance), spike-carrying tokens converge to rigid Bias Vector (b) after RMSNorm, exhibiting negligible vari￾ance independent of the input sequence. across all tokens) via smoothing (Xiao et al., 2023a) or r… view at source ↗
Figure 2
Figure 2. Figure 2: Mechanistic Lifecycle of the Bias Vector in Self-Attention. The rigid bias vector b orchestrates to allow no-update via three coordinated stages: (1) Attention Sink: WK isolate this bias b from semantic tokens while WQ universally aligns queries toward it to create attention sinks; (2) Preservation under RoPE: The bias b is encoded in zones of rotational stability (low-frequency bands and coherent pairs) t… view at source ↗
Figure 3
Figure 3. Figure 3: Geometric Mechanics of the Attention Sink. (a) WK separates the sink token (ksink) from semantics, while WQ strictly aligns semantic queries (qsem) to the sink key. (b) Sink tokens (red) concentrate energy exclusively in the spectral tail of WV (σ ≈ 0), unlike semantic tokens (blue) which span the amplification modes. (c) The channels driving the attention sink mechanism (top-2 score) strictly localize in … view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the INSERTQUANT Framework. INSERTQUANT eliminates massive spikes to enhance quantization through a three-step process. Clamp Spikes: Sink tokens carrying spikes are identified and clamped to zero in the Emergence Phase, as these tokens function as static bias terms. Spike-free Quantization: With spikes removed, the remaining activations become quantization-friendly. Insert Bias Effects Back: Th… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of Bias Templates. Comparison of the learned bias prototypes (bk) at the RMSNorm output. (a) The two LLaMA2 templates are nearly identical, overlapping in both spike location and magnitude. (b) The three Mistral templates share a global structure (e.g., spike at ∼ 2100) but exhibit visible variations in shape and secondary features. • Sink Tokens (xˆsink ∼ b): Randomly sam￾pled from positions… view at source ↗
Figure 6
Figure 6. Figure 6: Channel-wise variance at the Pre-Attention RMSNorm output across layers. The sink tokens (blue) exhibit extremely low variance compared to normal semantic tokens (orange) until the disappearance phase at Layer 31. The learnable scaling factor is merged to the subsequent weights. Here for sink tokens, we show the second bias vector b2 of LLaMA2-7B, which has slightly larger variances compared to b1 ology fo… view at source ↗
Figure 7
Figure 7. Figure 7: Channel-wise variance at the Post-Attention RMSNorm output across layers. The trend remains consistent with the pre-attention observations, confirming the stability of the template vector through the attention block. The variance for sink tokens rise significantly since the disappearance phase at Layer 31. F.2 Methodology: Channel Contribution Analysis To identify the specific features driving the atten￾ti… view at source ↗
Figure 8
Figure 8. Figure 8: Layer-wise Consistency of Spectral Null-space Projection. Analysis of the Cumulative Spectral Energy (CSE) for Normal Tokens (Blue) and Template Vectors (Red) with 5th–95th percentile bands across all layers in the persistence phase (Layers 2–30). The Grey dashed line indicates the normalized singular value spectrum of WV . The pattern is strictly consistent across the entire network depth: normal tokens d… view at source ↗
Figure 9
Figure 9. Figure 9: Rotation-based method amortizes spike across channels, but the token-wise extremes remain. Prefix-based method does not guarantee spike elimina￾tion when a <BOS> token exists in the start of sequence [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that these tokens converge to constant vectors after normalization that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: $W_K$ contrastively amplifies the vector, $W_Q$ aligns semantic tokens toward it, and $W_V$ projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in "zones of rotational stability" utilizing low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that massive activation spikes in LLMs are scalar intermediates of rigid structural vector biases in spike-carrying tokens. These tokens converge to constant vectors after normalization, driving attention sink and value-state drain. The authors geometrically analyze the coordination of W_K (contrastive amplification), W_Q (semantic alignment), and W_V (spectral null-space projection), and show that the model preserves these biases in 'zones of rotational stability' via low-frequency bands and coherent channel pairs despite RoPE perturbations. They propose INSERTQUANT, a PTQ framework that clamps spikes and restores function with pre-computed template vectors to produce strictly spike-free activations, achieving parity with SOTA per-tensor quantization on LLMs while generalizing to ViTs.

Significance. If the mechanistic reinterpretation and template-based restoration hold with the required invariance properties, the work would advance understanding of activation spikes beyond scalar-bias hypotheses and provide a practical route to robust low-bit quantization. The cross-modal generalization, if experimentally supported, would increase impact. The attempt to tie geometric weight coordination and rotational stability to quantization is a positive direction.

major comments (2)
  1. [Abstract] Abstract (final paragraph): the claim that pre-computed template vectors fully restore the original function of spike-carrying tokens without introducing new errors in downstream layers or across input distributions lacks any reported validation of downstream activation invariance, attention-map preservation, or cross-distribution stability. This assumption is load-bearing for both the spike-free property and the reported parity with SOTA methods.
  2. [Abstract] Abstract: the geometric substantiation that W_K contrastively amplifies the vector, W_Q aligns semantic tokens, and W_V projects into the spectral null-space is stated at a high level without explicit equations, derivation steps, or quantitative measures of these effects. This analysis underpins the central claim that spikes are structural vector biases rather than scalar phenomena.
minor comments (1)
  1. [Abstract] The terms 'zones of rotational stability' and 'coherent channel pairs' are introduced without a forward reference to their definition or operationalization in later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, proposing revisions where the manuscript can be strengthened.

read point-by-point responses
  1. Referee: [Abstract] Abstract (final paragraph): the claim that pre-computed template vectors fully restore the original function of spike-carrying tokens without introducing new errors in downstream layers or across input distributions lacks any reported validation of downstream activation invariance, attention-map preservation, or cross-distribution stability. This assumption is load-bearing for both the spike-free property and the reported parity with SOTA methods.

    Authors: We acknowledge this point. The manuscript reports that INSERTQUANT achieves parity with SOTA per-tensor quantization on LLMs and generalizes to ViTs, which serves as empirical support for the restoration of function. However, to directly address the request for validation of downstream activation invariance, attention-map preservation, and cross-distribution stability, we will add targeted experiments and analysis in the revised version of the paper. revision: yes

  2. Referee: [Abstract] Abstract: the geometric substantiation that W_K contrastively amplifies the vector, W_Q aligns semantic tokens, and W_V projects into the spectral null-space is stated at a high level without explicit equations, derivation steps, or quantitative measures of these effects. This analysis underpins the central claim that spikes are structural vector biases rather than scalar phenomena.

    Authors: The full geometric analysis with equations, derivation steps, and quantitative measures is provided in Section 3 of the manuscript. To improve the self-contained nature of the abstract, we will incorporate a concise reference to these elements or key quantitative findings in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives its mechanistic claims about spikes as vector biases from geometric analysis of W_Q/W_K/W_V coordination and RoPE stability zones, then proposes INSERTQUANT as a separate PTQ technique using pre-computed templates. No step reduces a claimed result to its own inputs by definition, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation chain is present. The template restoration is a design choice whose fidelity is evaluated externally against SOTA methods rather than guaranteed by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim depends on the existence of 'zones of rotational stability' and the functional equivalence of template vectors; both appear introduced by the paper without external derivation or machine-checked proof.

free parameters (1)
  • template vectors
    Pre-computed vectors used to replace spikes; their construction is not shown to be parameter-free or derived from first principles.
axioms (1)
  • domain assumption Spike-carrying tokens converge to constant vectors after normalization
    Invoked in the second sentence of the abstract as the basis for the vector-bias claim.
invented entities (1)
  • zones of rotational stability no independent evidence
    purpose: Explain preservation of bias vectors under RoPE perturbations
    Newly introduced construct localized to low-frequency bands and coherent channel pairs; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5759 in / 1550 out tokens · 21452 ms · 2026-06-28T15:15:51.277947+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 3 canonical work pages

  1. [1]

    Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

    Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213–100240. Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veliˇckovi´c, and Razvan Pascanu. 2025. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732. Yonatan Bisk, Rowan Zelle...

  2. [2]

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and 1 others

    What are you sinking? a geometric approach on attention sink.arXiv preprint arXiv:2508.02546. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and 1 others. 2015. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3...

  3. [3]

    Mitigating quantization errors due to activation spikes in glu-based llms,

    Mitigating quantization errors due to acti- vation spikes in glu-based llms.arXiv preprint arXiv:2405.14428. Peter Young, Alice Lai, Micah Hodosh, and Julia Hock- enmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic in- ference over event descriptions.Transactions of the association for computational linguistic...

  4. [4]

    Push- Pull

    Consequently, by Layer 31, these tokens no longer carry the rigid sink signal and return to a standard semantic variance profile. E Geometric and Spectral Characterization of Attention Sinks In this section, we provide the detailed experimen- tal setup and mathematical formulation used to derive the mechanistic analysis of attention sinks presented in Sec...

  5. [5]

    modulo 64

    The learnable scaling factor is merged to the subsequent weights. Here for sink tokens, we show the second bias vectorb 2 of LLaMA2-7B, which has slightly larger variances compared tob 1 ology for computing channel contribution scores, and finally present granular evidence of channel pairing. F.1 RoPE Mechanics and Implementation Differences While the ori...

  6. [6]

    Due to the nascent state of the activation patterns, only the subset of sink tokens exhibiting early-onset spikes are detected

    Stage 1 (Layer 0):We perform an initial scan at the input of the Layer 0 Down-Projection us- ing another threshold (α0). Due to the nascent state of the activation patterns, only the subset of sink tokens exhibiting early-onset spikes are detected. The remaining sink tokens, whose spikes have not yet fully formed, are simply not detected (false negatives)...

  7. [7]

    This stage captures any remaining sink tokens that did not spark in Layer 0

    Stage 2 (Layer 1):We perform the defini- tive scan at the input of the Layer 1 Down- Projection (the standard emergence point). This stage captures any remaining sink tokens that did not spark in Layer 0. The union of indices detected in Stage 1 and Stage 2 forms the final set of sink token indices S, which is frozen and shared for all subsequent layers. ...

  8. [8]

    Sequential Alignment ( Nsink =K ):In standard sequences where the number of de- tected sink tokens matches the template count exactly, we assign templates based on their sequential order (i.e., the first detected sink maps to t1, the second to t2, etc.). We hy- pothesize that template types naturally follow a consistent sequential order due to the causal ...

  9. [9]

    MSE-Based Fallback (Nsink ̸=K ):In cases where the counts do not match, we resolve the mapping ambiguity by computing the Mean Squared Error (MSE) between the detected token’s statexs and the candidate templates. The token is assigned to the template k∗ that minimizes the reconstruction error: k∗ = arg min k ∥xs −t (detection) k ∥2 2 (11) Runtime Persiste...

  10. [10]

    Semantic-Only MSE Objective.Since IN- SERTQUANTintentionally clamps sink tokens to zero—introducing a massive theoretical error at the current layer that is corrected via insertion in the subsequent layer—including these tokens in the loss calculation would skew the optimization. Instead, we compute the reconstruction error exclu- sively over the semantic...

  11. [11]

    sink token

    Structure Preservation Constraint.Aggres- sive clipping can suppress activation magnitudes, potentially altering the set of tokens identified as spikes by our detection algorithm (false negatives). To prevent this, we constrain the search space. We define thedetection correctnessR correct as: Rcorrect = |Squant ∩ Sf p| |Sf p| (13) where Squant and Sf p ar...

  12. [12]

    (CC BY-SA 4.0), BoolQ (Clark et al.,

  13. [13]

    (CC BY-SA 3.0), PIQA (Bisk et al.,

  14. [14]

    • Vision: ImageNet-1K (Deng et al., 2009) (Custom; Non-Commercial), and Flickr30k (Young et al., 2014) (Custom; Research Use) via [HuggingFace]

    (Apache 2.0), OpenBookQA (Mihaylov et al., 2018) (Apache 2.0), Winogrande (Sak- aguchi et al., 2021) (Apache 2.0), and So- cialIQA (Sap et al., 2019) (CC BY 4.0). • Vision: ImageNet-1K (Deng et al., 2009) (Custom; Non-Commercial), and Flickr30k (Young et al., 2014) (Custom; Research Use) via [HuggingFace]. K.3 Software and Frameworks The following open-so...