pith. sign in

arxiv: 2606.06521 · v1 · pith:UMJCWGZ4new · submitted 2026-06-02 · 💻 cs.AR · cs.AI· cs.DC· cs.LG· cs.PF

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2⁸

Pith reviewed 2026-06-28 07:53 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.LGcs.PF
keywords FP8 attentionP-collapseattention sinkKV iteration orderstatic scalingE4M3underflowsoftmax precision
0
0 comments X

The pith

Forward KV iteration causes P-collapse in FP8 attention by underflowing a normal-tail fraction of non-sink probabilities, which reverse iteration with S=256 prevents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines precision loss when the attention probability matrix P is cast to FP8 (E4M3) before the P·V multiplication. It shows that forward iteration over KV blocks produces P-collapse, in which a fraction Φ(Δ + δ_k - 6.93 - ln S) of non-sink P values underflow to zero because of the attention sink. Reverse iteration order eliminates the collapse, and pairing it with the static scale S=256 supplies a zero-underflow guarantee. The paper characterizes S=256 as the unique power-of-two scale that simultaneously meets bit-exact IEEE 754 scaling, the lower envelope of the sawtooth quantization-step function dp(S), and the largest normal-range coverage among such scales. Kernel experiments that isolate the P-cast step report 3–10× lower MSE when the two fixes are used.

Core claim

Forward KV iteration causes P-collapse -- to leading order a fraction Φ(Δ + δ_k - 6.93 - ln S) of non-sink P values underflow to zero, where the small shift δ_k ≈ 1 (for k_sink = 4) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with S=256. S = 256 = 2^8 is the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function dp(S) over the E4M3 number line (dp = 2^{-4}), and (iii) the maximum normal-range coverage among bit-exact (2^k) scales.

What carries the argument

The P-collapse quantified by the normal-tail expression Φ(Δ + δ_k - 6.93 - ln S) under forward KV iteration, together with the three optimality conditions that single out S=256 for FP8 casting of P.

If this is right

  • Reverse KV iteration removes P-collapse.
  • Reverse iteration combined with S=256 supplies a zero-underflow guarantee.
  • S=256 simultaneously satisfies bit-exact IEEE 754 scaling, the lower envelope of dp(S), and maximum normal-range coverage among 2^k scales.
  • Kernel-faithful experiments show 3–10× MSE improvement at moderate sink strengths.
  • The two fixes together saturate to the same precision floor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The closed-form threshold Δ_c = 6.93 + ln S - δ_k supplies a direct way to predict kernel-level precision loss from sink strength alone.
  • Non-bit-exact scales such as 448 attain slightly higher coverage but sacrifice the bit-exact property that S=256 preserves.
  • If per-row score distributions deviate from the assumed normal tails, the predicted collapse fraction will no longer match observations.
  • The same iteration-order and scale considerations may apply to other low-precision formats when attention sinks are present.

Load-bearing premise

The leading-order underflow model assumes that non-sink attention scores are distributed such that their underflow probability is captured by the normal tail Φ(Δ + δ_k - 6.93 - ln S) with δ_k ≈ 1 for k_sink = 4.

What would settle it

Measure the actual fraction of non-sink P entries that underflow to zero in a controlled kernel run with known Δ, δ_k, and S, then compare the measured fraction against the predicted value of Φ(Δ + δ_k - 6.93 - ln S).

Figures

Figures reproduced from arXiv: 2606.06521 by Reed Lau.

Figure 1
Figure 1. Figure 1: The normalized quantization step dp(S) for E4M3. All 2 k values (blue) sit on the lower envelope at dp = 2−4 . S = 256 (green star) is the rightmost before overflow. S = 448 (red) sits 14% above the envelope. Theorem 6 (dp(S) structure). (i) For all S = 2k with in￾teger k ∈ {0, 1, . . . , 8} (i.e. S ∈ {1, 2, 4, . . . , 256}): dp(2k ) = 2−4 . (ii) For all S ∈ [2−6 , 448] with S ̸= 2k : dp(S) > 2 −4 . (iii) … view at source ↗
Figure 2
Figure 2. Figure 2: Top: MSE vs. sink strength. Forward+S=1 exhibits a peak at ∆ = 7–8 (the transition region). All other con￾figurations remain at the quantization-noise floor. Bottom: Diagnostic—non-sink probability mass (blue bars) and frac￾tion of P values zeroed by S=1 cast (red bars) vs. S=256 cast (green line) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: MSE vs. sequence length at ∆ = 7. The improve￾ment of Forward+S=256 over Forward+S=1 grows from 1.3× (N=512) to 10× (N=16384) as non-sink probability mass increases [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix~$P$ is cast to FP8 before the $P \cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the \emph{Attention Sink} phenomenon: (1)~the KV block iteration order, and (2) the static scaling factor applied to $P$ before casting. We show that forward KV iteration causes \emph{P-collapse} -- to leading order a fraction $\Phi(\Delta + \delta_k - 6.93 - \ln S)$ of non-sink $P$ values underflow to zero, where the small shift $\delta_k \approx 1$ (for $k_{\text{sink}}{=}4$) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with $S{=}256$. We further give a constructive characterization of $S = 256 = 2^8$ as the static scale that simultaneously satisfies (i)~bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp = 2^{-4}$, the minimum worst-case quantization step), and (iii)~the maximum normal-range coverage \emph{among bit-exact ($2^k$) scales} (a non-bit-exact scale such as $448$ attains slightly higher coverage; sec.5}). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of \emph{why} these choices are good and a closed-form threshold $\Delta_c = 6.93 + \ln S - \delta_k$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q, K, V$ in FP32 to isolate the P-cast effect) show $3$-$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined -- which motivated updating the hpc-ops kernel from $S{=}1$ to $S{=}256$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that forward KV iteration in FP8 (E4M3) attention causes P-collapse, with the fraction of underflowing non-sink P values given to leading order by the normal CDF Φ(Δ + δ_k - 6.93 - ln S) (δ_k ≈ 1 for k_sink = 4), while reverse iteration removes it and guarantees zero underflow when combined with S = 256. It further provides a constructive characterization of S = 256 = 2^8 as the static scale that simultaneously satisfies bit-exact IEEE 754 scaling, the lower envelope of the sawtooth dp(S) (dp = 2^{-4}), and maximum normal-range coverage among bit-exact (2^k) scales. Kernel-faithful experiments (Q, K, V in FP32) report 3–10× MSE improvement at moderate sink strengths and confirm that both fixes saturate to the same precision floor.

Significance. If the results hold, the work supplies a quantitative account of why reverse iteration and S = 256 are effective in FlashAttention-3/4, together with the closed-form threshold Δ_c = 6.93 + ln S - δ_k for predicting kernel-level precision loss. The kernel-faithful experiments that isolate the P-cast effect are a strength, as is the explicit multi-criterion characterization of the optimal scale.

major comments (2)
  1. [Abstract (Φ expression) and corresponding derivation section] The leading-order underflow model (Abstract, paragraph describing the Φ expression): the fraction Φ(Δ + δ_k - 6.93 - ln S) and the derived threshold Δ_c assume that non-sink attention scores are distributed such that their underflow probability matches the standard normal tail. No histogram, QQ-plot, or Kolmogorov-Smirnov test validating normality in the critical tail region is supplied; the kernel-faithful experiments report only aggregate MSE. This assumption is load-bearing for the claimed collapse fraction and zero-underflow guarantee.
  2. [Section 5] Section 5 (optimality characterization): the claim that S = 256 simultaneously satisfies the three listed criteria is scoped to bit-exact (2^k) scales, yet the text notes that a non-bit-exact scale such as 448 attains slightly higher coverage. The optimality statement should be restated with explicit scope to avoid implying global optimality.
minor comments (2)
  1. [Abstract] The parameter δ_k is introduced as the expected within-sink-block score maximum with an approximate value; a precise definition or derivation equation should be given at first use.
  2. [Experiments section] The paired-test results confirming saturation to the same precision floor would be clearer with a table of MSE values across forward, reverse, and combined configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and agree that both points warrant revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract (Φ expression) and corresponding derivation section] The leading-order underflow model (Abstract, paragraph describing the Φ expression): the fraction Φ(Δ + δ_k - 6.93 - ln S) and the derived threshold Δ_c assume that non-sink attention scores are distributed such that their underflow probability matches the standard normal tail. No histogram, QQ-plot, or Kolmogorov-Smirnov test validating normality in the critical tail region is supplied; the kernel-faithful experiments report only aggregate MSE. This assumption is load-bearing for the claimed collapse fraction and zero-underflow guarantee.

    Authors: We agree that the derivation relies on a normality assumption for non-sink attention scores in the tail region and that this is load-bearing for the closed-form collapse fraction. While the kernel-faithful experiments produce MSE results consistent with the model predictions, we acknowledge the absence of direct statistical validation. In the revised manuscript we will add QQ-plots of the relevant score distributions together with a Kolmogorov-Smirnov test focused on the critical tail to substantiate the assumption. revision: yes

  2. Referee: [Section 5] Section 5 (optimality characterization): the claim that S = 256 simultaneously satisfies the three listed criteria is scoped to bit-exact (2^k) scales, yet the text notes that a non-bit-exact scale such as 448 attains slightly higher coverage. The optimality statement should be restated with explicit scope to avoid implying global optimality.

    Authors: The abstract already qualifies the optimality claim as holding among bit-exact (2^k) scales and explicitly notes that a non-bit-exact scale such as 448 attains slightly higher coverage. To prevent any ambiguity, we will revise the wording in Section 5 to restate this scope with equal explicitness in the main text. revision: yes

Circularity Check

0 steps flagged

No circularity: model derives from explicit distributional assumption and constructive characterization, not self-fit or self-citation

full rationale

The leading-order expression Φ(Δ + δ_k - 6.93 - ln S) is presented as following from a normal-tail model of non-sink scores together with the FP8 underflow threshold; the constant 6.93 is introduced as part of that derivation rather than obtained by fitting the target collapse fraction. The optimality claim for S=256 is supported by three independent constructive properties (IEEE bit-exact scaling, sawtooth dp(S) lower envelope, and maximal normal-range coverage among 2^k scales) with no reduction to a fitted parameter or prior self-result. No self-citations appear in the load-bearing steps. The paper explicitly flags the normality assumption as the weakest link, rendering the claim falsifiable rather than tautological by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full list of assumptions and parameters cannot be extracted. The visible elements are the normal-tail model and the fitted δ_k shift.

free parameters (1)
  • δ_k = ~1
    Expected within-sink-block score maximum (≈1 for k_sink=4) used inside the underflow fraction Φ(...).
axioms (1)
  • domain assumption Non-sink attention scores are distributed such that their underflow probability after scaling is given by the normal tail Φ(Δ + δ_k - 6.93 - ln S).
    Invoked to obtain the leading-order collapse fraction; location: abstract paragraph describing P-collapse.

pith-pipeline@v0.9.1-grok · 5970 in / 1606 out tokens · 19666 ms · 2026-06-28T07:53:34.255285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

10 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

    Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veli ˇckovi´c, and Razvan Pascanu. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732, 2025

  2. [2]

    FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

    Tri Dao. FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

  3. [3]

    Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in Neural Information Processing Systems, 35, 2022

  4. [4]

    When Attention Sink Emerges in Language Models: An Empirical View

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024

  5. [5]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

  6. [6]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.arXiv preprint arXiv:2407.08608, 2024

  7. [7]

    Massive Activations in Large Language Models

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

  8. [8]

    Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

  9. [9]

    SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. arXiv preprint arXiv:2411.10958, 2024

  10. [10]

    SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025

    Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, and Jianfei Chen. SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025. A Completedp(S)Verification We verify Theorem 6 by exhaustive computation over all 126 positive E4M3 values: for each S, take the binade w...