P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

Reed Lau

arxiv: 2606.06521 · v1 · pith:UMJCWGZ4new · submitted 2026-06-02 · 💻 cs.AR · cs.AI· cs.DC· cs.LG· cs.PF

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2⁸

Reed Lau This is my paper

Pith reviewed 2026-06-28 07:53 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.DCcs.LGcs.PF

keywords FP8 attentionP-collapseattention sinkKV iteration orderstatic scalingE4M3underflowsoftmax precision

0 comments

The pith

Forward KV iteration causes P-collapse in FP8 attention by underflowing a normal-tail fraction of non-sink probabilities, which reverse iteration with S=256 prevents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines precision loss when the attention probability matrix P is cast to FP8 (E4M3) before the P·V multiplication. It shows that forward iteration over KV blocks produces P-collapse, in which a fraction Φ(Δ + δ_k - 6.93 - ln S) of non-sink P values underflow to zero because of the attention sink. Reverse iteration order eliminates the collapse, and pairing it with the static scale S=256 supplies a zero-underflow guarantee. The paper characterizes S=256 as the unique power-of-two scale that simultaneously meets bit-exact IEEE 754 scaling, the lower envelope of the sawtooth quantization-step function dp(S), and the largest normal-range coverage among such scales. Kernel experiments that isolate the P-cast step report 3–10× lower MSE when the two fixes are used.

Core claim

Forward KV iteration causes P-collapse -- to leading order a fraction Φ(Δ + δ_k - 6.93 - ln S) of non-sink P values underflow to zero, where the small shift δ_k ≈ 1 (for k_sink = 4) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with S=256. S = 256 = 2^8 is the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function dp(S) over the E4M3 number line (dp = 2^{-4}), and (iii) the maximum normal-range coverage among bit-exact (2^k) scales.

What carries the argument

The P-collapse quantified by the normal-tail expression Φ(Δ + δ_k - 6.93 - ln S) under forward KV iteration, together with the three optimality conditions that single out S=256 for FP8 casting of P.

If this is right

Reverse KV iteration removes P-collapse.
Reverse iteration combined with S=256 supplies a zero-underflow guarantee.
S=256 simultaneously satisfies bit-exact IEEE 754 scaling, the lower envelope of dp(S), and maximum normal-range coverage among 2^k scales.
Kernel-faithful experiments show 3–10× MSE improvement at moderate sink strengths.
The two fixes together saturate to the same precision floor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The closed-form threshold Δ_c = 6.93 + ln S - δ_k supplies a direct way to predict kernel-level precision loss from sink strength alone.
Non-bit-exact scales such as 448 attain slightly higher coverage but sacrifice the bit-exact property that S=256 preserves.
If per-row score distributions deviate from the assumed normal tails, the predicted collapse fraction will no longer match observations.
The same iteration-order and scale considerations may apply to other low-precision formats when attention sinks are present.

Load-bearing premise

The leading-order underflow model assumes that non-sink attention scores are distributed such that their underflow probability is captured by the normal tail Φ(Δ + δ_k - 6.93 - ln S) with δ_k ≈ 1 for k_sink = 4.

What would settle it

Measure the actual fraction of non-sink P entries that underflow to zero in a controlled kernel run with known Δ, δ_k, and S, then compare the measured fraction against the predicted value of Φ(Δ + δ_k - 6.93 - ln S).

Figures

Figures reproduced from arXiv: 2606.06521 by Reed Lau.

**Figure 1.** Figure 1: The normalized quantization step dp(S) for E4M3. All 2 k values (blue) sit on the lower envelope at dp = 2−4 . S = 256 (green star) is the rightmost before overflow. S = 448 (red) sits 14% above the envelope. Theorem 6 (dp(S) structure). (i) For all S = 2k with integer k ∈ {0, 1, . . . , 8} (i.e. S ∈ {1, 2, 4, . . . , 256}): dp(2k ) = 2−4 . (ii) For all S ∈ [2−6 , 448] with S ̸= 2k : dp(S) > 2 −4 . (iii) … view at source ↗

**Figure 2.** Figure 2: Top: MSE vs. sink strength. Forward+S=1 exhibits a peak at ∆ = 7–8 (the transition region). All other configurations remain at the quantization-noise floor. Bottom: Diagnostic—non-sink probability mass (blue bars) and fraction of P values zeroed by S=1 cast (red bars) vs. S=256 cast (green line) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: MSE vs. sequence length at ∆ = 7. The improvement of Forward+S=256 over Forward+S=1 grows from 1.3× (N=512) to 10× (N=16384) as non-sink probability mass increases [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix~$P$ is cast to FP8 before the $P \cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the \emph{Attention Sink} phenomenon: (1)~the KV block iteration order, and (2) the static scaling factor applied to $P$ before casting. We show that forward KV iteration causes \emph{P-collapse} -- to leading order a fraction $\Phi(\Delta + \delta_k - 6.93 - \ln S)$ of non-sink $P$ values underflow to zero, where the small shift $\delta_k \approx 1$ (for $k_{\text{sink}}{=}4$) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with $S{=}256$. We further give a constructive characterization of $S = 256 = 2^8$ as the static scale that simultaneously satisfies (i)~bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp = 2^{-4}$, the minimum worst-case quantization step), and (iii)~the maximum normal-range coverage \emph{among bit-exact ($2^k$) scales} (a non-bit-exact scale such as $448$ attains slightly higher coverage; sec.5}). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of \emph{why} these choices are good and a closed-form threshold $\Delta_c = 6.93 + \ln S - \delta_k$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q, K, V$ in FP32 to isolate the P-cast effect) show $3$-$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined -- which motivated updating the hpc-ops kernel from $S{=}1$ to $S{=}256$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies a closed-form threshold and leading-order collapse fraction for FP8 P underflow under attention sinks, but the normal-tail model lacks any empirical check.

read the letter

The main takeaway is that this work formalizes why reverse KV iteration plus S=256 avoids P-collapse in FP8 attention, with a new expression for the underflow fraction and a threshold Δ_c. It also gives a constructive reason why 256 is optimal among power-of-two scales for the E4M3 mantissa.

What is new is the explicit Φ(Δ + δ_k - 6.93 - ln S) term and the closed-form Δ_c. The paper shows that forward iteration produces a predictable fraction of zeroed non-sink probabilities while reverse iteration plus the chosen scale removes it to leading order. The experiments isolate the cast step and report 3-10× MSE reduction at moderate sink strengths, which matches the claimed directional effect.

The soft spot is the distribution assumption. The collapse fraction rests on non-sink scores behaving like a normal tail in the critical region, yet the abstract and stress-test note give no histograms, QQ plots, or per-row statistics to support that. If the actual tails are heavier or truncated differently by the sink max, the numerical prediction weakens even if the forward/reverse difference stays visible. The experiments report only aggregate MSE, not the per-row underflow counts that would test the model directly.

This is for kernel engineers tuning FP8 attention on current hardware. It is not reshaping the broader field but supplies a usable predictive threshold for people already shipping these kernels. The math is presented cleanly enough to merit referee time, mainly to verify the derivation steps and to ask for the missing distribution checks. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that forward KV iteration in FP8 (E4M3) attention causes P-collapse, with the fraction of underflowing non-sink P values given to leading order by the normal CDF Φ(Δ + δ_k - 6.93 - ln S) (δ_k ≈ 1 for k_sink = 4), while reverse iteration removes it and guarantees zero underflow when combined with S = 256. It further provides a constructive characterization of S = 256 = 2^8 as the static scale that simultaneously satisfies bit-exact IEEE 754 scaling, the lower envelope of the sawtooth dp(S) (dp = 2^{-4}), and maximum normal-range coverage among bit-exact (2^k) scales. Kernel-faithful experiments (Q, K, V in FP32) report 3–10× MSE improvement at moderate sink strengths and confirm that both fixes saturate to the same precision floor.

Significance. If the results hold, the work supplies a quantitative account of why reverse iteration and S = 256 are effective in FlashAttention-3/4, together with the closed-form threshold Δ_c = 6.93 + ln S - δ_k for predicting kernel-level precision loss. The kernel-faithful experiments that isolate the P-cast effect are a strength, as is the explicit multi-criterion characterization of the optimal scale.

major comments (2)

[Abstract (Φ expression) and corresponding derivation section] The leading-order underflow model (Abstract, paragraph describing the Φ expression): the fraction Φ(Δ + δ_k - 6.93 - ln S) and the derived threshold Δ_c assume that non-sink attention scores are distributed such that their underflow probability matches the standard normal tail. No histogram, QQ-plot, or Kolmogorov-Smirnov test validating normality in the critical tail region is supplied; the kernel-faithful experiments report only aggregate MSE. This assumption is load-bearing for the claimed collapse fraction and zero-underflow guarantee.
[Section 5] Section 5 (optimality characterization): the claim that S = 256 simultaneously satisfies the three listed criteria is scoped to bit-exact (2^k) scales, yet the text notes that a non-bit-exact scale such as 448 attains slightly higher coverage. The optimality statement should be restated with explicit scope to avoid implying global optimality.

minor comments (2)

[Abstract] The parameter δ_k is introduced as the expected within-sink-block score maximum with an approximate value; a precise definition or derivation equation should be given at first use.
[Experiments section] The paired-test results confirming saturation to the same precision floor would be clearer with a table of MSE values across forward, reverse, and combined configurations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and agree that both points warrant revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract (Φ expression) and corresponding derivation section] The leading-order underflow model (Abstract, paragraph describing the Φ expression): the fraction Φ(Δ + δ_k - 6.93 - ln S) and the derived threshold Δ_c assume that non-sink attention scores are distributed such that their underflow probability matches the standard normal tail. No histogram, QQ-plot, or Kolmogorov-Smirnov test validating normality in the critical tail region is supplied; the kernel-faithful experiments report only aggregate MSE. This assumption is load-bearing for the claimed collapse fraction and zero-underflow guarantee.

Authors: We agree that the derivation relies on a normality assumption for non-sink attention scores in the tail region and that this is load-bearing for the closed-form collapse fraction. While the kernel-faithful experiments produce MSE results consistent with the model predictions, we acknowledge the absence of direct statistical validation. In the revised manuscript we will add QQ-plots of the relevant score distributions together with a Kolmogorov-Smirnov test focused on the critical tail to substantiate the assumption. revision: yes
Referee: [Section 5] Section 5 (optimality characterization): the claim that S = 256 simultaneously satisfies the three listed criteria is scoped to bit-exact (2^k) scales, yet the text notes that a non-bit-exact scale such as 448 attains slightly higher coverage. The optimality statement should be restated with explicit scope to avoid implying global optimality.

Authors: The abstract already qualifies the optimality claim as holding among bit-exact (2^k) scales and explicitly notes that a non-bit-exact scale such as 448 attains slightly higher coverage. To prevent any ambiguity, we will revise the wording in Section 5 to restate this scope with equal explicitness in the main text. revision: yes

Circularity Check

0 steps flagged

No circularity: model derives from explicit distributional assumption and constructive characterization, not self-fit or self-citation

full rationale

The leading-order expression Φ(Δ + δ_k - 6.93 - ln S) is presented as following from a normal-tail model of non-sink scores together with the FP8 underflow threshold; the constant 6.93 is introduced as part of that derivation rather than obtained by fitting the target collapse fraction. The optimality claim for S=256 is supported by three independent constructive properties (IEEE bit-exact scaling, sawtooth dp(S) lower envelope, and maximal normal-range coverage among 2^k scales) with no reduction to a fitted parameter or prior self-result. No self-citations appear in the load-bearing steps. The paper explicitly flags the normality assumption as the weakest link, rendering the claim falsifiable rather than tautological by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full list of assumptions and parameters cannot be extracted. The visible elements are the normal-tail model and the fitted δ_k shift.

free parameters (1)

δ_k = ~1
Expected within-sink-block score maximum (≈1 for k_sink=4) used inside the underflow fraction Φ(...).

axioms (1)

domain assumption Non-sink attention scores are distributed such that their underflow probability after scaling is given by the normal tail Φ(Δ + δ_k - 6.93 - ln S).
Invoked to obtain the leading-order collapse fraction; location: abstract paragraph describing P-collapse.

pith-pipeline@v0.9.1-grok · 5970 in / 1606 out tokens · 19666 ms · 2026-06-28T07:53:34.255285+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veli ˇckovi´c, and Razvan Pascanu. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732, 2025

work page arXiv 2025
[2]

FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

Tri Dao. FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

2024
[3]

Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in Neural Information Processing Systems, 35, 2022

2022
[4]

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.arXiv preprint arXiv:2407.08608, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

2024
[9]

SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. arXiv preprint arXiv:2411.10958, 2024

work page arXiv 2024
[10]

SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025

Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, and Jianfei Chen. SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025. A Completedp(S)Verification We verify Theorem 6 by exhaustive computation over all 126 positive E4M3 values: for each S, take the binade w...

work page arXiv 2025

[1] [1]

Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veli ˇckovi´c, and Razvan Pascanu. Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732, 2025

work page arXiv 2025

[2] [2]

FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

Tri Dao. FlashAttention-2: Faster attention with better par- allelism and work partitioning.International Conference on Learning Representations, 2024

2024

[3] [3]

Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.Advances in Neural Information Processing Systems, 35, 2022

2022

[4] [4]

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision.arXiv preprint arXiv:2407.08608, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with atten- tion sinks.International Conference on Learning Representa- tions, 2024

2024

[9] [9]

SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. arXiv preprint arXiv:2411.10958, 2024

work page arXiv 2024

[10] [10]

SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025

Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, and Jianfei Chen. SageAt- tention2++: A more efficient implementation of SageAtten- tion2.arXiv preprint arXiv:2505.21136, 2025. A Completedp(S)Verification We verify Theorem 6 by exhaustive computation over all 126 positive E4M3 values: for each S, take the binade w...

work page arXiv 2025