ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Pith reviewed 2026-05-25 05:27 UTC · model grok-4.3
The pith
Computing only 5% of query-key blocks in FP16 recovers on average 89.1% of the FP4-to-FP16 quality gap in long-context attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The output impact of quantisation error is highly non-uniform: it increases with the importance of each query-key interaction, so functionally relevant error concentrates in a small number of attention blocks that contain the most important tokens. ThriftAttention exploits this in two stages. A heuristic rapidly selects a small number of important query-key block pairs; the selected blocks are then computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. Across long-context benchmarks and model families, elevating only 5% of blocks to FP16 recovers on average 89.1% of the FP4-to-FP16 performance gap.
What carries the argument
A two-stage selective mixed-precision mechanism: a heuristic elevates only the most important query-key block pairs to FP16, the remaining blocks stay in FP4, and the two results are merged with online softmax.
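The merge step can be illustrated with a minimal sketch of the standard online-softmax combination of two partial attention results for a single query. The function names and the toy numbers are assumptions for illustration and are not taken from the paper's code. Both partitions use exact scores here, so the sketch only shows that the merge itself introduces no error; in the method described above, one partition would carry FP4-computed scores.

```python
import math

def partial_attention(scores, values):
    """Unnormalised attention over one subset of keys.

    Returns (m, l, acc): the max score, the sum of exp(score - m),
    and the exp-weighted sum of the value vectors.
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, acc

def merge(part_a, part_b):
    """Combine two partial results into one normalised output vector."""
    m_a, l_a, acc_a = part_a
    m_b, l_b, acc_b = part_b
    m = max(m_a, m_b)
    # Rescale each partition to the shared maximum before adding.
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    l = scale_a * l_a + scale_b * l_b
    return [(scale_a * x + scale_b * y) / l for x, y in zip(acc_a, acc_b)]

def full_attention(scores, values):
    """Reference: ordinary softmax attention over all keys at once."""
    _, l, acc = partial_attention(scores, values)
    return [x / l for x in acc]

# Toy data (hypothetical): four keys, two-dimensional values.
scores = [2.0, -1.0, 0.5, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
high = [0, 3]  # keys routed to the high-precision path
low = [1, 2]   # keys routed to the low-precision path

merged = merge(
    partial_attention([scores[i] for i in high], [values[i] for i in high]),
    partial_attention([scores[i] for i in low], [values[i] for i in low]),
)
reference = full_attention(scores, values)
```

Because each partition carries its own maximum and normaliser, the two paths can run independently and be combined afterwards, which is what lets blocks of different precision share one output.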
If this is right
- Quality degradation from uniform FP4 attention is largely eliminated while retaining FP4 inference speed.
- The quality advantage over uniform FP4 grows as sequence length increases.
- The recovery percentage is consistent across different model families and long-context benchmarks.
- Near-FP16 output quality is reached by elevating only 5% of query-key blocks to FP16.
Where Pith is reading between the lines
- The same selective elevation logic could be tested on other low-bit precisions or on different matrix multiplications inside the transformer.
- Longer contexts could be handled without the quality drop that uniform FP4 currently produces.
- Refining the selection heuristic might allow an even smaller high-precision fraction while preserving the same recovery level.
Load-bearing premise
The rapid heuristic accurately identifies the small set of query-key block pairs whose quantisation error has the largest functional impact on the final output.
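This page quotes a block score of the form Ŝ_ij = q̄_i · k̄_j from the paper but does not define it. The sketch below shows one selector that could be built on such a score, under two assumptions that this page does not confirm: the bars denote mean pooling over the tokens of a block, and each query block keeps its top-k causal key blocks. The names `mean_pool` and `select_blocks` are hypothetical.

```python
def mean_pool(rows, block):
    """Average consecutive groups of `block` vectors into one vector each."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(r[d] for r in chunk) / len(chunk) for d in range(dim)])
    return pooled

def select_blocks(q, k, block, top_k):
    """For each query block, pick the top_k causal key blocks by pooled dot product.

    Assumes q and k have the same number of tokens (self-attention).
    """
    q_bar = mean_pool(q, block)
    k_bar = mean_pool(k, block)
    selected = []
    for i, q_vec in enumerate(q_bar):
        # Causal mask at block level: query block i sees key blocks 0..i.
        scores = {
            j: sum(a * b for a, b in zip(q_vec, k_bar[j]))
            for j in range(i + 1)
        }
        best = sorted(scores, key=scores.get, reverse=True)[:top_k]
        selected.append(sorted(best))
    return selected

# Toy data (hypothetical): six tokens, block size 2, so three blocks.
q = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
k = [[0, 2], [0, 2], [1, 0], [1, 0], [3, 0], [3, 0]]
picked = select_blocks(q, k, block=2, top_k=2)
```

The selector costs one dot product per block pair rather than one per token pair, which is consistent with the "rapid" description, but whether such a score tracks the actual FP4 output error is exactly the premise in question.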
What would settle it
An experiment on the same long-context benchmarks that measures quality recovery well below 89% of the FP4-to-FP16 gap when the 5% high-precision blocks are chosen by the stated heuristic.
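A small worked example makes the threshold concrete. It assumes that gap recovery is defined as the fraction of the FP4-to-FP16 score difference that the mixed-precision run closes; this page does not state the paper's exact definition, and the benchmark scores below are invented for illustration.

```python
def gap_recovery(score_fp4, score_fp16, score_mixed):
    """Fraction of the FP4-to-FP16 gap closed by the mixed-precision run."""
    gap = score_fp16 - score_fp4
    if gap == 0:
        raise ValueError("FP4 and FP16 scores are equal; recovery is undefined")
    return (score_mixed - score_fp4) / gap

# Hypothetical scores, not results from the paper.
recovery = gap_recovery(score_fp4=60.0, score_fp16=70.0, score_mixed=68.9)
```

Under this definition a replication that lands at, say, 65.0 on the same hypothetical scale would recover only half the gap, which is the kind of outcome that would count against the claim.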
Original abstract
Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ThriftAttention, a two-stage mixed-precision attention method for long-context inference. A heuristic first identifies a small set of important query-key block pairs (claimed to contain the most important tokens) for FP16 computation; the remaining blocks are computed in FP4. Both paths are merged using online softmax. The central empirical claim is that retaining only 5% of blocks in FP16 recovers on average 89.1% of the FP4-to-FP16 quality gap across long-context benchmarks and model families, with the advantage growing as sequence length increases. The code is released at the provided GitHub link.
Significance. If the heuristic reliably ranks blocks by the functional impact of their quantization error, the result would be a practical advance for FP4 attention on Blackwell-class GPUs, directly addressing the systematic quality degradation observed in prior block-scaled quantization work at long contexts. The public code release is a clear strength that enables reproducibility.
major comments (2)
- [Abstract / §3 (Method)] The paper provides no equation, pseudocode, or explicit definition for the first-stage heuristic that selects query-key blocks by 'importance of each query-key interaction.' Without this, it is impossible to verify whether the proxy actually correlates with output delta under FP4 vs. FP16 (the load-bearing assumption behind the 89.1% recovery figure).
- [§4 (Experiments)] No ablation is described that isolates the heuristic's accuracy (e.g., comparing the selected 5% blocks against an oracle selection based on actual per-block output error). This leaves open whether the reported recovery generalizes or depends on the specific heuristic chosen.
minor comments (1)
- [Abstract] The abstract states results 'across long-context benchmarks and model families' but does not list the exact models, datasets, or sequence lengths used; this should be stated explicitly in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
- Referee: [Abstract / §3 (Method)] The paper provides no equation, pseudocode, or explicit definition for the first-stage heuristic that selects query-key blocks by 'importance of each query-key interaction.' Without this, it is impossible to verify whether the proxy actually correlates with output delta under FP4 vs. FP16 (the load-bearing assumption behind the 89.1% recovery figure).
  Authors: The referee correctly notes that the manuscript does not supply an equation, pseudocode, or formal definition of the heuristic. Section 3 describes the selection criterion only at a high level (blocks whose quantization error has high functional impact on query-key interactions). We will add an explicit mathematical formulation together with pseudocode to §3 in the revision.
  Revision: yes
- Referee: [§4 (Experiments)] No ablation is described that isolates the heuristic's accuracy (e.g., comparing the selected 5% blocks against an oracle selection based on actual per-block output error). This leaves open whether the reported recovery generalizes or depends on the specific heuristic chosen.
  Authors: We agree that an oracle ablation (ranking blocks by measured per-block output error under FP4 versus FP16 and comparing overlap with the heuristic) is missing. The current §4 reports only end-to-end recovery; we will insert this controlled ablation in the revised experiments section.
  Revision: yes
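The overlap comparison described in this exchange could be scored along the following lines. This is a sketch only: the function names, the dictionary layout keyed by (query block, key block), and the toy numbers are all assumptions, and the per-block errors would in practice have to be measured by running each block in both precisions.

```python
def top_fraction(block_scores, fraction):
    """Return the ids of the highest-scoring blocks, keeping at least one."""
    count = max(1, round(fraction * len(block_scores)))
    ranked = sorted(block_scores, key=block_scores.get, reverse=True)
    return set(ranked[:count])

def oracle_overlap(heuristic_scores, oracle_errors, fraction=0.05):
    """Share of the oracle's top blocks that the heuristic also selects."""
    chosen = top_fraction(heuristic_scores, fraction)
    oracle = top_fraction(oracle_errors, fraction)
    return len(chosen & oracle) / len(oracle)

# Toy data (hypothetical): four block pairs keyed by (query block, key block).
heuristic_scores = {(0, 0): 0.9, (1, 0): 0.2, (1, 1): 0.8, (2, 1): 0.1}
oracle_errors = {(0, 0): 0.5, (1, 0): 0.4, (1, 1): 0.05, (2, 1): 0.01}
overlap = oracle_overlap(heuristic_scores, oracle_errors, fraction=0.5)
```

A high overlap would support the load-bearing premise directly, whereas a low overlap alongside good end-to-end recovery would suggest the result depends on something other than accurate ranking.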
Circularity Check
No circularity: empirical validation of heuristic method
The paper introduces ThriftAttention as a two-stage heuristic for selective FP16/FP4 attention computation and reports its quality recovery as an experimental outcome measured on long-context benchmarks. No equations, fitted parameters, or derivations are described that would make the 89.1% recovery figure equivalent to its inputs by construction. The selection heuristic is presented as a practical proxy without any self-referential definition or self-citation chain that bears the central claim. The result remains falsifiable via external benchmarks and does not reduce to renaming or ansatz smuggling.
Axiom & Free-Parameter Ledger
free parameters (1)
- fraction of blocks kept in FP16
axioms (1)
- domain assumption: Quantisation error impact is highly non-uniform across query-key interactions and concentrates in blocks containing the most important tokens.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear · Relation between the paper passage and the cited Recognition theorem.
  Passage: We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction... heuristic rapidly selects... $\hat{S}_{ij} = \bar{q}_i \cdot \bar{k}_j$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear · Relation between the paper passage and the cited Recognition theorem.
  Passage: by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native FP4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669.
- [2] Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. FP4 all the way: Fully quantized training of LLMs. arXiv preprint arXiv:2505.19115.
- [3] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [4] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
- [5] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- [6] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [7] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
- [8] Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, and Saravan Rajmohan. TurboAttention: Efficient attention approximation for high throughputs LLMs. arXiv preprint arXiv:2412.08585.
- [9] Efficient Memory Management for Large Language Model Serving with PagedAttention. doi: 10.1145/3600006.3613165. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems.
- [10] Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, et al. Ministral 3. arXiv preprint arXiv:2601.08584. doi: 10.48550/arXiv.2601.08584. URL https://arxiv.org/abs/2601.08584.
- [11] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In International Conference on Learning Representations.
- [12] NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, et al. Pretraining large language models with NVFP4. arXiv preprint arXiv:2509.25149.
- [13] doi: 10.1109/ISCA59077.2024.00019. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems.
- [14] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658.
- [15] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al. Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537.
- [16] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.
- [17] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- [18] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In International Conference on Learning Representations, 2025b. How...
- [19] doi: 10.18653/v1/2025.acl-long.1126. Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao. FlashAttention-4: Algorithm and kernel pipelining co-design for asymmetric hardware scaling. arXiv preprint arXiv:2603.05451.
- [20] Target FP16 budgets are 5%, 10%, and 25%. For a sequence with $n = \lfloor L/64 \rfloor$ key blocks, the integer top-$k$ is chosen by $\frac{kn - k(k-1)/2}{n(n+1)/2} \approx f$, rounded and clamped to $[1, n]$. Benchmarks. LongBench v1 is evaluated through lm-eval-harness on the English subset: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, sa...
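The budget rule in the excerpt above can be worked through numerically. One reading, which is an assumption and not stated on this page, is that attention is causal with n query blocks and n key blocks, so query block i sees i key blocks and a per-row top-k selects min(i, k) of them. The selected count is then kn − k(k−1)/2 out of n(n+1)/2 causal blocks. The sketch below solves that expression for k with the quadratic formula; the function names are hypothetical.

```python
import math

def selected_fraction(n, k):
    """Fraction of causal blocks selected when each row keeps its top k."""
    return (k * n - k * (k - 1) / 2) / (n * (n + 1) / 2)

def topk_for_budget(seq_len, f, block=64):
    """Integer top-k whose selected fraction is closest to the budget f.

    Solves k^2 - (2n + 1) k + f n (n + 1) = 0 for the smaller root,
    then rounds and clamps to [1, n] as the excerpt describes.
    """
    n = seq_len // block
    if n < 1:
        raise ValueError("sequence is shorter than one block")
    discriminant = (2 * n + 1) ** 2 - 4 * f * n * (n + 1)
    k = ((2 * n + 1) - math.sqrt(discriminant)) / 2
    return max(1, min(n, round(k)))

# Example: a 32768-token sequence has n = 512 blocks of 64 tokens.
k_5pct = topk_for_budget(32768, 0.05)
achieved = selected_fraction(512, k_5pct)
```

For short sequences the clamp dominates: with a single block the only valid choice is k = 1, so the achieved FP16 fraction can sit well above the nominal budget.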