CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3
The pith
Block-Union KV Selection builds minimal per-group KV block tables, so attention under chunked prefill runs up to 2.72 times faster at 128K context while staying close to dense accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel plans. It converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction.
What carries the argument
Block-Union KV Selection, the two-step union process that converts input masks into the smallest GQA-aware per-group KV block tables while keeping every selected block.
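The two-step union can be sketched in a few lines. This is an illustrative reading of the description above, not the paper's implementation: the function name `block_union_tables`, the mask layout `(num_q_heads, num_q_blocks, num_kv_blocks)`, and the assumption that consecutive query heads share one KV head are all hypothetical.

```python
import numpy as np

def block_union_tables(mask, group_size):
    """Build one sorted KV block table per GQA group from a block-sparse mask.

    mask: bool array, shape (num_q_heads, num_q_blocks, num_kv_blocks) (assumed layout)
    group_size: number of query heads sharing one KV head (assumed contiguous)
    """
    num_q_heads, _, num_kv_blocks = mask.shape
    assert num_q_heads % group_size == 0
    # Step 1, Q-block union: keep a KV block for a head if any Q-block selects it.
    per_head = mask.any(axis=1)                      # (num_q_heads, num_kv_blocks)
    # Step 2, intra-group union: keep it for the group if any head in the group needs it.
    per_group = per_head.reshape(-1, group_size, num_kv_blocks).any(axis=1)
    # Each table lists the selected logical KV block indices in ascending order.
    return [np.flatnonzero(row) for row in per_group]

# Toy mask: 4 query heads, 2 Q-blocks, 6 KV blocks, 2 heads per group.
mask = np.zeros((4, 2, 6), dtype=bool)
mask[0, 0, 0] = True
mask[0, 1, 3] = True
mask[1, 0, 5] = True   # needed by only one head of group 0
mask[2, 1, 2] = True
tables = block_union_tables(mask, group_size=2)
print([t.tolist() for t in tables])  # [[0, 3, 5], [2]]
```

Block 5 is selected by a single head in a single Q-block, yet it still appears in group 0's table, which is the preservation property the summary relies on.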
If this is right
- Selected KV blocks can be read directly from paged memory without a separate compaction step.
- Attention computation achieves up to a 2.72× speedup at 128K context length under chunked prefill.
- Accuracy stays close to full dense attention on the RULER benchmark for the tested model.
- The same tables support repeated chunk processing without repeating expensive fine-grained searches.
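The first point, reading selected blocks directly from paged memory, can be illustrated with a small reference routine. It is a sketch under stated assumptions: one logical KV block maps to one physical page, `page_table` and `attend_selected_pages` are hypothetical names, no causal mask is applied, and the table is non-empty. The loop uses a standard online-softmax accumulation so that no gathered KV buffer is ever built.

```python
import numpy as np

def attend_selected_pages(q, k_pages, v_pages, phys_table):
    """Attention for one head over only the listed physical pages (no causal mask).

    q: (num_q, d); k_pages, v_pages: (num_pages, page_size, d)
    phys_table: non-empty 1D int array of physical page IDs to visit.
    """
    num_q, d = q.shape
    running_max = np.full(num_q, -np.inf)
    running_sum = np.zeros(num_q)
    acc = np.zeros((num_q, d))
    for pid in phys_table:
        k, v = k_pages[pid], v_pages[pid]        # read one page where it lives
        scores = q @ k.T / np.sqrt(d)            # (num_q, page_size)
        new_max = np.maximum(running_max, scores.max(axis=1))
        scale = np.exp(running_max - new_max)    # rescale earlier partial results
        p = np.exp(scores - new_max[:, None])
        running_sum = running_sum * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ v
        running_max = new_max
    return acc / running_sum[:, None]

rng = np.random.default_rng(0)
page_table = np.array([7, 2, 5, 0])   # logical block i lives in physical page page_table[i]
k_pages = rng.normal(size=(8, 4, 16))
v_pages = rng.normal(size=(8, 4, 16))
q = rng.normal(size=(3, 16))
selected = np.array([0, 2, 3])        # logical blocks taken from a per-group table
phys = page_table[selected]           # physical pages 7, 5, 0
out = attend_selected_pages(q, k_pages, v_pages, phys)
```

Only the small index array `phys` is materialized; the KV pages themselves stay in the pool.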
Where Pith is reading between the lines
- The union construction might apply to other block-sparse mask generators beyond the ones tested here.
- Serving systems could combine this table-building step with existing page eviction policies to further cut memory traffic.
- The approach may scale to even longer contexts if the per-group tables stay small relative to total KV length.
Load-bearing premise
The Q-block and intra-group unions always produce the smallest tables that still contain every KV block any query in the group actually needs under paged constraints.
What would settle it
Compare the generated per-group KV block tables against the full set of query-specific blocks chosen by the original masks on a single chunk and check whether any required block is absent.
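The check described above could look roughly like the following. The helper names `missing_blocks` and `superfluous_blocks`, the mask layout, and the contiguous-head grouping are assumptions made for this sketch; the tables would come from whatever implementation is under test.

```python
import numpy as np

def missing_blocks(mask, tables, group_size):
    """(head, q_block, kv_block) triples selected by the mask but absent from the group's table."""
    table_sets = [set(int(b) for b in t) for t in tables]
    missing = []
    for h, qb, kb in np.argwhere(mask):
        if int(kb) not in table_sets[int(h) // group_size]:
            missing.append((int(h), int(qb), int(kb)))
    return missing

def superfluous_blocks(mask, tables, group_size):
    """(group, kv_block) table entries that no query block in the group selected."""
    needed = mask.any(axis=1).reshape(-1, group_size, mask.shape[-1]).any(axis=1)
    return [(g, int(b)) for g, t in enumerate(tables) for b in t if not needed[g, int(b)]]

# One GQA group with 2 heads, 2 Q-blocks, 4 KV blocks.
mask = np.zeros((2, 2, 4), dtype=bool)
mask[0, 0, 1] = True
mask[1, 1, 3] = True                                   # needed by only one head
print(missing_blocks(mask, [[1, 3]], group_size=2))    # []
print(missing_blocks(mask, [[1]], group_size=2))       # [(1, 1, 3)]
print(superfluous_blocks(mask, [[0, 1, 3]], group_size=2))  # [(0, 0)]
```

An empty result from both helpers on a chunk means the table is complete and contains nothing beyond the mask's selections for that group.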
Original abstract
Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length under chunked prefill.
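For readers unfamiliar with the setting, the sketch below shows dense chunked prefill in NumPy: each step carries only `chunk_size` queries while the visible KV prefix keeps growing, which is the regime the abstract describes. The function names are hypothetical and nothing here is sparse; it only illustrates that chunking the queries reproduces one-shot causal attention.

```python
import numpy as np

def causal_attention(q, k, v, q_offset=0):
    """Dense causal attention; query i sits at absolute position q_offset + i."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    q_pos = q_offset + np.arange(q.shape[0])[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)   # hide future keys
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

def chunked_prefill(q, k, v, chunk_size):
    """Process the prompt chunk by chunk against the KV cache accumulated so far."""
    outputs = []
    for start in range(0, q.shape[0], chunk_size):
        end = min(start + chunk_size, q.shape[0])
        # Few queries per step, but the KV prefix they attend to grows to `end`.
        outputs.append(causal_attention(q[start:end], k[:end], v[:end], q_offset=start))
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
full = causal_attention(q, k, v)
chunked = chunked_prefill(q, k, v, chunk_size=4)
print(np.allclose(full, chunked))  # True
```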
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CompactAttention, a chunked-prefill attention mechanism that treats 2D block-sparse masks as KV-selection signals and converts them into GQA-aware per-group KV block tables via Q-block union followed by intra-group union. This construction is claimed to yield minimal tables that preserve all KV blocks selected by the input masks under paged execution constraints, avoiding explicit KV compaction and sparse-kernel overheads. On LLaMA-3.1-8B-Instruct, the method maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length.
Significance. If the completeness and minimality of the two-stage union construction hold across mask configurations, CompactAttention would address a practical gap in efficient long-context serving by enabling in-place paged access without the query-subsampling losses or copy overheads of prior chunked-prefill methods. The empirical speedup at 128K is a concrete strength, and the GQA-aware design aligns with modern model architectures.
major comments (2)
- [§3.2] §3.2 (Block-Union KV Selection): The central claim that Q-block union plus intra-group union produces minimal per-group tables preserving every query-specific KV block from the 2D mask is load-bearing for the accuracy result. The description presents this as a construction rather than proving an invariant; it is unclear whether block-granularity union can exclude a block required by only one query within a GQA group under arbitrary paged KV layouts. A single counter-example mask would falsify the preservation guarantee.
- [§4.3] §4.3 (RULER experiments): Accuracy is reported as 'close to dense' without per-task breakdowns, variance across seeds, or ablations isolating the effect of intra-group union. This makes it difficult to confirm that no query-specific KV blocks were dropped in the evaluated configurations.
minor comments (2)
- [Abstract / §4.1] The abstract and §4.1 should explicitly state the chunk size, block size, and exact KV cache paging configuration used for the 2.72× measurement to enable direct reproduction.
- [§3.2] Notation for the per-group block tables (e.g., how GQA group size interacts with the union operators) could be formalized with a small pseudocode listing to reduce ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical value of CompactAttention for chunked prefill. We respond to each major comment below and will revise the manuscript accordingly to strengthen the justification of the Block-Union construction and the experimental validation.
-
Referee: [§3.2] §3.2 (Block-Union KV Selection): The central claim that Q-block union plus intra-group union produces minimal per-group tables preserving every query-specific KV block from the 2D mask is load-bearing for the accuracy result. The description presents this as a construction rather than proving an invariant; it is unclear whether block-granularity union can exclude a block required by only one query within a GQA group under arbitrary paged KV layouts. A single counter-example mask would falsify the preservation guarantee.
Authors: We appreciate the referee highlighting the importance of rigorously establishing the preservation invariant. The construction proceeds in two stages: (1) Q-block union computes the set union of all KV blocks required by queries within each Q-block according to the 2D mask; (2) intra-group union then takes the union across all query heads belonging to the same GQA group. Because both steps are set-union operations, any KV block selected by even a single query within the group is retained in the final per-group table. The resulting tables are therefore minimal (no superfluous blocks) and complete (no required blocks omitted) with respect to the input mask, and this property is independent of the concrete paging layout since selection operates on block indices. We will revise §3.2 to state this invariant explicitly and include a short proof sketch together with a worked example demonstrating that a block needed by only one query is still preserved. We are also prepared to add a brief verification that no counter-example mask exists under the stated construction.
revision: yes
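The invariant in this response can be written out explicitly. The symbols $M$, $G_g$, $S_{h,i}$, and $T_g$ are notation introduced here for illustration and may differ from the paper's; the minimality statement assumes that "paged execution constraints" means one table shared by the whole group across all Q-blocks of the chunk.

```latex
Let $M_{h,i,j} \in \{0,1\}$ be the block mask for query head $h$, Q-block $i$,
and KV block $j$, and let $G_g$ be the set of query heads in GQA group $g$. Define
\[
S_{h,i} = \{\, j : M_{h,i,j} = 1 \,\}, \qquad
T_g = \bigcup_{h \in G_g} \; \bigcup_{i} S_{h,i}.
\]
Completeness: for every $h \in G_g$ and every $i$, $S_{h,i} \subseteq T_g$,
because $S_{h,i}$ is one of the sets in the union.

Minimality: if a shared table $T'$ satisfies $S_{h,i} \subseteq T'$ for all
$h \in G_g$ and all $i$, then $T_g \subseteq T'$, because every $j \in T_g$
belongs to some $S_{h,i}$.
```

Both statements are relative to the block mask $M$ itself. They say nothing about whether $M$ captures every KV entry a query needs at token granularity, which is a separate question from the one the union construction answers.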
-
Referee: [§4.3] §4.3 (RULER experiments): Accuracy is reported as 'close to dense' without per-task breakdowns, variance across seeds, or ablations isolating the effect of intra-group union. This makes it difficult to confirm that no query-specific KV blocks were dropped in the evaluated configurations.
Authors: We agree that more granular reporting would make the empirical validation more convincing. In the revised version we will add per-task accuracy tables for the RULER benchmark, include standard deviations from multiple random seeds where feasible, and provide an ablation that compares the full two-stage Block-Union against a variant that omits the intra-group union step. These additions will allow readers to directly observe that accuracy remains close to dense attention and that the intra-group union step does not drop query-specific blocks in the tested configurations.
revision: yes
Circularity Check
No significant circularity in derivation or claims
The paper describes CompactAttention via an explicit algorithmic construction (Q-block union followed by intra-group union) that converts input 2D block-sparse masks into GQA-aware per-group KV block tables. This is presented as a design that produces minimal tables preserving all mask-selected blocks under paged constraints, with accuracy and speedup results reported as empirical outcomes on the RULER benchmark for LLaMA-3.1-8B-Instruct. No steps reduce by construction to fitted parameters, self-citations, or tautological renaming; the central preservation claim is an asserted property of the union procedure rather than a self-referential derivation, and results are externally benchmarked rather than internally forced. The derivation chain remains self-contained against the stated empirical evaluation.
Reference graph
Works this paper leans on
- [1] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.
- [2] Anthropic. System card: Claude Opus 4.6. Technical report, Anthropic, February 2026. Accessed: 2026-04-29.
- [3] Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. Model card update: December 2025. Accessed: 2026-04-29.
- [4] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
- [5] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026.
- [6] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.
- [7] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024.
- [8] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023.
- [9] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024.
- [10] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022.
- [11] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [12] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.
- [13] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024.
- [14] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766, 2025.
- [15] Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025.
- [16] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, et al. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024.
- [17] Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, and Ran He. Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling. arXiv preprint arXiv:2603.06199, 2026.
- [18] Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, and Harper Langston. Quoka: Query-oriented kv selection for efficient llm prefill. arXiv preprint arXiv:2602.08722, 2026.
- [19] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023.
- [20] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [21] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [22] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [23] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024.
- [24] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems, 7, 2025.
- [25] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189, 2025.
- [26] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
- [27] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
- [28] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: query-aware sparsity for efficient long-context llm inference. In Proceedings of the 41st International Conference on Machine Learning, pages 47901–47911, 2024.
A Related Work
A.1 Chunked Prefill
Chunked prefill was first proposed by Sarathi [6], which splits prefill...