
arxiv: 2605.16839 · v1 · pith:XARKISWTnew · submitted 2026-05-16 · 💻 cs.CL

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords: chunked prefill, sparse attention, KV cache, block selection, long context, GQA, paged attention, attention acceleration

The pith

Block-union KV selection builds minimal tables so chunked prefill attention runs up to 2.72 times faster while staying close to dense accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn 2D block-sparse masks into compact per-group KV block tables that work under paged execution during repeated chunked prefill. It does this by first unioning across query blocks then unioning within each attention group, so the resulting tables contain every needed KV block without extra copies or missed entries. A reader would care because chunked prefill is now standard for long-context serving, yet earlier sparse methods either lose efficiency when queries are short or force costly token copying at every chunk.

Core claim

CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel plans. It converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction.
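
The two unions described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the nested-list mask layout `mask[head][q_block][kv_block]`, the function name `block_union_tables`, and the assumption that the query heads of one GQA group are contiguous are all hypothetical choices made for the example.

```python
from typing import List, Set

# Assumed layout: mask[h][q][k] is True when query head h, Q-block q selects KV block k.
Mask = List[List[List[bool]]]


def block_union_tables(mask: Mask, group_size: int) -> List[List[int]]:
    """Return one sorted KV block table per GQA group (hypothetical helper)."""
    if len(mask) % group_size != 0:
        raise ValueError("number of query heads must be divisible by group_size")

    # Step 1, Q-block union: for each head, collect every KV block
    # selected by any of its Q-blocks.
    per_head: List[Set[int]] = []
    for head in mask:
        selected: Set[int] = set()
        for row in head:
            selected.update(k for k, on in enumerate(row) if on)
        per_head.append(selected)

    # Step 2, intra-group union: merge the per-head sets of all query heads
    # that share a KV head (assumed to be contiguous in head order).
    tables: List[List[int]] = []
    for start in range(0, len(per_head), group_size):
        group: Set[int] = set()
        for head_set in per_head[start:start + group_size]:
            group |= head_set
        tables.append(sorted(group))
    return tables


T, F = True, False
# Toy mask: 4 query heads, 2 Q-blocks each, 6 KV blocks.
toy_mask: Mask = [
    [[T, F, F, F, F, F], [T, F, T, F, F, F]],  # head 0
    [[T, F, F, F, F, F], [F, F, F, F, F, T]],  # head 1
    [[F, T, F, F, F, F], [F, T, F, F, F, F]],  # head 2
    [[F, F, F, T, F, F], [F, T, F, F, F, F]],  # head 3
]
toy_tables = block_union_tables(toy_mask, group_size=2)
print(toy_tables)  # [[0, 2, 5], [1, 3]]
```

In the toy mask, KV block 5 is selected by a single Q-block of head 1 only, yet it appears in the table of group 0, which is the preservation property the claim relies on.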

What carries the argument

Block-Union KV Selection, the two-step union process that converts input masks into the smallest GQA-aware per-group KV block tables while keeping every selected block.

If this is right

  • Selected KV blocks can be read directly from paged memory without a separate compaction step.
  • Attention computation achieves up to 2.72 times speedup at 128K context length under chunked prefill.
  • Accuracy stays close to full dense attention on the RULER benchmark for the tested model.
  • The same tables support repeated chunk processing without repeating expensive fine-grained searches.
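
The first point, reading selected KV blocks in place, can be illustrated with a toy reference computation. The sketch below is an assumption-laden stand-in for a paged attention kernel, written in plain Python for a single query vector: the page pool layout (`k_pages[p]` is a list of key vectors stored in page `p`), the function name, and the demo data are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def paged_attention_one_query(
    q: Vector,
    k_pages: Sequence[Sequence[Vector]],
    v_pages: Sequence[Sequence[Vector]],
    block_table: Sequence[int],
) -> Vector:
    """Softmax attention for one query over only the pages listed in block_table."""
    if not block_table:
        raise ValueError("block_table must select at least one page")
    scale = 1.0 / math.sqrt(len(q))
    scores: List[float] = []
    values: List[Vector] = []
    for page in block_table:
        # Pages are indexed through the table; the K/V payloads stay where
        # they are in the pool (only references are collected here).
        for k_vec, v_vec in zip(k_pages[page], v_pages[page]):
            scores.append(scale * sum(a * b for a, b in zip(q, k_vec)))
            values.append(v_vec)
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Demo pool with two pages holding one token each (hypothetical data).
demo_k_pages = [[[1.0, 0.0]], [[0.0, 1.0]]]
demo_v_pages = [[[1.0, 2.0]], [[3.0, 4.0]]]
only_page_1 = paged_attention_one_query([1.0, 0.0], demo_k_pages, demo_v_pages, [1])
both_pages = paged_attention_one_query([1.0, 0.0], demo_k_pages, demo_v_pages, [0, 1])
print(only_page_1)  # [3.0, 4.0]
```

A table listing every page reproduces dense attention, so the block table is the only thing that changes between the dense and the sparse path in this sketch.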

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The union construction might apply to other block-sparse mask generators beyond the ones tested here.
  • Serving systems could combine this table-building step with existing page eviction policies to further cut memory traffic.
  • The approach may scale to even longer contexts if the per-group tables stay small relative to total KV length.
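
The last point can be made concrete with a small density calculation. The sketch below rests on a simplifying assumption that is not a claim from the paper: attention cost is taken to be proportional to the number of KV blocks read, so the reciprocal of the table density is only an idealized upper bound on speedup. The function names are hypothetical.

```python
from typing import Sequence


def table_density(table: Sequence[int], num_kv_blocks: int) -> float:
    """Fraction of all KV blocks that a per-group table keeps."""
    if num_kv_blocks <= 0:
        raise ValueError("num_kv_blocks must be positive")
    return len(set(table)) / num_kv_blocks


def ideal_speedup(table: Sequence[int], num_kv_blocks: int) -> float:
    """Upper bound on attention speedup IF cost scaled linearly with blocks read."""
    density = table_density(table, num_kv_blocks)
    if density == 0.0:
        raise ValueError("empty table: the bound is undefined")
    return 1.0 / density


print(table_density([0, 2, 5], 6))  # 0.5
print(ideal_speedup([0, 2, 5], 6))  # 2.0
```

Because each union can only add blocks, density never decreases when more Q-blocks or more heads are merged into one table, which is why the table size relative to total KV length matters at longer contexts.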

Load-bearing premise

The Q-block and intra-group unions always produce the smallest tables that still contain every KV block any query in the group actually needs under paged constraints.
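
Stated formally, the premise has two parts. The notation below ($M$, $S_h$, $T_g$, $T'$) is introduced here for illustration and is not taken from the paper. Both properties follow from the definition of set union, and both are relative to the input mask: they say nothing about whether the mask itself captures every block a query needs.

```latex
% M_{h,q,k} = 1 iff query head h, Q-block q selects KV block k in the input mask.
S_h = \bigcup_{q} \{\, k : M_{h,q,k} = 1 \,\}          % Q-block union
T_g = \bigcup_{h \in g} S_h                             % intra-group union

% Completeness: every block selected by any query head in group g is in the table.
M_{h,q,k} = 1 \;\wedge\; h \in g \;\Longrightarrow\; k \in T_g

% Minimality among tables shared by the whole group:
% any table T' that is complete for group g must contain T_g.
\bigl( \forall h \in g,\ \forall q,\ \forall k :\ M_{h,q,k} = 1 \Rightarrow k \in T' \bigr)
\;\Longrightarrow\; T_g \subseteq T'
```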

What would settle it

Compare the generated per-group KV block tables against the full set of query-specific blocks chosen by the original masks on a single chunk and check whether any required block is absent.
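
A check of this kind could look like the following sketch. It assumes a nested-list mask `mask[head][q_block][kv_block]`, one block table per GQA group, and contiguous head-to-group assignment (`head // group_size`); these conventions and both function names are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[List[bool]]]  # assumed layout: mask[head][q_block][kv_block]


def missing_blocks(mask: Mask, tables: List[List[int]], group_size: int) -> List[Tuple[int, int, int]]:
    """(head, q_block, kv_block) triples selected by the mask but absent from the group's table."""
    missing = []
    for h, head in enumerate(mask):
        table = set(tables[h // group_size])  # assumes contiguous heads per group
        for q, row in enumerate(head):
            for k, on in enumerate(row):
                if on and k not in table:
                    missing.append((h, q, k))
    return missing


def superfluous_blocks(mask: Mask, tables: List[List[int]], group_size: int) -> List[Tuple[int, int]]:
    """(group, kv_block) pairs listed in a table although no query in the group selects them."""
    extra = []
    for g, table in enumerate(tables):
        heads = mask[g * group_size:(g + 1) * group_size]
        for k in table:
            if not any(row[k] for head in heads for row in head):
                extra.append((g, k))
    return extra


T, F = True, False
# One GQA group with two heads, one Q-block each, four KV blocks.
example_mask: Mask = [[[T, F, F, F]], [[F, F, T, F]]]
good_tables = [[0, 2]]
bad_tables = [[0, 3]]  # drops block 2, adds unused block 3

print(missing_blocks(example_mask, good_tables, 2))     # []
print(missing_blocks(example_mask, bad_tables, 2))      # [(1, 0, 2)]
print(superfluous_blocks(example_mask, bad_tables, 2))  # [(0, 3)]
```

An empty result from `missing_blocks` supports completeness for that chunk, and an empty result from `superfluous_blocks` supports minimality with respect to the mask.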

Figures

Figures reproduced from arXiv: 2605.16839 by Beomseok Kang, Dongwon Jo, Jae-Joon Kim, Jiwon Song.

Figure 1: (a) CompactAttention achieves the best accuracy–speedup trade-off. (b) Block-sparse kernels under chunked prefill (Q ≪ KV) fall far below one-shot and ideal speedups. (c) Pattern search cost accumulates across chunks, with XAttention incurring the highest overhead.

Figure 2: (a) KV-position rankings obtained by aggregating the attention each KV position receives from query positions in the shown window. Mean received attention ranks KV positions by their average received attention across all queries in the shown window, emphasizing globally important KV positions. Max received attention ranks KV positions by the largest attention received from any query within the window, expo…

Figure 3: Overview of CompactAttention. The KV selection stage converts a 2D per-head block mask into per-group KV block tables through Q-block union and intra-group union. The execution stage passes these block tables to a paged attention kernel, which accesses selected KV pages in place without explicit KV compaction.

Figure 4: KV cache layout comparison. Sequence-major layout forces KV heads to share one block table, preventing independent block selection. KV-head-major layout exposes each KV-head block as a page, enabling independent KV block tables without copying K/V payloads.

Figure 5: Attention and end-to-end speedup under chunked prefill. We report speedup over dense attention on LLaMA-3.1-8B-Instruct across context lengths. (a) RTX PRO 6000 GPUs with TP=2, batch size 4, and chunk size 512. (b) H200 SXM GPUs with TP=2, batch size 8, and chunk size 1024. CompactAttention achieves the largest gains at long context lengths, and the attention-level improvements translate into end-to-end la…

Figure 6: LongBench V2 accuracy on LLaMA-3.1-8B-Instruct with chunk size 1024. CompactAttention variants remain close to dense attention across difficulty levels and context-length groups, while QUOKA degrades more noticeably, especially on Hard samples.

Figure 7: (a) Sparsity at the selected operating point. (b) Accuracy–speedup trade-off under α sweep on RULER 128K (RTX PRO 6000, TP=2, batch size 4, chunk size 1024). (c) Execution-only ablation at matched sparsity using the same unioned block mask (RTX PRO 6000, 128K, batch size 4, chunk size 512).
Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72$\times$ attention speedup at 128K context length under chunked prefill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CompactAttention, a chunked-prefill attention mechanism that treats 2D block-sparse masks as KV-selection signals and converts them into GQA-aware per-group KV block tables via Q-block union followed by intra-group union. This construction is claimed to yield minimal tables that preserve all KV blocks selected by the input masks under paged execution constraints, avoiding explicit KV compaction and sparse-kernel overheads. On LLaMA-3.1-8B-Instruct, the method maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72× attention speedup at 128K context length.

Significance. If the completeness and minimality of the two-stage union construction hold across mask configurations, CompactAttention would address a practical gap in efficient long-context serving by enabling in-place paged access without the query-subsampling losses or copy overheads of prior chunked-prefill methods. The empirical speedup at 128K is a concrete strength, and the GQA-aware design aligns with modern model architectures.

major comments (2)
  1. [§3.2] §3.2 (Block-Union KV Selection): The central claim that Q-block union plus intra-group union produces minimal per-group tables preserving every query-specific KV block from the 2D mask is load-bearing for the accuracy result. The description presents this as a construction rather than proving an invariant; it is unclear whether block-granularity union can exclude a block required by only one query within a GQA group under arbitrary paged KV layouts. A single counter-example mask would falsify the preservation guarantee.
  2. [§4.3] §4.3 (RULER experiments): Accuracy is reported as 'close to dense' without per-task breakdowns, variance across seeds, or ablations isolating the effect of intra-group union. This makes it difficult to confirm that no query-specific KV blocks were dropped in the evaluated configurations.
minor comments (2)
  1. [Abstract / §4.1] The abstract and §4.1 should explicitly state the chunk size, block size, and exact KV cache paging configuration used for the 2.72× measurement to enable direct reproduction.
  2. [§3.2] Notation for the per-group block tables (e.g., how GQA group size interacts with the union operators) could be formalized with a small pseudocode listing to reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of CompactAttention for chunked prefill. We respond to each major comment below and will revise the manuscript accordingly to strengthen the justification of the Block-Union construction and the experimental validation.

  1. Referee: [§3.2] §3.2 (Block-Union KV Selection): The central claim that Q-block union plus intra-group union produces minimal per-group tables preserving every query-specific KV block from the 2D mask is load-bearing for the accuracy result. The description presents this as a construction rather than proving an invariant; it is unclear whether block-granularity union can exclude a block required by only one query within a GQA group under arbitrary paged KV layouts. A single counter-example mask would falsify the preservation guarantee.

    Authors: We appreciate the referee highlighting the importance of rigorously establishing the preservation invariant. The construction proceeds in two stages: (1) Q-block union computes the set union of all KV blocks required by queries within each Q-block according to the 2D mask; (2) intra-group union then takes the union across all query heads belonging to the same GQA group. Because both steps are set-union operations, any KV block selected by even a single query within the group is retained in the final per-group table. The resulting tables are therefore minimal (no superfluous blocks) and complete (no required blocks omitted) with respect to the input mask, and this property is independent of the concrete paging layout since selection operates on block indices. We will revise §3.2 to state this invariant explicitly and include a short proof sketch together with a worked example demonstrating that a block needed by only one query is still preserved. We are also prepared to add a brief verification that no counter-example mask exists under the stated construction. revision: yes

  2. Referee: [§4.3] §4.3 (RULER experiments): Accuracy is reported as 'close to dense' without per-task breakdowns, variance across seeds, or ablations isolating the effect of intra-group union. This makes it difficult to confirm that no query-specific KV blocks were dropped in the evaluated configurations.

    Authors: We agree that more granular reporting would make the empirical validation more convincing. In the revised version we will add per-task accuracy tables for the RULER benchmark, include standard deviations from multiple random seeds where feasible, and provide an ablation that compares the full two-stage Block-Union against a variant that omits the intra-group union step. These additions will allow readers to directly observe that accuracy remains close to dense attention and that the intra-group union step does not drop query-specific blocks in the tested configurations. revision: yes
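
The proposed ablation can be previewed on a toy mask. The source does not say how the variant without intra-group union would be built, so the sketch below assumes one hypothetical definition: every head in a group reuses the Q-block-union table of the group's first head. The mask layout and function names are likewise assumptions made for illustration.

```python
from typing import Dict, List, Set

Mask = List[List[List[bool]]]  # assumed layout: mask[head][q_block][kv_block]


def per_head_union(mask: Mask) -> List[Set[int]]:
    """Q-block union only: KV blocks selected by any Q-block of each head."""
    return [{k for row in head for k, on in enumerate(row) if on} for head in mask]


def dropped_without_group_union(mask: Mask, group_size: int) -> Dict[int, List[int]]:
    """Blocks lost per group if the group reused only its first head's table.

    Hypothetical ablation variant; the paper's actual variant may differ.
    """
    per_head = per_head_union(mask)
    dropped: Dict[int, List[int]] = {}
    for g in range(len(mask) // group_size):
        heads = per_head[g * group_size:(g + 1) * group_size]
        full = set().union(*heads)  # what intra-group union would keep
        dropped[g] = sorted(full - heads[0])
    return dropped


T, F = True, False
ablation_mask: Mask = [
    [[T, F, F, F, F, F], [T, F, T, F, F, F]],  # head 0 selects {0, 2}
    [[T, F, F, F, F, F], [F, F, F, F, F, T]],  # head 1 selects {0, 5}
    [[F, T, F, F, F, F], [F, T, F, F, F, F]],  # head 2 selects {1}
    [[F, F, F, T, F, F], [F, T, F, F, F, F]],  # head 3 selects {1, 3}
]
dropped = dropped_without_group_union(ablation_mask, group_size=2)
print(dropped)  # {0: [5], 1: [3]}
```

Under this assumed variant, block 5 in group 0 and block 3 in group 1 would be invisible to the heads that selected them, which is the kind of loss a per-task accuracy comparison would need to surface.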

Circularity Check

0 steps flagged

No significant circularity in derivation or claims


The paper describes CompactAttention via an explicit algorithmic construction (Q-block union followed by intra-group union) that converts input 2D block-sparse masks into GQA-aware per-group KV block tables. This is presented as a design that produces minimal tables preserving all mask-selected blocks under paged constraints, with accuracy and speedup results reported as empirical outcomes on the RULER benchmark for LLaMA-3.1-8B-Instruct. No steps reduce by construction to fitted parameters, self-citations, or tautological renaming; the central preservation claim is an asserted property of the union procedure rather than a self-referential derivation, and results are externally benchmarked rather than internally forced. The derivation chain remains self-contained against the stated empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about mask quality and paged memory behavior.

pith-pipeline@v0.9.0 · 5801 in / 1135 out tokens · 27345 ms · 2026-05-19T21:15:41.130579+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 9 internal anchors

  1. [1]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  2. [2]

    System card: Claude Opus 4.6

    Anthropic. System card: Claude Opus 4.6. Technical report, Anthropic, February 2026. Accessed: 2026-04-29

  3. [3]

    Gemini 3 Pro model card

    Google DeepMind. Gemini 3 Pro model card. Technical report, Google DeepMind, November 2025. Model card update: December 2025. Accessed: 2026-04-29

  4. [4]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  5. [5]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  6. [6]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023

  7. [7]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

  8. [8]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  9. [9]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024

  10. [10]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  11. [11]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  12. [12]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024

  13. [13]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  14. [14]

    Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766, 2025

  15. [15]

    Xattention: Block sparse attention with antidiagonal scoring

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

  16. [16]

    Seerattention: Learning intrinsic sparse attention in your llms

    Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, et al. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024

  17. [17]

    Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling

    Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, and Ran He. Flashprefill: Instantaneous pattern discovery and thresholding for ultra-fast long-context prefilling. arXiv preprint arXiv:2603.06199, 2026

  18. [18]

    Quoka: Query-oriented kv selection for efficient llm prefill

    Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, and Harper Langston. Quoka: Query-oriented kv selection for efficient llm prefill. arXiv preprint arXiv:2602.08722, 2026

  19. [19]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  20. [20]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  21. [21]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  22. [22]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  23. [23]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204, 2024

  24. [24]

    Flashinfer: Efficient and customizable attention engine for llm inference serving

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems, 7, 2025

  25. [25]

    MoBA: Mixture of Block Attention for Long-Context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189, 2025

  26. [26]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025

  27. [27]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  28. [28]

    Quest: query-aware sparsity for efficient long-context llm inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: query-aware sparsity for efficient long-context llm inference. In Proceedings of the 41st International Conference on Machine Learning, pages 47901–47911, 2024.

A Related Work

A.1 Chunked Prefill

Chunked prefill was first proposed by Sarathi [6], which splits prefill...