PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Pith reviewed 2026-07-02 21:02 UTC · model grok-4.3
The pith
Page-aware workqueue scheduling over native block tables raises long-context LLM decode throughput by 1.04x to 1.40x on held-out workloads.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule that executes only non-empty row-KV-head-sequence-split tasks. With cost-model constants fixed on calibration traces, a calibrated roofline-style policy selects PersistentKV for B1 long-context steps and supported B8 long-context GQA steps. On held-out seeds, this improves mean wall decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads and by 1.40x on a B1 bucketed trace. B4 and uncalibrated GQA ratios are routed to the baseline kernel to avoid regressions.
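The routing behaviour described in the core claim can be sketched as a rule table. This is a simplified illustration, not the paper's policy: the real selector is a calibrated roofline-style cost model whose constants are not given here. The function name `route`, the path labels, the `LONG_CONTEXT_TOKENS` cutoff, and the `CALIBRATED_GQA` set are all assumptions.

```python
# Hypothetical values: the paper's policy is a calibrated roofline-style cost
# model; its constants and cutoffs are not given in this summary.
CALIBRATED_GQA = {(32, 8)}      # (Hq, Hkv) pairs assumed to have calibration data
LONG_CONTEXT_TOKENS = 8192      # assumed long-context cutoff, illustration only


def route(batch_size, max_context, hq, hkv):
    """Return an illustrative kernel path label for one decode step.

    batch_size:  number of active sequences in the step
    max_context: longest KV length in the step, in tokens
    hq, hkv:     query-head and KV-head counts
    """
    if (hq, hkv) not in CALIBRATED_GQA:
        return "flashinfer"               # uncalibrated GQA ratio: stay on baseline
    if max_context < LONG_CONTEXT_TOKENS:
        return "flashinfer"               # not a long-context step
    if batch_size == 1:
        return "persistentkv_seq_split"   # B1 long-context: sequence splitting
    if batch_size == 8:
        return "persistentkv_workqueue"   # supported B8 long-context GQA
    return "flashinfer"                   # B4 boundary case and anything else


print(route(batch_size=4, max_context=32768, hq=32, hkv=8))
```

The conservative default is the point of the design: any step outside the calibrated cases falls back to the baseline kernel, so the policy can only lose where it actively selects PersistentKV.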
What carries the argument
The compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks over native block tables.
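A minimal sketch of what a compact workqueue over a block table might look like, assuming each task is a (row, KV head, sequence split) triple plus the page range it covers. The function names, the `split_tokens` parameter, and the task layout are hypothetical; the paper's kernel-side data structure is not described in this summary.

```python
import math


def build_workqueue(seq_lens, num_kv_heads, split_tokens, page_size=16):
    """Enumerate only the non-empty (row, kv_head, split) tasks.

    seq_lens:     KV length in tokens for each active row (sequence)
    split_tokens: tokens per sequence split, assumed a multiple of page_size
    Each task is (row, kv_head, split, first_page, end_page), where the page
    indices are positions in that row's block table (end_page is exclusive).
    """
    if split_tokens % page_size != 0:
        raise ValueError("split_tokens must be a multiple of page_size")
    pages_per_split = split_tokens // page_size
    tasks = []
    for row, length in enumerate(seq_lens):
        num_pages = math.ceil(length / page_size)
        num_splits = math.ceil(num_pages / pages_per_split)
        for kv_head in range(num_kv_heads):
            for split in range(num_splits):
                first = split * pages_per_split
                end = min(first + pages_per_split, num_pages)
                tasks.append((row, kv_head, split, first, end))
    return tasks


def padded_task_count(seq_lens, num_kv_heads, split_tokens, page_size=16):
    """Task count of a dense grid padded to the longest row, for comparison."""
    pages_per_split = split_tokens // page_size
    max_pages = math.ceil(max(seq_lens) / page_size)
    max_splits = math.ceil(max_pages / pages_per_split)
    return len(seq_lens) * num_kv_heads * max_splits


lens = [64, 4096]  # one short row and one long row, in tokens
print(len(build_workqueue(lens, 8, 1024)), padded_task_count(lens, 8, 1024))
```

With one 64-token row and one 4096-token row, 8 KV heads, and 1024-token splits, the compact queue holds 40 tasks where a grid padded to the longest row would hold 64. Mixed-length batches are where skipping empty tasks matters.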
Load-bearing premise
The roofline-style policy calibrated on specific traces will continue to select the faster path on untested GQA ratios and workload distributions without causing regressions.
What would settle it
A new workload or GQA ratio where the policy selects PersistentKV yet measured throughput falls below the baseline kernel would falsify the central performance claim.
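The falsification test above can be written as a filter over measurement records. This is a sketch under assumed record keys (`selected`, `policy_tps`, `baseline_tps`), none of which come from the paper; the `tolerance` parameter is likewise an illustrative allowance for measurement noise.

```python
def find_regressions(records, tolerance=0.0):
    """Return records where the policy chose PersistentKV yet lost to baseline.

    Each record is a dict with hypothetical keys:
      "selected":     path the policy chose, e.g. "persistentkv_workqueue"
      "policy_tps":   measured decode tokens/s using the policy's choice
      "baseline_tps": measured decode tokens/s using the baseline kernel
    tolerance is a fractional noise allowance (0.02 ignores losses under 2%).
    """
    return [
        r for r in records
        if r["selected"].startswith("persistentkv")
        and r["policy_tps"] < r["baseline_tps"] * (1.0 - tolerance)
    ]


# Placeholder numbers for illustration only, not measurements from the paper.
runs = [
    {"selected": "persistentkv_workqueue", "policy_tps": 95.0, "baseline_tps": 100.0},
    {"selected": "flashinfer", "policy_tps": 100.0, "baseline_tps": 100.0},
]
print(find_regressions(runs))
```

Steps routed to the baseline are excluded by construction, since they cannot regress against themselves.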
Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce fragmentation, and mature kernels like FlashInfer provide highly optimized decode attention. However, the best single-kernel implementation is not always the best serving schedule: low-active long-context decode can under-utilize GPUs, while mixed sequence lengths introduce tension between many exact-length launches and coarse padded batches. We present PersistentKV, a native block-table decode attention engine and page-aware scheduling study for grouped-query attention (GQA). PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks. On an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128), a calibrated roofline-style policy selects FlashInfer for small active batches, PersistentKV sequence splitting for batch size 1 (B1) long-context steps, and PersistentKV workqueue scheduling for supported B8 long-context GQA steps. With cost-model constants fixed on calibration traces, five held-out seeds improve mean wall decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.40x on a B1 bucketed trace. For the B4 boundary case and uncalibrated GQA ratios, the policy avoids regressions by routing to FlashInfer. We also report an attention-plus-MLP timing proxy and workload counters showing workqueue scheduling reduces launch fan-out from 16.00 to 2.00 launches per step on held-out bimodal B8. These results show that work assignment is a decisive serving-system variable.
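To see why KV cache movement dominates at this configuration, the per-step read volume can be worked out from the stated parameters (FP16, page size 16, Hkv=8, d=128). The sketch below is arithmetic only; the helper names are hypothetical, the figures are per layer, and memory bandwidth is left as a caller-supplied assumption rather than a measured value.

```python
def kv_bytes_per_token(hkv=8, head_dim=128, dtype_bytes=2):
    """Bytes of K plus V stored per token, per layer (FP16 is 2 bytes)."""
    return 2 * hkv * head_dim * dtype_bytes


def decode_kv_bytes(context_tokens, **kw):
    """KV bytes one decode step must read for one sequence, per layer."""
    return context_tokens * kv_bytes_per_token(**kw)


def min_read_time_ms(context_tokens, bandwidth_gb_s, **kw):
    """Bandwidth-bound lower limit on per-layer attention time, in ms.

    bandwidth_gb_s is an assumed sustained bandwidth in GB/s (1e9 bytes/s).
    """
    return decode_kv_bytes(context_tokens, **kw) / (bandwidth_gb_s * 1e9) * 1e3


print(kv_bytes_per_token())        # 4096 bytes per token per layer
print(16 * kv_bytes_per_token())   # 65536 bytes per 16-token page
print(decode_kv_bytes(32768))      # 134217728 bytes (128 MiB) at 32K context
```

Each token costs 4 KiB of KV per layer and each 16-token page 64 KiB, so a single 32K-token sequence reads 128 MiB per layer on every decode step. With Hq=32 and Hkv=8, four query heads share each KV head, which is why mapping work by KV-head group lets one pass over those pages serve all four.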
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PersistentKV, a native block-table decode attention engine for grouped-query attention (GQA) that maps work by KV-head group and executes directly over page tables using a compact workqueue schedule for non-empty tasks. It proposes a calibrated roofline-style policy that selects between PersistentKV (sequence splitting for B1, workqueue for supported B8) and FlashInfer (small batches, B4, uncalibrated GQA), reporting 1.04x–1.08x mean wall-clock decode-token throughput gains on held-out seeds from bimodal/uniform/Zipf B8 workloads and 1.40x on a B1 bucketed trace, with an attention-plus-MLP proxy and launch-fan-out reduction from 16 to 2.
Significance. If the reported gains hold under the fixed cost-model constants, the work shows that page-aware work assignment and scheduling can yield measurable serving throughput improvements on commodity GPUs beyond single-kernel optimization. Strengths include the use of held-out seeds with frozen calibration constants and the conservative routing strategy that avoids regressions on uncalibrated cases. The narrow workload families and single GQA configuration (Hq=32, Hkv=8) limit broader claims about policy robustness.
major comments (2)
- [Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.
- [Abstract] Mean throughput improvements (1.04x–1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so it is impossible to judge whether the gains are statistically distinguishable from noise.
minor comments (2)
- Workload definitions (bimodal, uniform, Zipf-like, bucketed) and the exact calibration vs. held-out split are referenced but not fully specified, hindering reproducibility.
- The manuscript mentions an 'attention-plus-MLP timing proxy' and workload counters but does not show the corresponding figures or tables in the provided text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to incorporate additional details and statistical reporting.
- Referee: [Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.
Authors: We agree that the abstract would benefit from a concise definition of the B1 bucketed trace. This trace uses a bucketed sequence-length distribution designed to capture realistic long-context B1 serving patterns (with full generation details and statistics provided in the workload section of the paper). In revision we will add a short parenthetical description of the trace's length statistics and note its role as a representative B1 case, while retaining the 1.40x figure.
Revision: yes
- Referee: [Abstract] Mean throughput improvements (1.04x–1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so it is impossible to judge whether the gains are statistically distinguishable from noise.
Authors: We acknowledge that the abstract omits variability measures. The reported means reflect consistent gains across the five held-out seeds under frozen calibration constants. In the revision we will add either standard deviations or the per-seed range for the B8 workloads (and clarify that the B1 result is from a single representative trace) so readers can assess statistical distinguishability from noise.
Revision: yes
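The variability reporting promised in this response could take roughly the following form. This is a sketch: the function name and input layout are hypothetical, and it averages per-seed ratios, which may differ from how the paper computes its means.

```python
import statistics


def summarize_speedups(policy_tps, baseline_tps):
    """Summarize per-seed speedups of the policy over the baseline.

    policy_tps, baseline_tps: decode tokens/s per held-out seed, same order.
    Returns per-seed ratios plus their mean, sample standard deviation,
    minimum, and maximum.
    """
    if len(policy_tps) != len(baseline_tps):
        raise ValueError("need one baseline measurement per policy measurement")
    if len(policy_tps) < 2:
        raise ValueError("sample standard deviation needs at least two seeds")
    speedups = [p / b for p, b in zip(policy_tps, baseline_tps)]
    return {
        "per_seed": speedups,
        "mean": statistics.mean(speedups),
        "stdev": statistics.stdev(speedups),  # sample (n-1) standard deviation
        "min": min(speedups),
        "max": max(speedups),
    }


# Placeholder inputs for illustration only, not measurements from the paper.
print(summarize_speedups([2.0, 4.0], [1.0, 1.0]))
```

With only five seeds, reporting the minimum alongside the mean is the most direct answer to the referee's question: a gain of 1.04x is convincing only if the worst seed also stays above 1.0x.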
Circularity Check
No circularity; empirical results on held-out data after freezing calibration
The paper reports wall-clock throughput gains from a roofline policy whose constants are fitted once on calibration traces and then frozen; subsequent measurements use five held-out seeds from the same workload families. No equations, self-definitional mappings, or load-bearing self-citations appear in the provided text. The central claims rest on direct, independent execution-time measurements rather than any derivation that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- cost-model constants
axioms (1)
- domain assumption: The five held-out seeds and the bimodal/uniform/Zipf-like/B1 bucketed traces are representative of production long-context serving traffic.