PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Muhammad Ahmed

arxiv: 2606.26666 · v2 · pith:EL6A4QOTnew · submitted 2026-06-25 · 💻 cs.LG


Pith reviewed 2026-07-02 21:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords: LLM serving, KV cache, decode attention, page-aware scheduling, grouped-query attention, workqueue scheduling, long-context inference, GPU throughput

The pith

Page-aware workqueue scheduling over native block tables raises long-context LLM decode throughput by 1.04x to 1.40x on held-out workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PersistentKV as a block-table decode attention engine that maps work by KV-head group and runs only non-empty row-KV-head-sequence-split tasks through a compact workqueue. This targets under-utilization in low-active long-context decode and the tension between exact-length launches and padded batches in mixed workloads. A roofline-style policy, calibrated on traces, chooses PersistentKV for batch sizes 1 and 8 on supported GQA setups while routing other cases to an existing kernel. On five held-out seeds the approach improves mean wall-clock decode-token throughput without regressions on boundary cases.

Core claim

PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule that executes only non-empty row-KV-head-sequence-split tasks. With cost-model constants fixed on calibration traces, a roofline-style policy selects PersistentKV for B1 long-context steps and supported B8 long-context GQA steps. On held-out seeds this improves mean wall-clock decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads and by 1.40x on a B1 bucketed trace. B4 and uncalibrated GQA ratios are routed to the baseline kernel to avoid regressions.
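A sequence-split task attends over only a slice of one row's KV cache, so the partial results for a given row and KV head have to be recombined. The text here does not say how PersistentKV merges them. The sketch below assumes the log-sum-exp merge commonly used by split-KV decode kernels, written in pure Python for a single query vector; the function names are illustrative and not taken from the paper.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV slice.

    Returns the slice's normalised output and its log-sum-exp (LSE),
    which is all a later merge step needs.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom
           for j in range(dim)]
    return out, top + math.log(denom)

def merge_partials(partials):
    """Combine (output, lse) pairs computed on disjoint slices of one sequence."""
    top = max(lse for _, lse in partials)
    # Each slice is weighted by its share of the global softmax denominator.
    scales = [math.exp(lse - top) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[j] for s, (out, _) in zip(scales, partials)) / total
            for j in range(dim)]
```

Under this assumption, merging the partials of any split of a sequence gives the same result as attention over the whole sequence, up to floating-point rounding, which is what lets split tasks run independently.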

What carries the argument

The compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks over native block tables.
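To make "only non-empty tasks" concrete, here is a minimal host-side sketch of how such a queue could be built. It is an illustration under assumptions, not the paper's implementation: `Task`, `build_workqueue`, `padded_grid_size`, `physical_pages`, and the `pages_per_split` parameter are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass(frozen=True)
class Task:
    row: int         # batch row (sequence) index
    kv_head: int     # KV head; one task serves every query head in its group
    page_start: int  # first logical page of the slice (index into the block table)
    page_end: int    # one past the last logical page

def build_workqueue(seq_lens: Sequence[int], num_kv_heads: int,
                    page_size: int, pages_per_split: int) -> List[Task]:
    """Enumerate (row, KV head, sequence split) tasks that cover at least one page."""
    tasks = []
    for row, seq_len in enumerate(seq_lens):
        num_pages = -(-seq_len // page_size)  # ceiling division
        for kv_head in range(num_kv_heads):
            for start in range(0, num_pages, pages_per_split):
                end = min(start + pages_per_split, num_pages)
                tasks.append(Task(row, kv_head, start, end))
    return tasks

def padded_grid_size(seq_lens: Sequence[int], num_kv_heads: int,
                     page_size: int, pages_per_split: int) -> int:
    """Task count if every row were padded to the longest row's split count."""
    max_pages = max(-(-n // page_size) for n in seq_lens)
    splits = -(-max_pages // pages_per_split)
    return len(seq_lens) * num_kv_heads * splits

def physical_pages(block_tables: Sequence[Sequence[int]], task: Task) -> List[int]:
    """Resolve a task's logical page range through the row's block table."""
    return list(block_tables[task.row][task.page_start:task.page_end])
```

With mixed sequence lengths, the compact queue has fewer entries than the padded grid because short rows contribute few tasks and empty rows contribute none.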

Load-bearing premise

The roofline-style policy calibrated on specific traces will continue to select the faster path on untested GQA ratios and workload distributions without causing regressions.

What would settle it

A new workload or GQA ratio where the policy selects PersistentKV yet measured throughput falls below the baseline kernel would falsify the central performance claim.
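The falsification test above is simple to mechanize. The sketch below is one hypothetical way to do it: given per-step timings for the path the policy chose and for the baseline kernel on the same step, it returns the steps where a non-baseline choice was slower. The record fields, backend labels, and tolerance parameter are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepRecord:
    workload: str       # label for the workload or GQA configuration
    chosen: str         # backend the policy picked, e.g. "flashinfer"
    chosen_ms: float    # measured time of the chosen backend
    baseline_ms: float  # measured time of the baseline kernel on the same step

def find_regressions(records: List[StepRecord], tolerance: float = 0.0) -> List[StepRecord]:
    """Return steps where the policy left the baseline and ended up slower.

    `tolerance` is a relative slack (0.02 means 2%) to absorb timing noise.
    """
    return [
        r for r in records
        if r.chosen != "flashinfer"
        and r.chosen_ms > r.baseline_ms * (1.0 + tolerance)
    ]
```

A non-empty result on a new workload or GQA ratio would be the counterexample described above; an empty result only says the policy held up on the steps that were measured.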

Original abstract

Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce fragmentation, and mature kernels like FlashInfer provide highly optimized decode attention. However, the best single-kernel implementation is not always the best serving schedule: low-active long-context decode can under-utilize GPUs, while mixed sequence lengths introduce tension between many exact-length launches and coarse padded batches. We present PersistentKV, a native block-table decode attention engine and page-aware scheduling study for grouped-query attention (GQA). PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks. On an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128), a calibrated roofline-style policy selects FlashInfer for small active batches, PersistentKV sequence splitting for batch size 1 (B1) long-context steps, and PersistentKV workqueue scheduling for supported B8 long-context GQA steps. With cost-model constants fixed on calibration traces, five held-out seeds improve mean wall decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.40x on a B1 bucketed trace. For the B4 boundary case and uncalibrated GQA ratios, the policy avoids regressions by routing to FlashInfer. We also report an attention-plus-MLP timing proxy and workload counters showing workqueue scheduling reduces launch fan-out from 16.00 to 2.00 launches per step on held-out bimodal B8. These results show that work assignment is a decisive serving-system variable.
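The routing outcomes reported in the abstract can be written down as a small decision table. This is a sketch of the reported outcomes only, not the roofline cost model that produces them. The `long_context_threshold` parameter is hypothetical because the abstract gives no numeric cutoff, and applying the GQA-ratio check to B1 as well as B8 is an assumption.

```python
def choose_backend(batch_size: int, context_len: int,
                   num_q_heads: int, num_kv_heads: int,
                   long_context_threshold: int,
                   calibrated_gqa_ratios=(4,)) -> str:
    """Map a decode step to a backend label, following the abstract's summary.

    calibrated_gqa_ratios defaults to (4,), the ratio of the evaluated
    configuration (Hq=32, Hkv=8).
    """
    if num_q_heads % num_kv_heads != 0:
        return "flashinfer"
    if num_q_heads // num_kv_heads not in calibrated_gqa_ratios:
        return "flashinfer"  # uncalibrated GQA ratio: stay on the baseline
    if context_len < long_context_threshold:
        return "flashinfer"  # not a long-context step
    if batch_size == 1:
        return "persistentkv_split"      # B1: sequence splitting
    if batch_size == 8:
        return "persistentkv_workqueue"  # B8: workqueue scheduling
    return "flashinfer"                  # B4 and every other batch size
```

Falling through to the baseline for every case not explicitly covered mirrors the conservative routing the abstract describes.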

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PersistentKV adds a native block-table engine plus workqueue schedule for GQA decode and shows small held-out throughput gains via a calibrated policy, but the policy's behavior on new GQA ratios or workloads is unproven.

The paper's core contribution is a native block-table decode attention engine that maps work by KV-head group, runs directly on page tables, and uses a compact workqueue schedule for non-empty tasks. The authors pair this with a roofline-style policy that chooses between their engine and FlashInfer depending on batch size and workload. On an RTX 3060 they report 1.04-1.08x mean decode-token throughput on held-out B8 bimodal/uniform/Zipf traces and 1.40x on a B1 bucketed trace, with launch fan-out on the held-out bimodal B8 workload dropping from 16 to 2 launches per step.

The measurements are direct wall-clock numbers on held-out seeds after a one-time calibration fit, so the central claims are not circular. The workqueue reduction and the explicit routing of B4 and uncalibrated GQA cases back to FlashInfer are useful details.

The soft spots are real but not fatal. Gains are modest on the B8 cases that matter most for serving, the policy constants come from post-hoc selection on specific traces, and there are no error bars, full workload definitions, or released code. The stress-test concern lands: nothing shows the fixed cost model will keep picking the faster path when GQA head ratios or sequence statistics differ from the calibration set. The abstract's safeguard of routing uncertain cases to FlashInfer helps, but it also means the headline gains only apply where the model was already tuned.

This is for systems builders who already run paged attention and want to squeeze a few percent more out of long-context decode on commodity GPUs. The scheduling angle is new enough within that line of work to justify a serious referee, though the paper would need more validation on broader GQA ratios and workloads before the policy could be trusted in production.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PersistentKV, a native block-table decode attention engine for grouped-query attention (GQA) that maps work by KV-head group and executes directly over page tables using a compact workqueue schedule for non-empty tasks. It proposes a calibrated roofline-style policy that selects between PersistentKV (sequence splitting for B1, workqueue for supported B8) and FlashInfer (small batches, B4, uncalibrated GQA), reporting 1.04x–1.08x mean wall-clock decode-token throughput gains on held-out seeds from bimodal/uniform/Zipf B8 workloads and 1.40x on a B1 bucketed trace, with an attention-plus-MLP proxy and launch-fan-out reduction from 16 to 2.

Significance. If the reported gains hold under the fixed cost-model constants, the work shows that page-aware work assignment and scheduling can yield measurable serving throughput improvements on commodity GPUs beyond single-kernel optimization. Strengths include the use of held-out seeds with frozen calibration constants and the conservative routing strategy that avoids regressions on uncalibrated cases. The narrow workload families and single GQA configuration (Hq=32, Hkv=8) limit broader claims about policy robustness.

major comments (2)

[Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.

[Abstract] Mean throughput improvements (1.04x–1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so it is impossible to judge whether the gains are statistically distinguishable from noise.

minor comments (2)

Workload definitions (bimodal, uniform, Zipf-like, bucketed) and the exact calibration vs. held-out split are referenced but not fully specified, hindering reproducibility.
The manuscript mentions an 'attention-plus-MLP timing proxy' and workload counters but does not show the corresponding figures or tables in the provided text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to incorporate additional details and statistical reporting.

Referee: [Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.

Authors: We agree that the abstract would benefit from a concise definition of the B1 bucketed trace. This trace uses a bucketed sequence-length distribution designed to capture realistic long-context B1 serving patterns (with full generation details and statistics provided in the workload section of the paper). In revision we will add a short parenthetical description of the trace's length statistics and note its role as a representative B1 case, while retaining the 1.40x figure.

Revision: yes

Referee: [Abstract] Mean throughput improvements (1.04x–1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so it is impossible to judge whether the gains are statistically distinguishable from noise.

Authors: We acknowledge that the abstract omits variability measures. The reported means reflect consistent gains across the five held-out seeds under frozen calibration constants. In the revision we will add either standard deviations or the per-seed range for the B8 workloads (and clarify that the B1 result is from a single representative trace) so readers can assess statistical distinguishability from noise.

Revision: yes
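The variability reporting discussed in this exchange takes only a few lines. The sketch below assumes per-seed throughput measurements (tokens per second) for the baseline and for the policy, and summarizes the per-seed speedup ratios. Whether the paper's mean is a mean of ratios or a ratio of means is not stated, so that choice is an assumption here, as is the function name.

```python
import statistics
from typing import Dict, Sequence

def summarize_speedups(baseline_tps: Sequence[float],
                       candidate_tps: Sequence[float]) -> Dict[str, object]:
    """Per-seed speedup ratios with mean, sample standard deviation, and range."""
    if len(baseline_tps) != len(candidate_tps) or len(baseline_tps) < 2:
        raise ValueError("need matching per-seed measurements for at least two seeds")
    ratios = [c / b for b, c in zip(baseline_tps, candidate_tps)]
    return {
        "per_seed": ratios,
        "mean": statistics.mean(ratios),
        "stdev": statistics.stdev(ratios),  # sample standard deviation (n - 1)
        "min": min(ratios),
        "max": max(ratios),
    }
```

With only five seeds the standard deviation is itself noisy, so reporting the per-seed values alongside it would let readers judge the 1.04x to 1.08x gains directly.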

Circularity Check

0 steps flagged

No circularity; empirical results on held-out data after freezing calibration


The paper reports wall-clock throughput gains from a roofline policy whose constants are fitted once on calibration traces and then frozen; subsequent measurements use five held-out seeds from the same workload families. No equations, self-definitional mappings, or load-bearing self-citations appear in the provided text. The central claims rest on direct, independent execution-time measurements rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical calibration of a scheduling policy and assumptions that the tested workloads and GQA ratios are representative; no new physical entities or mathematical axioms are introduced.

free parameters (1)

cost-model constants
Fixed on calibration traces to choose between FlashInfer and PersistentKV schedules for different batch sizes and context lengths.
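The form of the cost model is not given in the text available here. As a rough illustration of what "roofline-style with fitted constants" could mean for decode attention, the sketch below assumes step time is KV-cache traffic divided by an effective bandwidth, plus a fixed overhead per kernel launch. Both constants and both function names are hypothetical stand-ins for whatever the paper calibrates.

```python
def kv_bytes_per_layer(seq_len: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of K and V read to decode one token against seq_len cached tokens."""
    return 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem

def estimate_step_seconds(kv_bytes: float, num_launches: int,
                          eff_bandwidth_bytes_per_s: float,
                          launch_overhead_s: float) -> float:
    """Assumed cost model: memory-traffic term plus per-launch overhead term."""
    return kv_bytes / eff_bandwidth_bytes_per_s + num_launches * launch_overhead_s
```

For the configuration in the abstract (FP16, Hkv=8, d=128), the first function gives 4096 bytes per cached token per layer. Under a model of this shape, a schedule that needs fewer launches wins when the launch term is a meaningful share of the step, which is consistent with the reported interest in launch fan-out.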

axioms (1)

domain assumption: The five held-out seeds and the bimodal/uniform/Zipf-like/B1 bucketed traces are representative of production long-context serving traffic.
Used to claim generalization of the 1.04-1.40x gains.



Reference graph

Works this paper leans on

14 extracted references · 6 canonical work pages · 3 internal anchors

[1] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.

[2] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

[3] J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024.

[4] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles, 2023.

[5] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun. Orca: A distributed serving system for Transformer-based generative models. In USENIX Symposium on Operating Systems Design and Implementation, 2022.

[6] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee. Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023.

[7] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. arXiv preprint arXiv:2403.02310, 2024.

[8] P. Patel, E. Choukse, C. Zhang, A. Shah, I. Goiri, S. Maleki, and R. Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. arXiv preprint arXiv:2311.18677, 2023.

[9] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2023.

[10] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Ré, and B. Chen. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, 2023.

[11] R. Prabhu, V. Janapa Reddi, and M. Zaharia. vAttention: Dynamic memory management for serving LLMs without PagedAttention. arXiv preprint arXiv:2405.04437, 2024.

[12] FlashInfer contributors. FlashInfer: Kernel library for LLM serving. https://github.com/flashinfer-ai/flashinfer

[13] NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM

[14] vLLM contributors. vLLM: Easy, fast, and cheap LLM serving. https://github.com/vllm-project/vllm
