arxiv: 2606.26666 · v2 · pith:EL6A4QOTnew · submitted 2026-06-25 · 💻 cs.LG

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Pith reviewed 2026-07-02 21:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM serving · KV cache · decode attention · page-aware scheduling · grouped-query attention · workqueue scheduling · long-context inference · GPU throughput

The pith

Page-aware workqueue scheduling over native block tables raises long-context LLM decode throughput by 1.04x to 1.40x on held-out workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PersistentKV, a block-table decode attention engine that maps work by KV-head group and runs only non-empty (row, KV head, sequence split) tasks through a compact workqueue. It targets two problems: GPU under-utilization when few long-context sequences are active during decode, and the tension in mixed-length workloads between many exact-length launches and coarse padded batches. A roofline-style policy, calibrated on traces, chooses PersistentKV for long-context steps at batch size 1 and, on supported GQA setups, at batch size 8, and routes other cases to FlashInfer. Across five held-out seeds, mean wall-clock decode-token throughput improves by 1.04x to 1.40x, and the B4 boundary case and uncalibrated GQA ratios are routed to FlashInfer to avoid regressions.

Core claim

PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule that executes only non-empty row-KV-head-sequence-split tasks. With cost-model constants fixed on calibration traces, a roofline-style policy selects PersistentKV for B1 long-context steps and for supported B8 long-context GQA steps. On held-out seeds this improves mean wall-clock decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads and by 1.40x on a B1 bucketed trace. B4 steps and uncalibrated GQA ratios are routed to FlashInfer to avoid regressions.
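The routing outcomes described in the core claim can be sketched as a small lookup. This is only an illustration of the reported decisions, not the paper's policy: the actual policy is a roofline-style cost comparison whose constants are not given here. The cutoff `LONG_CONTEXT_TOKENS`, the set `CALIBRATED_GQA_RATIOS`, and the names `DecodeStep` and `choose_schedule` are all assumptions introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical placeholders: the paper calibrates its own constants on traces
# and this summary does not state them.
LONG_CONTEXT_TOKENS = 8192        # assumed cutoff for "long-context"
CALIBRATED_GQA_RATIOS = {4}       # Hq / Hkv = 32 / 8 in the reported setup


@dataclass(frozen=True)
class DecodeStep:
    batch_size: int      # number of active sequences in this decode step
    max_context: int     # longest context in the batch, in tokens
    num_q_heads: int
    num_kv_heads: int


def choose_schedule(step: DecodeStep) -> str:
    """Return the schedule a conservative router would pick for one decode step."""
    if step.num_q_heads % step.num_kv_heads != 0:
        return "flashinfer"                      # not a clean GQA grouping
    gqa_ratio = step.num_q_heads // step.num_kv_heads
    is_long = step.max_context >= LONG_CONTEXT_TOKENS
    if gqa_ratio not in CALIBRATED_GQA_RATIOS or not is_long:
        return "flashinfer"                      # uncalibrated or short: baseline
    if step.batch_size == 1:
        return "persistentkv_sequence_split"
    if step.batch_size == 8:
        return "persistentkv_workqueue"
    return "flashinfer"                          # e.g. the B4 boundary case


print(choose_schedule(DecodeStep(1, 32768, 32, 8)))
print(choose_schedule(DecodeStep(4, 32768, 32, 8)))
```

The design point worth noticing is the default: every case that was not calibrated falls through to the baseline kernel, which is how the policy can claim no regressions outside its calibrated regime.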

What carries the argument

The compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks over native block tables.
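A host-side sketch of what "only non-empty tasks" could mean in practice is given below. It assumes a task is a (row, KV head, sequence split) triple covering a contiguous token range, and that splits have a fixed token width; the names `Task`, `build_workqueue`, `padded_task_count`, and the `split_tokens` parameter are hypothetical, and the paper's actual queue layout may differ.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    row: int       # index of the sequence within the batch
    kv_head: int   # KV-head group handled by this task
    split: int     # index of the sequence chunk within the row
    start: int     # first token covered (inclusive)
    end: int       # last token covered (exclusive)


def build_workqueue(seq_lens: List[int], num_kv_heads: int, split_tokens: int) -> List[Task]:
    """Enumerate only the tasks that cover at least one token."""
    queue = []
    for row, length in enumerate(seq_lens):
        num_splits = -(-length // split_tokens)  # ceiling division; 0 for an empty row
        for kv_head in range(num_kv_heads):
            for split in range(num_splits):
                start = split * split_tokens
                end = min(start + split_tokens, length)
                queue.append(Task(row, kv_head, split, start, end))
    return queue


def padded_task_count(seq_lens: List[int], num_kv_heads: int, split_tokens: int) -> int:
    """Task count if every row were padded to the longest sequence in the batch."""
    max_splits = -(-max(seq_lens) // split_tokens)
    return len(seq_lens) * num_kv_heads * max_splits


# Illustrative mixed-length batch (made-up lengths, not a workload from the paper).
lens = [512, 512, 16384, 0]
print(len(build_workqueue(lens, num_kv_heads=8, split_tokens=4096)))  # 48
print(padded_task_count(lens, num_kv_heads=8, split_tokens=4096))     # 128
```

On this made-up batch the compact queue holds 48 tasks where a padded grid would schedule 128, which is the kind of gap the workqueue is meant to close on mixed-length batches.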

Load-bearing premise

The roofline-style policy calibrated on specific traces will continue to select the faster path on untested GQA ratios and workload distributions without causing regressions.

What would settle it

A new workload or GQA ratio where the policy selects PersistentKV yet measured throughput falls below the baseline kernel would falsify the central performance claim.

Original abstract

Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce fragmentation, and mature kernels like FlashInfer provide highly optimized decode attention. However, the best single-kernel implementation is not always the best serving schedule: low-active long-context decode can under-utilize GPUs, while mixed sequence lengths introduce tension between many exact-length launches and coarse padded batches. We present PersistentKV, a native block-table decode attention engine and page-aware scheduling study for grouped-query attention (GQA). PersistentKV maps work by KV-head group, executes directly over native page tables, and adds a compact workqueue schedule executing only non-empty row-KV-head-sequence-split tasks. On an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128), a calibrated roofline-style policy selects FlashInfer for small active batches, PersistentKV sequence splitting for batch size 1 (B1) long-context steps, and PersistentKV workqueue scheduling for supported B8 long-context GQA steps. With cost-model constants fixed on calibration traces, five held-out seeds improve mean wall decode-token throughput by 1.04x to 1.08x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.40x on a B1 bucketed trace. For the B4 boundary case and uncalibrated GQA ratios, the policy avoids regressions by routing to FlashInfer. We also report an attention-plus-MLP timing proxy and workload counters showing workqueue scheduling reduces launch fan-out from 16.00 to 2.00 launches per step on held-out bimodal B8. These results show that work assignment is a decisive serving-system variable.
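The abstract's opening claim, that decode is limited by KV-cache movement, can be made concrete with arithmetic on the reported configuration (FP16, page size 16, Hq=32, Hkv=8, d=128). The sketch below counts K and V bytes for a single layer; the 32,768-token context length is an assumed example value, not one taken from the paper.

```python
def kv_bytes_per_token(num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of K plus V stored per token for one transformer layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem


def kv_bytes_read_per_step(context_len: int, num_kv_heads: int,
                           head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one decode step reads for one sequence and one layer."""
    return context_len * kv_bytes_per_token(num_kv_heads, head_dim, bytes_per_elem)


# Configuration reported in the abstract.
NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM, PAGE_SIZE = 32, 8, 128, 16
FP16_BYTES = 2

per_token = kv_bytes_per_token(NUM_KV_HEADS, HEAD_DIM, FP16_BYTES)
per_page = per_token * PAGE_SIZE
q_per_kv = NUM_Q_HEADS // NUM_KV_HEADS

print(per_token)  # 4096 bytes per token per layer
print(per_page)   # 65536 bytes (64 KiB) per 16-token page per layer
print(q_per_kv)   # 4 query heads share each KV head

# Assumed example context length (not from the paper).
print(kv_bytes_read_per_step(32768, NUM_KV_HEADS, HEAD_DIM, FP16_BYTES))  # 134217728 (128 MiB)
```

Because four query heads share each KV head, a task organized around a KV-head group can serve all four from one read of the same pages, which is a plausible reading of why the engine maps work by KV-head group.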

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PersistentKV, a native block-table decode attention engine for grouped-query attention (GQA) that maps work by KV-head group and executes directly over page tables using a compact workqueue schedule for non-empty tasks. It proposes a calibrated roofline-style policy that selects between PersistentKV (sequence splitting for B1, workqueue for supported B8) and FlashInfer (small batches, B4, uncalibrated GQA), reporting 1.04x–1.08x mean wall-clock decode-token throughput gains on held-out seeds from bimodal/uniform/Zipf B8 workloads and 1.40x on a B1 bucketed trace, with an attention-plus-MLP proxy and launch-fan-out reduction from 16 to 2.

Significance. If the reported gains hold under the fixed cost-model constants, the work shows that page-aware work assignment and scheduling can yield measurable serving throughput improvements on commodity GPUs beyond single-kernel optimization. Strengths include the use of held-out seeds with frozen calibration constants and the conservative routing strategy that avoids regressions on uncalibrated cases. The narrow workload families and single GQA configuration (Hq=32, Hkv=8) limit broader claims about policy robustness.

major comments (2)
  1. [Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.
  2. [Abstract] Mean throughput improvements (1.04x to 1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so readers cannot judge whether the gains are statistically distinguishable from noise.
minor comments (2)
  1. Workload definitions (bimodal, uniform, Zipf-like, bucketed) and the exact calibration vs. held-out split are referenced but not fully specified, hindering reproducibility.
  2. The manuscript mentions an 'attention-plus-MLP timing proxy' and workload counters but does not show the corresponding figures or tables in the provided text.
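The variability reporting requested in major comment 2 is cheap to produce. A minimal sketch follows, assuming per-seed throughput is available for both the baseline and the candidate schedule and that the summary statistic is the mean of per-seed ratios (the paper may instead use a ratio of means). The function name `summarize_speedups` is hypothetical, and the numbers are placeholders for illustration only, not measurements from the paper.

```python
import statistics
from typing import Dict, List


def summarize_speedups(baseline_tps: List[float], candidate_tps: List[float]) -> Dict[str, object]:
    """Per-seed speedups with mean, sample standard deviation, and range."""
    if len(baseline_tps) != len(candidate_tps) or len(baseline_tps) < 2:
        raise ValueError("need matched per-seed measurements from at least two seeds")
    speedups = [c / b for b, c in zip(baseline_tps, candidate_tps)]
    return {
        "per_seed": speedups,
        "mean": statistics.mean(speedups),
        "stdev": statistics.stdev(speedups),  # sample standard deviation (n - 1)
        "min": min(speedups),
        "max": max(speedups),
    }


# Placeholder tokens/s values: NOT data from the paper.
baseline = [100.0, 100.0, 100.0, 100.0, 100.0]
candidate = [104.0, 106.0, 105.0, 107.0, 103.0]

summary = summarize_speedups(baseline, candidate)
print(f"mean {summary['mean']:.3f}x, stdev {summary['stdev']:.3f}, "
      f"range {summary['min']:.2f}x to {summary['max']:.2f}x")
```

With only five seeds, reporting the per-seed range alongside the mean is arguably more informative than a standard deviation alone, since a single seed below 1.00x would matter for the no-regression claim.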

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to incorporate additional details and statistical reporting.

read point-by-point responses
  1. Referee: [Abstract] The 1.40x B1 gain is reported on a single 'bucketed trace' whose definition, sequence-length statistics, and representativeness relative to other B1 long-context workloads are not provided, weakening the central throughput claim for that regime.

    Authors: We agree that the abstract would benefit from a concise definition of the B1 bucketed trace. This trace uses a bucketed sequence-length distribution designed to capture realistic long-context B1 serving patterns (with full generation details and statistics provided in the workload section of the paper). In revision we will add a short parenthetical description of the trace's length statistics and note its role as a representative B1 case, while retaining the 1.40x figure.

    Revision: yes

  2. Referee: [Abstract] Mean throughput improvements (1.04x to 1.08x, 1.40x) are stated without error bars, standard deviations, or per-seed values across the five held-out seeds, so readers cannot judge whether the gains are statistically distinguishable from noise.

    Authors: We acknowledge that the abstract omits variability measures. The reported means reflect consistent gains across the five held-out seeds under frozen calibration constants. In the revision we will add either standard deviations or the per-seed range for the B8 workloads (and clarify that the B1 result is from a single representative trace) so readers can assess statistical distinguishability from noise.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on held-out data after freezing calibration

full rationale

The paper reports wall-clock throughput gains from a roofline policy whose constants are fitted once on calibration traces and then frozen; subsequent measurements use five held-out seeds from the same workload families. No equations, self-definitional mappings, or load-bearing self-citations appear in the provided text. The central claims rest on direct, independent execution-time measurements rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical calibration of a scheduling policy and assumptions that the tested workloads and GQA ratios are representative; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • cost-model constants
    Fixed on calibration traces to choose between FlashInfer and PersistentKV schedules for different batch sizes and context lengths.
axioms (1)
  • domain assumption: The five held-out seeds and the bimodal, uniform, Zipf-like, and B1 bucketed traces are representative of production long-context serving traffic.
    Used to claim generalization of the 1.04x to 1.40x gains.
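How "cost-model constants fixed on calibration traces" might work can be sketched with a deliberately simplified model. The form used here, step time as a per-launch overhead plus a per-byte memory cost, is an assumption chosen because decode attention is described as memory-bound and because launch fan-out is one of the reported counters; the paper's actual roofline-style model is not specified in this summary. The names `CostConstants`, `fit_constants`, and `predict_seconds` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: constants cannot be changed after calibration
class CostConstants:
    sec_per_launch: float
    sec_per_byte: float


def fit_constants(samples: List[Tuple[int, int, float]]) -> CostConstants:
    """Least-squares fit of t ~ a * launches + b * bytes over (launches, bytes, seconds) samples."""
    sll = sum(l * l for l, _, _ in samples)
    sbb = sum(b * b for _, b, _ in samples)
    slb = sum(l * b for l, b, _ in samples)
    slt = sum(l * t for l, _, t in samples)
    sbt = sum(b * t for _, b, t in samples)
    det = sll * sbb - slb * slb
    if det == 0:
        raise ValueError("calibration samples do not separate the two terms")
    a = (slt * sbb - slb * sbt) / det   # solve the 2x2 normal equations
    b = (sll * sbt - slb * slt) / det
    return CostConstants(sec_per_launch=a, sec_per_byte=b)


def predict_seconds(c: CostConstants, launches: int, bytes_moved: int) -> float:
    """Predicted step time under the frozen constants."""
    return c.sec_per_launch * launches + c.sec_per_byte * bytes_moved


# Synthetic calibration samples generated from made-up constants (not paper data).
TRUE_A, TRUE_B = 5e-6, 1e-9
calibration = [(l, b, TRUE_A * l + TRUE_B * b)
               for l, b in [(16, 100_000_000), (2, 100_000_000), (16, 200_000_000)]]

constants = fit_constants(calibration)
# Compare two hypothetical schedules moving the same bytes with different fan-out.
print(predict_seconds(constants, 16, 100_000_000) > predict_seconds(constants, 2, 100_000_000))
```

Fitting once and then evaluating only with the frozen constants is what keeps the held-out measurements independent of the calibration step, which is the property the circularity check relies on.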
