pith. sign in

arxiv: 2605.19049 · v1 · pith:54ILOMC2new · submitted 2026-05-18 · 💻 cs.LG · cs.AI

KVBuffer: IO-aware Serving for Linear Attention

Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords linear attentionKV bufferserving systemdecoding latencyspeculative decodingmemory IO optimizationchunkwise computation
0
0 comments X

The pith

KVBuffer reduces linear attention decoding latency by buffering keys and values to enable chunked and batched state updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear attention keeps decoding cost constant with context length but existing systems update a large state recurrently at every token, causing heavy memory traffic. KVBuffer stores recent keys and values so that state updates can be deferred and performed in chunks or batches instead of per token. This preserves exact numerical results while cutting average memory access. The same buffer supports parallel verification of multiple draft tokens in speculative decoding and direct attention computation for short contexts without ever building the state.

Core claim

By buffering recent keys and values, KVBuffer permits linear attention outputs to be computed through chunkwise or batched state updates rather than recurrent per-token updates, which lowers memory access volume and improves serving throughput while preserving exact numerical equivalence.

What carries the argument

KVBuffer, a buffer holding recent keys and values that defers and batches linear attention state updates to reduce IO costs.

If this is right

  • Decoding latency falls because state updates occur less often and in larger batches that exploit better memory locality.
  • Speculative decoding verifies four draft tokens in parallel without allocating temporary states for each.
  • Short-context requests skip state creation and update entirely by reading directly from the key-value buffer.
  • Maximum concurrent requests rise as each request requires less memory bandwidth per generated token.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same buffering pattern could be applied to other recurrent state mechanisms that suffer from per-step IO.
  • Hardware-aware chunk sizing could further reduce latency on specific GPUs or TPUs.
  • Models with even longer contexts would see the largest relative gains because the fixed-size state becomes a smaller fraction of total traffic.

Load-bearing premise

That reordering state updates into chunks and batches produces identical numerical results to per-token recurrent updates while only changing memory access patterns.

What would settle it

Run identical token sequences through both the original recurrent linear attention and the KVBuffer chunked version, then verify that output logits or hidden states match to machine precision.

Figures

Figures reproduced from arXiv: 2605.19049 by Lin Zhong, Longwei Zou.

Figure 1
Figure 1. Figure 1: KVBuffer Design. We partition the memory for KV buffers into blocks, each of which can store 6 KVs. Each request has two blocks. During decoding, the serving system loads state along with buffered KVs to compute attention output. When the buffer is full, the state is updated with all buffered KVs on GPU and the updated state will be written back to the state slot. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Speculative Decoding with KVBuffer. Existing serving systems only have a state pool and use the recurrent form for speculative decoding verification. Therefore, they have to store a temporary state for each draft token. After determining that the accepted draft tokens are 0 and 2, the state of this request is replaced by the temporary state2. In contrast, KVBuffer buffers the KV for each draft tokens and u… view at source ↗
Figure 3
Figure 3. Figure 3: Kernel latency of chunkwise decoding with KVBuffer. Latency is normalized by the corresponding recurrent decoding latency, i.e., the case with buffer size m = 0. Chunkwise de￾coding latency is averaged over a full KVBuffer cycle, including decoding with buffer occupan￾cies from 0 to m −1 and the state update latency. 2 4 6 8 Number of Draft Tokens 0 1 2 3 4 Latency (ms) 1.23 1.25 2.09 1.28 2.98 1.33 3.88 1… view at source ↗
Figure 5
Figure 5. Figure 5: End-to-end serving throughput with speculative decoding. By avoiding the storage of temporary states for draft tokens, KVBuffer sus￾tains higher request rates and improves overall throughput. 16 32 64 128 256 Context Length 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 Normalized Latency Recurrent Decoding Chunkwise Decoding KV-Only Decoding [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes KVBuffer, an IO-aware serving system for linear attention models. It buffers recent keys and values to support chunkwise state updates during decoding (deferring and batching updates to reduce memory traffic), parallel draft verification in speculative decoding without temporary states, and direct KV-based attention for short contexts without maintaining the linear state. The method is implemented in SGLang for Qwen3-Next; the abstract reports up to 45.17% lower decoding latency and up to 5x higher maximum serving requests under speculative decoding with four draft tokens.

Significance. If the numerical equivalence of chunkwise updates holds and the performance numbers are reproducible, the work addresses a practical bottleneck in serving linear-attention models for long contexts by improving memory access patterns without changing the underlying algorithm. The concrete implementation and reported speedups constitute a strength; however, the absence of detailed experimental methodology limits the strength of the empirical claims.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the reported 45.17% latency reduction and 5x request increase are presented without any description of hardware platform, baseline serving system, measurement methodology (e.g., median vs. mean, warm-up steps), or variance across runs. These omissions make it impossible to assess whether the gains are load-bearing or reproducible.
  2. [Method (chunkwise decoding)] Method description of chunkwise computation: the paper states that KVBuffer “defers state updates and applies them in batch” while preserving correctness, yet provides neither an associativity argument for the specific linear-attention recurrence used in Qwen3-Next nor any numerical verification (e.g., maximum logit difference or FP-error bounds) that the batched result equals the per-token recurrent result to machine precision. This equivalence is load-bearing for the claim that the optimization is exact rather than approximate.
minor comments (1)
  1. [Method] Notation for the linear-attention state update is introduced without an explicit equation reference; adding a numbered equation for S_t = f(S_{t-1}, k_t, v_t) would clarify the subsequent chunkwise reformulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have revised the paper to address the concerns about experimental reproducibility and the formal justification for chunkwise updates. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the reported 45.17% latency reduction and 5x request increase are presented without any description of hardware platform, baseline serving system, measurement methodology (e.g., median vs. mean, warm-up steps), or variance across runs. These omissions make it impossible to assess whether the gains are load-bearing or reproducible.

    Authors: We agree that the original manuscript provided insufficient detail on the experimental setup, which limits assessment of reproducibility. In the revised version, we have added a new subsection titled 'Experimental Setup' in the Evaluation section. This subsection now specifies the hardware platform, the exact baseline serving system (vanilla SGLang without KVBuffer), the measurement methodology including use of median latency after warm-up steps, and reporting of variance across multiple runs. These additions directly address the referee's concerns and strengthen the empirical claims. revision: yes

  2. Referee: [Method (chunkwise decoding)] Method description of chunkwise computation: the paper states that KVBuffer “defers state updates and applies them in batch” while preserving correctness, yet provides neither an associativity argument for the specific linear-attention recurrence used in Qwen3-Next nor any numerical verification (e.g., maximum logit difference or FP-error bounds) that the batched result equals the per-token recurrent result to machine precision. This equivalence is load-bearing for the claim that the optimization is exact rather than approximate.

    Authors: We acknowledge that the original submission did not include an explicit associativity argument or numerical verification for the chunkwise updates, which is a valid concern given that equivalence is central to claiming an exact optimization. The linear attention recurrence in Qwen3-Next admits an associative formulation under the standard state-update rules (matrix scaling and outer-product accumulation). In the revised manuscript, we have added a dedicated paragraph in Section 3.2 with the associativity proof tailored to this recurrence, plus an appendix subsection reporting numerical verification (maximum logit difference below machine epsilon across long sequences). This confirms the batched result matches the recurrent result to floating-point precision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on direct empirical measurements of an implemented system

full rationale

The paper's central claims concern measured latency reductions (up to 45.17%) and throughput gains (5x) from an IO-aware serving mechanism implemented in SGLang for Qwen3-Next. These results are obtained by running the system and recording wall-clock times and request capacities rather than by any mathematical derivation, fitted parameter, or self-referential equation. The description of chunkwise state updates is presented as an engineering optimization whose correctness is implicitly verified by the reported end-to-end numbers; no algebraic identity is asserted as a first-principles result that later reduces to the same identity. Consequently the derivation chain contains no self-definitional, fitted-input, or self-citation load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that linear attention state updates commute with chunked computation when performed in batch, plus standard hardware assumptions about memory access costs. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Linear attention state updates can be deferred and batched without changing the final numerical output
    Implicit in the description of chunkwise computation and deferred state updates.

pith-pipeline@v0.9.0 · 5741 in / 1115 out tokens · 38523 ms · 2026-05-20T11:56:38.104958+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 8 internal anchors

  1. [1]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. SARATHI: efficient LLM inference by piggybacking decodes with chunked prefills.CoRR, abs/2308.16369, 2023

  2. [2]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July...

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.CoRR, abs/2501.12948, 2025

  4. [4]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023

  5. [5]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

  6. [6]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 2020

  7. [7]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating 10 System...

  8. [8]

    Fast inference from transformers via specu- lative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via specu- lative decoding. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Res...

  9. [9]

    EAGLE: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: speculative sampling requires rethinking feature uncertainty. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27...

  10. [10]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax. Minimax-m1: Scaling test-time compute efficiently with lightning attention.CoRR, abs/2506.13585, 2025

  11. [11]

    Marconi: Prefix caching for the era of hybrid llms

    Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid llms. In Matei Zaharia, Gauri Joshi, and Yingyan (Celine) Lin, editors,Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. OpenReview.net/mlsys...

  12. [12]

    Wind, Stanislaw Wozniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Kiran GV , Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Sait...

  13. [13]

    Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

    Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemyslaw Kazienko, Kranthi Ki- ran GV , Jan Kocon, Bartlomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanislaw Wozniak, Ruich...

  14. [14]

    Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

    Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, and Mingxing Zhang. Prefill-as-a-service: KVCache of next-generation models could go cross-datacenter.CoRR, abs/2604.15039, 2026

  15. [15]

    Qwen3-next-80b-a3b

    Qwen Team. Qwen3-next-80b-a3b. https://qwen.ai/blog?id= 4074cca80393150c248e508aa62983f9cb7d27cd, September 2025. Blog post, accessed May 5, 2026

  16. [16]

    Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  17. [17]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.CoRR, abs/2307.08621, 2023

  18. [18]

    Stree: Speculative tree decoding for hybrid state-space models.CoRR, abs/2505.14969, 2025

    Yangchao Wu, Zongyue Qin, Alex Wong, and Stefano Soatto. Stree: Speculative tree decoding for hybrid state-space models.CoRR, abs/2505.14969, 2025. 11

  19. [19]

    Gated delta networks: Improving mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  20. [20]

    Gated linear attention transformers with hardware-efficient training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Aus...

  21. [21]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Infor- mat...

  22. [22]

    Orca: A distributed serving system for transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In Marcos K. Aguilera and Hakim Weatherspoon, editors,16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 521–538. USENIX Associat...

  23. [23]

    Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y . Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, B...

  24. [24]

    Gonzalez, Clark W

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zh...