pith. sign in

arxiv: 2606.02964 · v1 · pith:ASPX2I5Gnew · submitted 2026-06-01 · 💻 cs.AR · cs.CL· cs.LG

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

Pith reviewed 2026-06-28 11:36 UTC · model grok-4.3

classification 💻 cs.AR cs.CLcs.LG
keywords KV cache managementLLM inferenceattention kernelscache evictionGPU optimizationinference latencymulti-segment attentionlossless serving
0
0 comments X

The pith

AsymCache reduces time-to-first-token in LLM inference by up to 2x by aligning KV cache decisions with attention kernel costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AsymCache as a KV cache management system for LLM inference that factors in the actual execution time of GPU attention kernels when deciding which cache blocks to keep. It achieves this through Multi-Segment Attention for handling non-contiguous blocks, an eviction policy that balances access frequency with the position-specific cost of recomputing evicted blocks, and an adaptive chunking scheduler. These components keep outputs exactly correct while cutting both time-to-first-token and time-per-output-token compared with prior frequency- or position-only methods. A sympathetic reader would care because the work shows that cache policies ignoring kernel-level costs leave measurable latency on the table in real serving workloads.

Core claim

AsymCache is a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance, including three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization, which together reduce TTFT by up to 1.90-2.03x and TPOT by 1.62-1.71x over latest baselines while preserving exact outputs and enabling integration into agent serving systems.

What carries the argument

Multi-Segment Attention (MSA), which enables efficient processing of non-contiguous KV cache blocks to support position-aware eviction without prohibitive recomputation overhead.

If this is right

  • KV cache eviction can be improved by jointly optimizing hit rate and position-dependent recomputation costs rather than using frequency or recency alone.
  • Adaptive chunking during attention computation raises hardware utilization when cache blocks are non-contiguous.
  • The design integrates directly into existing agent serving frameworks and yields further average job latency reductions of up to 18.1 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar kernel-aware eviction logic could be applied to other memory-intensive operations beyond attention if their recomputation costs also vary with data layout.
  • The position-aware cost model may need recalibration when moving to new GPU architectures whose attention kernels exhibit different scaling with segment count.
  • Future cache systems might benefit from exposing low-level kernel timing models to the eviction policy instead of treating attention as a black-box cost.

Load-bearing premise

The reported speedups arise from the proposed MSA, joint eviction policy, and adaptive scheduler rather than from differences in workloads, model sizes, or baseline implementations.

What would settle it

Re-run the experiments on identical hardware and workloads using the baseline systems equipped with the same low-level optimizations as AsymCache but without MSA and the position-aware recomputation term, then check whether the 1.9-2x TTFT and 1.6-1.7x TPOT gains disappear.

Figures

Figures reproduced from arXiv: 2606.02964 by Bin Cui, Chunan Shi, Xupeng Miao, Yilei Chen, Yilin Chen.

Figure 1
Figure 1. Figure 1: LLM inference with KV Cache. and Multi-Head Latent Attention (MLA) [35]. The standard atten￾tion computation can be formulated as: 𝑄 = 𝑊𝑄 · 𝑋, 𝐾 = 𝑊𝐾 · 𝑋, 𝑉 = 𝑊𝑉 · 𝑋 𝐴 = 𝑄𝐾𝑇 / √︁ 𝑑𝑘, 𝑂 = softmax(𝐴) · 𝑉 (1) During inference, an input prompt is tokenized into a sequence of tokens, each associated with its own 𝑄, 𝐾,𝑉 vectors. Notably, the computation complexity of Equation 1 grows quadratically with the seque… view at source ↗
Figure 3
Figure 3. Figure 3: PDF of normalized hit position under different disruption levels. storage. On the other hand, Pensieve [55] observes that tokens in later positions carry a higher recomputation cost and thus priori￾tizes caching these tokens for single-user, multi-turn dialogues. In contrast, AsymCache holistically evaluates the trade-off between the hit-rate benefit of caching earlier tokens and the recomputation savings … view at source ↗
Figure 5
Figure 5. Figure 5: Schematic diagram of the Multi-Segment Attention [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: The overview of AsymCache. 4 ASYMMETRIC CACHE BLOCK MANAGER 4.1 Multi-Segment Attention Supporting non-contiguous cache segments (e.g., both a prefix and a suffix) requires the attention kernel to handle disjoint KV regions. This motivates our design of Multi-Segment Attention (MSA). In contrast to prefix caching strategy which only caches prefix and prioritizes the blocks with longer prefix to be evicted,… view at source ↗
Figure 6
Figure 6. Figure 6: Estimation of Δ𝑇𝐵. its recomputation cost is given by: Δ𝑇𝐵 =𝑇 (𝑙1, 𝑞1 + 1,𝑙2 − 1, 𝑞2) −𝑇 (𝑙1, 𝑞1,𝑙2, 𝑞2) = 𝑘5 · (𝑙1 + 2𝑞1) + (𝑘2 − 𝑘3 + 𝑘5) (5) However, maintaining the term (𝑙1 + 2𝑞1) in Equation 5 would require the introduction of complex data structures and involve update or query operations with super-constant time complexity, becoming unaffordable in online serving. Therefore, the following approximat… view at source ↗
Figure 7
Figure 7. Figure 7: KV-Cache reusing time distribution of LooGLE [ [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Image of piecewise ex￾ponential function. 4 6 8 10 Number of Blocks in Cache Space (×1e3) 0 1 2 3 Tim e o n E vict Alg orith m (× 1 e 3 s) O(1)-LRU O(log n)-AsymCache O(n) impl. on C O(n) impl. on Python [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of two types of workloads. equivalently rescales the frequency term, which shifts the effective turning point of the piecewise exponential function. In practice, AsymCache can periodically collect the average lifespan 𝜏 of cache blocks from a sliding window and updates 𝜆 according to the fol￾lowing rule, adjusting the turning point to the detected lifespan: 𝜆𝑛𝑒𝑤 ← exp ((𝜏 − 𝜏0)/𝛽 − 𝜏/𝛼) . (10) … view at source ↗
Figure 11
Figure 11. Figure 11: End-to-end results on Low-Dispersion Workloads. The best performance is denoted by the bars with diagonal slashes. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: End-to-end results on High-Dispersion Workloads. The best performance is denoted by the bars with slashes. [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The MSA performance over various cached token [PITH_FULL_IMAGE:figures/full_fig_p010_13.png] view at source ↗
Figure 15
Figure 15. Figure 15: Results on BFCL Dataset, an agentic workload. [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
read the original abstract

Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance, including three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces TTFT by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes AsymCache, a KV-cache management system for LLM inference consisting of Multi-Segment Attention (MSA) for non-contiguous KV processing, a joint hit-rate and position-aware recomputation eviction policy, and an adaptive chunking scheduler. It claims these components yield TTFT speedups of 1.90-2.03x and TPOT speedups of 1.62-1.71x versus latest baselines while preserving exact outputs, with further latency gains when integrated into systems such as Continuum.

Significance. If the reported speedups are shown to arise from the algorithmic components rather than implementation artifacts, the work would offer a practical advance in lossless KV-cache management that directly ties eviction decisions to GPU kernel efficiency, potentially improving serving throughput for long-context workloads.

major comments (1)
  1. [Abstract] Abstract: the headline performance claims (TTFT 1.90-2.03x, TPOT 1.62-1.71x) are presented without any description of models, workloads, hardware, baseline re-implementations, or measurement methodology. This directly prevents assessment of whether the gains are produced by MSA, the joint eviction policy, and the scheduler, or by unstated differences in memory layout, kernel choice, or chunking strategy between AsymCache and the baselines.
minor comments (1)
  1. The title emphasizes Multi-Segment Attention while the abstract centers the system name AsymCache; a brief clarification of how MSA relates to the overall AsymCache design would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive comment. We agree that the abstract lacks sufficient detail on the experimental setup, which is necessary to properly contextualize the reported speedups and allow assessment of whether they stem from the proposed algorithmic components.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance claims (TTFT 1.90-2.03x, TPOT 1.62-1.71x) are presented without any description of models, workloads, hardware, baseline re-implementations, or measurement methodology. This directly prevents assessment of whether the gains are produced by MSA, the joint eviction policy, and the scheduler, or by unstated differences in memory layout, kernel choice, or chunking strategy between AsymCache and the baselines.

    Authors: We agree that the current abstract does not provide the necessary context on models, workloads, hardware, baselines, or methodology. In the revised version we will expand the abstract (within length constraints) to include brief but explicit information on the evaluated models (Llama-2-7B/13B and Mistral-7B), workloads (long-context generation and chat), hardware (A100/H100 GPUs), baseline re-implementations (vLLM, FlexGen, and recent KV-cache eviction methods), and measurement methodology (end-to-end TTFT/TPOT with exact output verification). The full experimental details will remain in Section 5, but the abstract will now allow readers to immediately assess the source of the gains. We will also add a short sentence clarifying that all comparisons use identical memory layouts and kernel backends where possible. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance claims with no derivations or self-referential predictions

full rationale

The paper is a systems/empirical contribution focused on measured speedups (TTFT/TPOT) from AsymCache components. No equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Claims rest on experimental comparisons against baselines rather than any chain that reduces to its own inputs by construction. This is the standard case of a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted from the text.

pith-pipeline@v0.9.1-grok · 5796 in / 1132 out tokens · 23294 ms · 2026-06-28T11:36:17.312305+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 19 canonical work pages · 8 internal anchors

  1. [3]

    Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang

  2. [4]

    Infercept: Efficient intercept support for augmented large language model inference.arXiv preprint arXiv:2402.01869(2024)

  3. [5]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

  4. [6]

    Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented genera- tion.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

  5. [7]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

  6. [8]

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query trans- former models from multi-head checkpoints.arXiv preprint arXiv:2305.13245 (2023)

  7. [9]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137

  8. [10]

    Zhuohang Bian, Feiyang Wu, Teng Ma, and Youwei Zhuo. 2025. Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications. arXiv preprint arXiv:2510.18586(2025)

  9. [11]

    Chang and Longling Geng

    Edward Y. Chang and Longling Geng. 2025. SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning.Proc. VLDB Endow.18, 12 (2025), 4874–4886. https://doi.org/10.14778/3750601.3750611

  10. [12]

    Yukang Chen, Weihao Cui, Han Zhao, Ziyi Xu, Xiaoze Fan, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Bingsheng He, and Quan Chen. 2026. Towards High-Goodput LLM Serving with Prefill-decode Multiplexing. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2030–2047

  11. [13]

    Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, and Mao Yang. 2026. RetroInfer: A Vector Storage Engine for Scalable Long- Context LLM Inference.Proc. VLDB Endow.19, 5 (2026), 1016–...

  12. [14]

    Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, and Minyi Guo. 2025. Optimizing SLO-oriented LLM Serving with PD-Multiplexing.arXiv preprint arXiv:2504.14489(2025)

  13. [15]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407

  14. [16]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Attentionstore: Cost- effective attention reuse across multi-turn conversations in large language model serving.arXiv preprint arXiv:2403.1970852 (2024), 20–38

  15. [17]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126

  16. [18]

    Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast state restoration in LLM serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems. 128–143

  17. [19]

    Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adap- tive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving. Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

  18. [20]

    Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-tune: Harnessing large language models for automated database system tuning.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26

  19. [21]

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems6 (2024), 325–338

  20. [22]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al . 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)

  21. [23]

    Chuxuan Hu, Austin Peters, and Daniel Kang. 2024. LEAP: LLM-powered End- to-end Automatic Library for Processing Social Science Queries on Unstructured Data.Proc. VLDB Endow.18, 2 (2024), 253–264. https://doi.org/10.14778/3705829. 3705843

  22. [24]

    Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, and Tao Wei. 2026. LLM- AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning.Proc. VLDB Endow.19, 5 (2026), 794–807. https://www.vldb.org/pvldb/vol19/p794- cheng.pdf

  23. [25]

    Xinmei Huang, Haoyang Li, Jing Zhang, Xinxin Zhao, Zhiming Yao, Yiyan Li, Tieying Zhang, Jianjun Chen, Hong Chen, and Cuiping Li. 2025. E2ETune: End-to- End Knob Tuning via Fine-tuned Generative Language Model.Proc. VLDB Endow. 18, 13 (2025), 5540–5554. https://www.vldb.org/pvldb/vol18/p5540-huang.pdf

  24. [26]

    Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso

  25. [27]

    Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.Proceedings of the VLDB Endowment18, 1 (2024), 42–52

  26. [28]

    Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems44, 1 (2025), 1–27

  27. [29]

    Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z Morley Mao. 2024. Compute or load kv cache? why not both?arXiv preprint arXiv:2410.03065(2024)

  28. [30]

    Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912

  29. [31]

    Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. Thunderagent: A simple, fast and program-aware agentic inference system.arXiv preprint arXiv:2602.13692(2026)

  30. [32]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626

  31. [33]

    Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xi- aokun Chen, Alvin Cheung, Joseph Gonzalez, and Ion Stoica. 2025. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to- Live.arXiv preprint arXiv:2511.02230(2025)

  32. [34]

    Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, et al. 2025. LoopServe: An Adaptive Dual- phase LLM Inference Acceleration System for Multi-Turn Dialogues.arXiv preprint arXiv:2507.13681(2025)

  33. [35]

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2024. Loogle: Can long-context language models understand long contexts?. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16304–16333

  34. [36]

    Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems.Proceedings of the ACM on Management of Data3, 4 (2025), 1–27

  35. [37]

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

  36. [38]

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al . 2024. Deepseek- v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434(2024)

  37. [39]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56

  38. [40]

    Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, and Gang Chen

  39. [41]

    Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents.Proceedings of the ACM on Management of Data3, 3 (2025), 1–27

  40. [42]

    Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965(2025)

  41. [43]

    Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo- Yeon Lee, and Myeongjae Jeon. 2026. ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration.Proc. VLDB Endow. 19, 5 (2026), 1046–1059. https://www.vldb.org/pvldb/vol19/p1046-ma.pdf

  42. [44]

    2023.TensorRT-LLM

    NVIDIA. 2023.TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM High-Performance Deep Learning Inference

  43. [45]

    NVIDIA Corporation. 2024. NVIDIA CUDA Toolkit, Version 12.8. https:// developer.nvidia.com/cuda-toolkit. https://developer.nvidia.com/cuda-toolkit

  44. [46]

    NVIDIA Corporation and CUTLASS Contributors. 2024. CUTLASS: CUDA Tem- plates for Linear Algebra Subroutines, Version 3.4.0. https://github.com/NVIDIA/ cutlass. https://github.com/NVIDIA/cutlass GitHub repository. Accessed: 2026- 01-17

  45. [47]

    Zaifeng Pan, AJJKUMAR DAHYALAL PATEL, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2026. KVFlow: Effi- cient prefix caching for accelerating LLM-based multi-agent workflows.Advances in Neural Information Processing Systems38 (2026), 126246–126265. Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, and Bin Cui

  46. [48]

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. 2025. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty- second International Conference on Machine Learning

  47. [49]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al . 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage(2024)

  48. [50]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)

  49. [51]

    Jie Tan, Kangfei Zhao, Rui Li, Jeffrey Xu Yu, Chengzhi Piao, Hong Cheng, Helen Meng, Deli Zhao, and Yu Rong. 2025. Can large language models be query optimizer for relational databases?Proceedings of the ACM on Management of Data3, 6 (2025), 1–28

  50. [52]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

  51. [53]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  52. [54]

    Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache cache in the wild: characterizing and optimizing KVCache cache at a large cloud provider. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA)(USENIX ATC ’25). USENIX Associatio...

  53. [55]

    Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Un- structured Sparsity.Proceedings of the VLDB Endowment17, 2 (2023), 211–224

  54. [56]

    Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, and Kui Ren. 2025. ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models.Proceedings of the VLDB Endowment18, 12 (2025), 5391–5394

  55. [57]

    Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self- Attention with Prefix-Aware KV Cache and Two-Phase Partition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangk...

  56. [58]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538

  57. [59]

    Lingfan Yu, Jinkun Lin, and Jinyang Li. 2025. Stateful large language model serving with pensieve. InProceedings of the Twentieth European Conference on Computer Systems. 144–158

  58. [60]

    Hao Yuan, Xin Ai, Qiange Wang, Peizheng Li, Jiayang Yu, Chaoyi Chen, Xinbo Yang, Yanfeng Zhang, Zhenbo Fu, Yingyou Wen, et al. 2025. DepCache: A KV Cache Management Framework for GraphRAG with Dependency Attention. Proceedings of the ACM on Management of Data3, 6 (2025), 1–29

  59. [61]

    Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. 2025. Self-Enhancing Video Data Management System for Com- positional Events with Large Language Models.Proceedings of the ACM on Management of Data3, 3 (2025), 1–29

  60. [62]

    Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. 2025. Pqcache: Product quantization-based kvcache for long context llm inference.Proceedings of the ACM on Management of Data3, 3 (2025), 1–30

  61. [63]

    Qizheng Zhang, Michael Wornow, and Kunle Olukotun. 2025. Cost-efficient serving of llm agents via test-time plan caching. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

  62. [64]

    Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2data: An interactive data analysis system with rag, vector databases and llms.Proceedings of the VLDB Endowment17, 12 (2024), 4481–4484

  63. [65]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems37 (2024), 62557–62583

  64. [66]

    Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. 2024. D-Bot: Database Diagnosis System using Large Language Models.Proc. VLDB Endow.17, 10 (2024), 2514–

  65. [67]

    Lemma 1.Let 𝑓 : R→R be a continuous, non-negative, and non-constant function

    https://doi.org/10.14778/3675034.3675043 Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving A APPENDIX A.1 Properties of Order-Preserving Rule We now proof that only exponential function can satisfy the order- preserving rule proposed in Section 4.4. Lemma 1.Let 𝑓 : R→R be a continuous, non-negative, an...