Adaptive KV Cache Reuse for Fast Long-Context LLM Serving

Fei Li; Jinhua Cui; Jinyu Wang; Shiqiang Nie; Song Liu; Weiguo Wu; Yan Liu

arxiv: 2605.24022 · v1 · pith:KFKCSGA4new · submitted 2026-05-20 · 💻 cs.AR · cs.DC

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving

Fei li , Song Liu , Yan Liu , Jinhua Cui , Shiqiang Nie , Jinyu Wang , Weiguo Wu This is my paper

Pith reviewed 2026-06-30 17:21 UTC · model grok-4.3

classification 💻 cs.AR cs.DC

keywords KV cache reuselong-context LLM inferenceprefill optimizationnon-prefix cachingfrequency-domain analysishardware-aware servingTTFT reductioncompute-I/O overlap

0 comments

The pith

CacheTune recovers cross-attention in non-prefix KV reuse by recomputing only frequency-critical tokens and overlapping I/O with computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to reuse KV caches in long-context LLM inference even when new inputs do not form exact prefixes of prior ones. Standard prefix caching breaks global attention relationships across chunks and harms output quality, while full recomputation wastes time. CacheTune runs an offline frequency-domain analysis to rank KV pairs by importance for attention recovery, then recomputes only the top-ranked ones online. The remaining pairs are reused directly, with additional techniques that hide data movement behind computation across GPU, SSD, and HDD tiers. If the method holds, prefill latency drops sharply without quality loss and the system stays effective even when caches sit in slow external storage.

Core claim

CacheTune identifies the KV pairs most critical to cross-attention recovery through offline frequency-domain analysis, selectively recomputes only those semantic-critical tokens online, and reuses the rest while applying sparse KV transfer, multi-stream asynchronous overlap, deferred positional-encoding recovery, and hardware-aware adaptive recomputation-ratio tuning to balance compute and data movement across heterogeneous cache pools.

What carries the argument

frequency-guided selective recomputation of semantic-critical tokens together with sparse transfer and multi-stream asynchronous overlap

If this is right

TTFT drops by 3.72x-4.86x and throughput rises by 3.93x-6.21x on mainstream LLMs with quality close to full recompute.
The same speedups hold when reusable KV data resides on SSD or HDD instead of GPU memory.
Non-prefix KV reuse becomes practical without forcing strict prefix alignment between requests.
Hardware-specific tuning automatically balances recomputation ratio against I/O cost for different storage tiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The offline frequency ranking could be refreshed periodically on representative workloads rather than once per model.
The same selective-recompute logic might apply to other attention patterns such as grouped-query or multi-query attention.
If frequency signatures prove stable across similar tasks, the analysis step could move online with modest overhead.
Extending the method to dynamic context lengths would require testing whether the critical-token set changes smoothly with input size.

Load-bearing premise

Frequency-domain analysis performed offline can reliably identify the KV pairs whose selective recomputation restores cross-chunk global attention relationships sufficiently to avoid quality degradation in non-prefix reuse scenarios.

What would settle it

Run CacheTune on a non-prefix long-context task and observe whether its generation quality metrics fall measurably below those of full recompute while TTFT remains lower.

Figures

Figures reproduced from arXiv: 2605.24022 by Fei Li, Jinhua Cui, Jinyu Wang, Shiqiang Nie, Song Liu, Weiguo Wu, Yan Liu.

**Figure 1.** Figure 1: Comparison of cross-attention restoration strategies in non-prefix KV Cache reuse. selectively recomputes a fixed fraction; however, the firstlayer recomputation overhead is irreducible, and its selection is based solely on the first-layer attention deviation. Subsequently, EPIC [10] recomputes only the first 𝑘 attentionsink [31] positions of each chunk to recover cross-attention at extremely low cost, … view at source ↗

**Figure 2.** Figure 2: Energy distribution of the KV Cache along the sequence dimension in the frequency domain. (a) (b) (c) (d) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Cross-attention weight heatmaps from the suffix query to historical KV Chunks under different recomputation strategies: (a) full recompute; (b) direct KV Cache reuse without recomputation; (c) recomputing only the tokens corresponding to the top 15% low-frequency components; (d) recomputing only the top 15% high-frequency components. Beyond the semantic-level cross-attention loss, the practical gain of … view at source ↗

**Figure 4.** Figure 4: Impact of recomputation ratio on KV Cache reuse latency. The 0% and 100% curves correspond to full KV reuse and full GPU recomputation. CPU-memory cache favors low recomputation ratios, while HDD cache requires more recomputation to reduce I/O overhead. not determined simply by whether reuse is performed, but by the interaction between the cache medium and recomputation cost. Without unified modeling and … view at source ↗

**Figure 5.** Figure 5: Overview of CacheTune. tokens during offline KV Cache generation and, at online inference time, fuses their selectively recomputed KVs with the directly reused KVs of the remaining tokens. 4.2 KV Cache Offloading and Sparse Reuse To reduce GPU memory pressure, CacheTune places precomputed KV Chunks in a GPU-external cache pool, which may reside in CPU memory, SSDs, or lower-tier storage. A key design of C… view at source ↗

**Figure 6.** Figure 6: KV Cache offloading and sparse reuse pipeline. which gathers the offline KV Chunks intended for reuse; (2) the check layer, where the system applies the precomputed index set to sparsely filter the query and computes attention only for the corresponding tokens; and (3) the fusion layers, comprising every layer subsequent to the check layer. For each Fusion layer, the system waits for the sparse transfer to… view at source ↗

**Figure 7.** Figure 7: Accuracy–TTFT trade-off of CacheTune compared with baseline methods across different models and dataset tasks [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 9.** Figure 9: Effect of the recomputation ratio 𝑟 on model accuracy and TTFT speedup. GPU memory versus CPU memory. For comparison, we further include CacheBlend, CacheSlide, and vLLM’s native Prefix Caching, all configured with the same cache medium as CacheTune. The results, using Mistral, are reported in Table 2. For GPU-resident caches, CacheTune follows the native GPU cache-reuse path; for CPU-resident caches, i… view at source ↗

**Figure 8.** Figure 8: TTFT trends under increasing request rates. Curves extending further to the right with lower TTFT indicate higher effective throughput under low-latency serving. TTFT bottleneck. We compare CacheTune’s TTFT under a 15% recomputation ratio when the KV Cache is stored in [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 10.** Figure 10: Effectiveness comparison of different recomputation-token selection strategies [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

read the original abstract

In long-context Large Language Model (LLM) inference, the Time-To-First-Token (TTFT) latency incurred by the prefill stage has become the foremost bottleneck limiting interactive performance and deployment cost. KV Cache reuse offers a direct path to reduce redundant prefill, yet traditional prefix caching applies only to strict-prefix scenarios; directly reusing KV Cache in non-prefix settings breaks the cross-chunk global attention relationships and causes significant degradation in generation quality. When reusable KV Cache is offloaded to GPU-external cache pools, I/O overheads across heterogeneous hardware tiers further emerge as a new TTFT bottleneck. Efficient non-prefix KV Cache reuse therefore requires both semantic-consistency recovery and compute-I/O co-optimization. This paper presents CacheTune, a frequency-guided and hardware-aware KV Cache reuse system for long-context LLM serving. CacheTune first identifies, offline, the KV pairs most critical to cross-attention recovery through frequency-domain analysis, and then selectively recomputes only these semantic-critical tokens online while reusing the remaining KVs. To turn this semantic selection into end-to-end latency reduction, CacheTune further combines sparse KV transfer, multi-stream asynchronous overlap, deferred positional-encoding recovery, and hardware-aware adaptive recomputation-ratio tuning to balance computation and data movement across heterogeneous cache pools. Evaluations on mainstream LLMs and long-context tasks show that CacheTune achieves 3.72x-4.86x TTFT speedup and 3.93x-6.21x higher throughput while maintaining generation quality close to full recompute. Even when caches are offloaded to I/O-bound SSD/HDD storage, CacheTune sustains 2.34x-2.36x TTFT speedup through adaptive recomputation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CacheTune's frequency-guided selective recompute for non-prefix KV reuse targets a real serving bottleneck but the abstract leaves the speedups and quality claims hard to assess without more experimental detail.

read the letter

The core idea is using offline frequency analysis on KV tensors to pick a sparse set of tokens for recomputation when reusing cache across non-prefix chunks, then layering on sparse transfers, multi-stream overlap, deferred positional encoding, and adaptive recompute ratios tuned to the hardware tier. This is the part that goes beyond standard prefix caching.

The work does address a practical pain point: TTFT from prefill dominates interactive long-context serving, and offloaded caches add I/O costs. The combination of semantic selection with hardware co-optimization is a reasonable systems move, and the reported numbers (roughly 4x TTFT, 5x throughput, still holding with SSD offload) would matter if they hold up.

The soft spot is the missing experimental grounding. The abstract states quality stays close to full recompute but gives no datasets, models, baselines, or measurement method for generation quality. Without that, it's difficult to know whether the frequency selection actually recovers the necessary cross-chunk attention relationships or just works on the tested cases. The assumption that dominant frequency components reliably flag the semantically critical KV pairs is plausible but not obviously guaranteed; attention depends on token-specific dependencies that may not align with signal frequency. If the full paper has ablations across tasks and scales showing the selected pairs are both necessary and sufficient, that would strengthen it.

This is for systems people working on LLM inference stacks rather than core model research. The problem is real and the approach is concrete enough that a serious referee could check the claims and suggest targeted experiments. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents CacheTune, a frequency-guided and hardware-aware KV cache reuse system for long-context LLM serving. It performs offline frequency-domain analysis to identify critical KV pairs for selective recomputation in non-prefix reuse scenarios (to recover cross-chunk attention), then applies sparse KV transfer, multi-stream asynchronous overlap, deferred positional-encoding recovery, and hardware-aware adaptive recomputation-ratio tuning to co-optimize compute and I/O across GPU and external storage tiers. The abstract reports 3.72x-4.86x TTFT speedup, 3.93x-6.21x higher throughput, and 2.34x-2.36x TTFT speedup under SSD/HDD offload, while claiming generation quality close to full recompute.

Significance. If the frequency-domain selection reliably recovers necessary cross-chunk attention relationships without quality loss and the I/O-compute optimizations deliver the reported gains across models and tasks, the work would be a practical contribution to efficient long-context serving, particularly for heterogeneous memory hierarchies. The empirical speedups and the explicit handling of non-prefix reuse plus offloading are notable strengths if substantiated by rigorous experiments.

major comments (2)

[Abstract] Abstract: the central claim that offline frequency-domain analysis identifies a sparse subset of KV pairs whose selective recomputation suffices to restore cross-chunk global attention relationships (and thereby preserve quality) is load-bearing for all reported speedups and quality results, yet the abstract supplies neither derivation, necessity/sufficiency argument, nor ablation evidence that the selected pairs align with semantic dependencies rather than signal-like properties; this directly engages the stress-test concern and prevents evaluation of soundness.
[Abstract] Abstract: concrete speedups (3.72x-4.86x TTFT, 3.93x-6.21x throughput) and quality preservation are stated without any reference to experimental setup, models, datasets, baselines, statistical significance, or quality metrics (e.g., perplexity vs. task accuracy), rendering the claims impossible to assess; this is a load-bearing omission for an empirical systems paper.

minor comments (1)

[Abstract] Abstract: the description of 'adaptive recomputation-ratio tuning' is introduced without indicating whether the ratio is a free parameter or derived parameter-free from hardware characteristics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the abstract. We address each major comment below and indicate where revisions to the manuscript will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that offline frequency-domain analysis identifies a sparse subset of KV pairs whose selective recomputation suffices to restore cross-chunk global attention relationships (and thereby preserve quality) is load-bearing for all reported speedups and quality results, yet the abstract supplies neither derivation, necessity/sufficiency argument, nor ablation evidence that the selected pairs align with semantic dependencies rather than signal-like properties; this directly engages the stress-test concern and prevents evaluation of soundness.

Authors: The abstract is intentionally concise. The frequency-domain analysis, including the mathematical derivation showing how selected frequency components recover cross-chunk attention relationships, the necessity/sufficiency argument, and evidence that the pairs align with semantic dependencies (via attention correlation and ablation), appears in Section 3.2. Section 5.3 presents ablations confirming quality preservation relative to full recompute. We will revise the abstract to add a one-sentence reference to the offline frequency-guided selection and its grounding in attention recovery. revision: yes
Referee: [Abstract] Abstract: concrete speedups (3.72x-4.86x TTFT, 3.93x-6.21x throughput) and quality preservation are stated without any reference to experimental setup, models, datasets, baselines, statistical significance, or quality metrics (e.g., perplexity vs. task accuracy), rendering the claims impossible to assess; this is a load-bearing omission for an empirical systems paper.

Authors: We agree that the abstract would benefit from additional context for immediate assessment. We will revise it to name the evaluated models, long-context benchmarks, quality metrics (perplexity and task accuracy), and note that speedups are relative to the non-reuse baseline with results averaged across runs. Full experimental details, including statistical reporting, remain in Sections 4 and 5. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with measured speedups

full rationale

The paper describes CacheTune as an engineering system that performs offline frequency-domain analysis to select KV pairs for selective recomputation, then applies hardware-aware optimizations (sparse transfer, async overlap, etc.) and reports measured TTFT/throughput gains on benchmarks. No equations, fitted parameters, or self-citations are presented that would make the reported speedups reduce by construction to quantities defined inside the method; the performance numbers are external empirical outcomes. The frequency-selection step is an algorithmic heuristic whose correctness is evaluated by quality metrics, not derived tautologically from the reuse policy itself. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The paper is an engineering systems contribution whose central claim rests on empirical measurements of the proposed optimizations rather than on new mathematical axioms or invented physical entities.

free parameters (1)

adaptive recomputation ratio
The system tunes the fraction of tokens to recompute based on hardware; the abstract does not specify how this ratio is chosen or fitted.

pith-pipeline@v0.9.1-grok · 5854 in / 1191 out tokens · 36677 ms · 2026-06-30T17:21:03.388504+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
cs.AI 2026-06 unverdicted novelty 4.0

E³RL uses dynamic thresholds on epistemic entropy from autoregressive cross-entropy to enable erasable RL in LLM reasoning, reporting 5.349% and 6.514% gains on AIME for 4B and 8B models over prior SOTA.

Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Ma- hapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

2025
[2]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context under- standing. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137

2024
[3]

Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express Link (CXL) Interconnect. ACM Comput. Surv.56, 11, Article 290 (July 2024), 37 pages. doi:10. 1145/3669900

2024
[4]

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summariza- tion dataset and abstractive hierarchical model. InProceedings of the 57th annual meeting of the association for computational linguistics. 1074–1084

2019
[5]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost- Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX annual technical conference (USENIX ATC 24). 111–126

2024
[6]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandel- wal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems6 (2024), 325–338

2024
[7]

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer
[8]

InProceedings of the 2nd Workshop on New Frontiers in Summarization

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. InProceedings of the 2nd Workshop on New Frontiers in Summarization. 70–79
[9]

Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, and Zhouhan Lin. 2023. Fourier transformer: Fast long range modeling by removing sequence redundancy with fft operator. InFindings of the Association for Computational Linguistics: ACL 2023. 8954–8966

2023
[10]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics. 6609–6625

2020
[11]

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2025. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceedings of Machine Learning ...

2025
[12]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report.CoRRabs/2409.12186 (2024). arXiv:2409.12186 doi:10.48550/ARXIV.2409.12186

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024
[13]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bres- sand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.CoRRabs/231...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
[14]

Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. 2025. Freqkv: Frequency domain key-value compression for efficient context window extension.arXiv preprint arXiv:2505.00570 (2025)

work page arXiv 2025
[15]

2019.Algorithms for optimization

Mykel J Kochenderfer and Tim A Wheeler. 2019.Algorithms for optimization. Mit Press

2019
[16]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica
[17]

InProceedings of the 29th symposium on operating systems principles

Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626
[18]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024
[19]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al . 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

2020
[20]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

2004
[21]

Song Liu, Fei Li, Chenyu Zhao, Qin Xia, Shiqiang Nie, Jinyu Wang, and Weiguo Wu. 2026. FDSR: Efficient Model Training via Adaptive Tensor Quantization Based on Frequency Domain Division and Similarity Data Reuse.ACM Trans. Archit. Code Optim.(April 2026). doi:10.1145/ 3802593Just Accepted

2026
[22]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al
[23]

Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

work page arXiv 2025
[24]

Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. 2026. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving. In24th USENIX Conference on File and Storage Technologies (FAST 26). 83–99

2026
[25]

Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, et al
[26]

CacheGen: KV cache compression and streaming for fast large language model serving, 2023

Cachegen: Fast context loading for language model applications. 13 Fei Li, Song Liu, Yan Liu, Jinhua Cui, Shiqiang Nie, Jinyu Wang, and Weiguo Wu arXiv preprint arXiv:2310.07240(2023)

work page arXiv 2023
[27]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

2024
[28]

Zebin Ren, Krijn Doekemeijer, Tiziano De Matteis, Christian Pinto, Radu Stoica, and Animesh Trivedi. 2025. An i/o characterizing study of offloading llm models and kv caches to nvme ssd. InProceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. 23–33

2025
[29]

Minseok Seo, Jungi Hyun, Seongho Jeong, Xuan Truong Nguyen, Hyuk-Jae Lee, and Hyokeun Lee. 2025. OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems. IEEE Computer Architecture Letters(2025)

2025
[30]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning. PMLR, 31094–31116

2023
[31]

Llama Team. 2024. The Llama 3 Herd of Models.CoRRabs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
[32]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Ques- tion Composition.Transactions of the Association for Computational Linguistics10 (2022), 539–554

2022
[33]

Guanchu Wang, Zirui Liu, Zhimeng Jiang, Ninghao Liu, Na Zou, and Xia Hu. 2023. Division: memory efficient training via dual activation precision. InInternational Conference on Machine Learning. PMLR, 36036–36057

2023
[34]

2012.Discrete-time signal processing: an algebraic approach

Darrell Williamson. 2012.Discrete-time signal processing: an algebraic approach. Springer Science & Business Media

2012
[35]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InThe Twelfth International Conference on Learning Represen- tations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=NG7sS51zVF

2024
[36]

Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, and Zhenxuan Pan. 2024. Layerkv: Optimizing large language model serving with layer-wise kv cache management.arXiv preprint arXiv:2410.00428(2024)

work page arXiv 2024
[37]

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. 2020. Learning in the frequency domain. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1740–1749

2020
[38]

Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. 2025. CacheClip: Accelerating RAG with Effective KV Cache Reuse.arXiv preprint arXiv:2510.10129(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. 2025. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.CoRRabs/2502.16002 (2025). arXiv:2502.16002 doi:10.48550/ ARXIV.2502.16002

work page arXiv 2025
[40]

In: Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - Novem- ber 4, 2018, Ellen Riloff,...

work page doi:10.18653/v1/d18-1259 2018
[41]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

2025
[42]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.https://openreview.net/forum?id=WE_vluYUL-X

2023
[43]

Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11608–11620

2024
[44]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured lan- guage model programs.Advances in neural information processing systems37 (2024), 62557–62583

2024
[45]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210. 14

2024

[1] [1]

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Ma- hapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

2025

[2] [2]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context under- standing. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137

2024

[3] [3]

Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express Link (CXL) Interconnect. ACM Comput. Surv.56, 11, Article 290 (July 2024), 37 pages. doi:10. 1145/3669900

2024

[4] [4]

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summariza- tion dataset and abstractive hierarchical model. InProceedings of the 57th annual meeting of the association for computational linguistics. 1074–1084

2019

[5] [5]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost- Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX annual technical conference (USENIX ATC 24). 111–126

2024

[6] [6]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandel- wal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems6 (2024), 325–338

2024

[7] [7]

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer

[8] [8]

InProceedings of the 2nd Workshop on New Frontiers in Summarization

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. InProceedings of the 2nd Workshop on New Frontiers in Summarization. 70–79

[9] [9]

Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, and Zhouhan Lin. 2023. Fourier transformer: Fast long range modeling by removing sequence redundancy with fft operator. InFindings of the Association for Computational Linguistics: ACL 2023. 8954–8966

2023

[10] [10]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics. 6609–6625

2020

[11] [11]

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2025. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 (Proceedings of Machine Learning ...

2025

[12] [12]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report.CoRRabs/2409.12186 (2024). arXiv:2409.12186 doi:10.48550/ARXIV.2409.12186

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024

[13] [13]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bam- ford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bres- sand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.CoRRabs/231...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023

[14] [14]

Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Ziwei He, Bo Jiang, and Zhouhan Lin. 2025. Freqkv: Frequency domain key-value compression for efficient context window extension.arXiv preprint arXiv:2505.00570 (2025)

work page arXiv 2025

[15] [15]

2019.Algorithms for optimization

Mykel J Kochenderfer and Tim A Wheeler. 2019.Algorithms for optimization. Mit Press

2019

[16] [16]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

[17] [17]

InProceedings of the 29th symposium on operating systems principles

Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

[18] [18]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024

[19] [19]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al . 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

2020

[20] [20]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

2004

[21] [21]

Song Liu, Fei Li, Chenyu Zhao, Qin Xia, Shiqiang Nie, Jinyu Wang, and Weiguo Wu. 2026. FDSR: Efficient Model Training via Adaptive Tensor Quantization Based on Frequency Domain Division and Similarity Data Reuse.ACM Trans. Archit. Code Optim.(April 2026). doi:10.1145/ 3802593Just Accepted

2026

[22] [22]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al

[23] [23]

Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

work page arXiv 2025

[24] [24]

Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. 2026. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving. In24th USENIX Conference on File and Storage Technologies (FAST 26). 83–99

2026

[25] [25]

Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, et al

[26] [26]

CacheGen: KV cache compression and streaming for fast large language model serving, 2023

Cachegen: Fast context loading for language model applications. 13 Fei Li, Song Liu, Yan Liu, Jinhua Cui, Shiqiang Nie, Jinyu Wang, and Weiguo Wu arXiv preprint arXiv:2310.07240(2023)

work page arXiv 2023

[27] [27]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

2024

[28] [28]

Zebin Ren, Krijn Doekemeijer, Tiziano De Matteis, Christian Pinto, Radu Stoica, and Animesh Trivedi. 2025. An i/o characterizing study of offloading llm models and kv caches to nvme ssd. InProceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. 23–33

2025

[29] [29]

Minseok Seo, Jungi Hyun, Seongho Jeong, Xuan Truong Nguyen, Hyuk-Jae Lee, and Hyokeun Lee. 2025. OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems. IEEE Computer Architecture Letters(2025)

2025

[30] [30]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning. PMLR, 31094–31116

2023

[31] [31]

Llama Team. 2024. The Llama 3 Herd of Models.CoRRabs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[32] [32]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Ques- tion Composition.Transactions of the Association for Computational Linguistics10 (2022), 539–554

2022

[33] [33]

Guanchu Wang, Zirui Liu, Zhimeng Jiang, Ninghao Liu, Na Zou, and Xia Hu. 2023. Division: memory efficient training via dual activation precision. InInternational Conference on Machine Learning. PMLR, 36036–36057

2023

[34] [34]

2012.Discrete-time signal processing: an algebraic approach

Darrell Williamson. 2012.Discrete-time signal processing: an algebraic approach. Springer Science & Business Media

2012

[35] [35]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InThe Twelfth International Conference on Learning Represen- tations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=NG7sS51zVF

2024

[36] [36]

Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, and Zhenxuan Pan. 2024. Layerkv: Optimizing large language model serving with layer-wise kv cache management.arXiv preprint arXiv:2410.00428(2024)

work page arXiv 2024

[37] [37]

Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. 2020. Learning in the frequency domain. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 1740–1749

2020

[38] [38]

Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. 2025. CacheClip: Accelerating RAG with Effective KV Cache Reuse.arXiv preprint arXiv:2510.10129(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. 2025. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.CoRRabs/2502.16002 (2025). arXiv:2502.16002 doi:10.48550/ ARXIV.2502.16002

work page arXiv 2025

[40] [40]

In: Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - Novem- ber 4, 2018, Ellen Riloff,...

work page doi:10.18653/v1/d18-1259 2018

[41] [41]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

2025

[42] [42]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.https://openreview.net/forum?id=WE_vluYUL-X

2023

[43] [43]

Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11608–11620

2024

[44] [44]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured lan- guage model programs.Advances in neural information processing systems37 (2024), 62557–62583

2024

[45] [45]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210. 14

2024