pith. sign in

arxiv: 2605.17613 · v1 · pith:G7JVWCF7new · submitted 2026-05-17 · 💻 cs.AR · cs.LG

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Pith reviewed 2026-05-19 22:05 UTC · model grok-4.3

classification 💻 cs.AR cs.LG
keywords KV cache compressionLLM inferencespeculative decodinglossless verificationthroughput optimizationlong context servingcache swapping
0
0 comments X

The pith

VeriCache achieves identical outputs to full-KV-cache decoding at up to 4 times higher throughput by drafting with compressed caches and verifying in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to resolve the tension between fast but lossy KV cache compression and the need for exact outputs in LLM inference. Lossy methods like token dropping or quantization work for short generations but cause outputs to diverge over longer sequences, breaking tasks such as code generation. VeriCache keeps the full KV cache out of GPU memory and uses the compressed version only to draft candidate tokens, then swaps in the full cache for verification. The approach succeeds by running the drafting step in parallel with the swap, since drafting is limited by HBM bandwidth while swapping is limited by PCIe or network speed, and by using the similarity between compressed and full outputs to draft many tokens per swap. A reader would care because this removes the accuracy risk of compression without sacrificing the throughput gains needed for long-context serving.

Core claim

VeriCache uses the compressed KV cache to draft tokens then verifies those drafts against the full KV cache. It solves the resulting system challenge by parallelizing compressed-KV decoding with full-KV swapping, because the former is HBM-bandwidth-bound and the latter is PCIe- or network-bound, while the frequent similarity of compressed outputs to full outputs permits long drafting horizons that amortize each swap cost. The method applies uniformly to token-dropping and quantization compressors and composes with standard speculative decoding.

What carries the argument

Parallel drafting on compressed KV cache overlapped with full-KV swap-in, enabled by differing bandwidth bottlenecks and output similarity for long draft sequences.

If this is right

  • The same framework works for both long-context decoding and remote prefix caching.
  • Any token-dropping or quantization compressor can be plugged in through the uniform interface.
  • Traditional speculative decoding can be layered on top for additional speedups.
  • Identical outputs are guaranteed regardless of how far the generation proceeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Serving systems could reduce reliance on large amounts of high-bandwidth memory by keeping only compressed caches resident.
  • Similar draft-and-verify patterns might help other lossy approximations inside model inference pipelines.
  • Measuring divergence rates across different model families and tasks would show how often the long-horizon assumption holds in practice.

Load-bearing premise

Compressed and full KV outputs stay similar enough over many tokens to let each full-cache swap be amortized by a long drafting horizon.

What would settle it

A long output sequence where the compressed cache produces tokens that diverge from the full cache within a few steps, forcing short drafts and eliminating the throughput advantage.

Figures

Figures reproduced from arXiv: 2605.17613 by Dongjoo Seo, Jiayi Yao, Junchen Jiang, Kuntai Du, Rui Zhang, Samuel Shen, Shan Lu, Shaoting Feng, Yuhan Liu, Yuyang Huang.

Figure 1
Figure 1. Figure 1: The accuracy–throughput dichotomy. Veri [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Code-generation failure from compressed KV. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sequence-level KL KL(𝑝full(𝑥1:𝑡) ∥ 𝑝lossy(𝑥1:𝑡)) grows roughly linearly in 𝑡 under KVzip 4× (left) and TurboQuant k4v3 (right); at temperature 0.5. the chain rule of KL divergence: KL1:𝑇 ≜ KL 𝑝full(𝑥1:𝑇 ) ∥ 𝑝lossy (𝑥1:𝑇 )  = ∑︁ 𝑇 𝑡=1 E𝑥<𝑡∼𝑝full[KL𝑡] . (2) if per-step KL exceeds 𝜀 > 0, sequence-level KL grows lin￾early: KL1:𝑇 ≥ 𝜀𝑇 . Since KL1:𝑇 equals E𝑥∼𝑝full log(𝑝full(𝑥1:𝑇 )/𝑝lossy (𝑥1:𝑇 )) , this means … view at source ↗
Figure 5
Figure 5. Figure 5: Overview of VeriCache. Tokens drafted with [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: VeriCache’s two settings: long-context decod [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Acceptance rate (left) and ideal speedup (right) [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Acceptance length vs. draft length, comparing [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Composing VeriCache with Eagle. (Left) Ac [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Sustained decoding throughput on long [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: End-to-end request latency vs. request rate on Pipeline 1 (top) and Pipeline 2 (bottom). [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: VeriCache’s speedup over Full KV: Pipeline 1, varying KV-cache budget and HBM/interconnect ratio [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Quality (negative KL-divergence) vs. throughput on Pipeline 1 (top) and Pipeline 2 (bottom). [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional token-dropping and quantization [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Quality vs. throughput: function-call accuracy (top), defense success rate (bottom). [PITH_FULL_IMAGE:figures/full_fig_p012_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Quality vs. throughput on Pipeline 1 long [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗
read the original abstract

The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy-despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work-keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents VeriCache, an inference framework that converts lossy KV cache compression methods into lossless LLM decoding. It uses the compressed KV cache to draft tokens via speculative-style decoding and verifies the drafts against the full KV cache to guarantee identical outputs. The key system insight is that compressed-KV decoding (HBM-bandwidth bound) can be overlapped with full-KV cache swapping (PCIe/network bound), and that output similarity often permits sufficiently long drafting horizons to amortize swap costs. The approach supports long-context decoding, remote prefix caching, a uniform interface for token-dropping and quantization compressors, and composition with conventional speculative decoding. Experiments are reported to deliver up to 4X throughput versus full-KV inference while producing identical outputs.

Significance. If the performance and correctness claims are substantiated, VeriCache would offer a practical way to retain the memory and bandwidth benefits of aggressive KV compression without sacrificing output fidelity, which is especially relevant for long-context serving and distributed prefix caching. The uniform compressor interface and explicit composition with existing speculative decoding are concrete engineering contributions that could be adopted broadly. The bandwidth-overlap insight is a systems-level strength that may generalize beyond the specific setting.

major comments (2)
  1. [Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.
  2. [Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.
minor comments (1)
  1. [Design] The description of the uniform compressor interface could be expanded with a short pseudocode or API sketch to clarify how new compression methods are integrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of VeriCache's potential impact. We address each major comment point by point below and have revised the manuscript to provide the requested details and clarifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.

    Authors: We agree that the abstract correctly identifies the divergence problem as motivation for the work. VeriCache guarantees identical outputs via verification regardless of similarity; however, throughput gains depend on sufficiently long drafting horizons to amortize swaps. The full manuscript reports acceptance rates and horizon statistics across workloads, including code generation and tool calling. In the revision we will add a dedicated table and accompanying text in Section 5 that explicitly quantifies average acceptance rates and drafting horizons for these divergence-prone tasks under the evaluated compressors, allowing readers to assess amortization directly. revision: yes

  2. Referee: [Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.

    Authors: We acknowledge that the abstract's condensed summary omits these details. The full manuscript contains Section 5 with descriptions of models (Llama-2-7B/13B, Mistral-7B), workloads (long-context QA, code generation, tool calling), hardware (A100/H100 GPUs with PCIe/NVLink), methodology (end-to-end tokens/s, per-phase latency breakdowns), and drafting-horizon/acceptance statistics. In the revision we will expand Section 5 with additional tables and a new subsection on bandwidth-overlap measurements to make all parameters and statistics explicit and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on independent systems observations and experiments

full rationale

The paper describes a speculative-decoding-style framework that uses compressed KV for drafting and full KV for verification to guarantee identical outputs. Throughput gains are attributed to measured bandwidth differences (HBM-bound drafting overlapping PCIe-bound swaps) and empirical output similarity allowing amortization; these are external observations about hardware constraints and workload behavior, not self-definitions, fitted parameters presented as predictions, or results that reduce to the paper's own inputs by construction. No equations, uniqueness theorems, or self-citation chains appear in the provided text that would force the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about hardware bandwidth asymmetry and similarity of compressed versus full KV outputs; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Compressed KV cache produces output similar enough to full KV cache to support a long drafting horizon that amortizes full-KV swaps.
    This similarity assumption is required for the amortization argument in the abstract.
  • domain assumption Compressed-KV decoding is HBM-bandwidth-bound while full-KV swap is PCIe/network-bound, enabling effective overlap.
    This hardware assumption underpins the parallelization insight.

pith-pipeline@v0.9.0 · 5851 in / 1361 out tokens · 54999 ms · 2026-05-19T22:05:21.069105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 13 internal anchors

  1. [1]

    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference.Proceedings of Machine Learning and Systems6 (2024), 114– 127

  2. [2]

    Sudhanshu Agrawal, Wonseok Jeon, and Mingu Lee. 2024. Adaedl: Early draft stopping for speculative decoding of large language models via an entropy-based lower bound on token acceptance probability. arXiv preprint arXiv:2410.18351(2024)

  3. [3]

    Amazon Web Services. 2025. Performance specifications for Amazon S3. https://docs.aws.amazon.com/AmazonS3/latest/ userguide/s3-files-performance.html. Accessed: 2026-04-16

  4. [4]

    Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, and Xi Chen

  5. [5]

    Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

    FastMTP: Accelerating LLM Inference with Enhanced Multi- Token Prediction. arXiv:2509.18362 [cs.LG] https://arxiv.org/ abs/2509.18362

  6. [6]

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. 2024. Pyramidkv: Dynamic kv cache compression based on pyramidal information fun- neling.arXiv preprint arXiv:2406.02069(2024)

  7. [7]

    Jinglin Chen, Qiwei Li, Zuchao Li, Baoyuan Qi, Liu Guoming, Haojun Ai, Hai Zhao, and Ping Wang. 2025. Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 18051–18062

  8. [8]

    2025.{IMPRESS}: An {Importance-Informed} {Multi-Tier} prefix {KV} storage system for large language model inference

    Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025.{IMPRESS}: An {Importance-Informed} {Multi-Tier} prefix {KV} storage system for large language model inference. In23rd USENIX Conference on File and Storage Technologies (FAST 25)

  9. [9]

    Alessio Devoto, Maximilian Jeblick, and Simon Jégou. 2025. Expected attention: Kv cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636(2025)

  10. [10]

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2024. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference.arXiv preprint arXiv:2407.11550(2024)

  11. [11]

    Runpeng Geng, Yanting Wang, Chenlong Yin, Minhao Cheng, Ying Chen, and Jinyuan Jia. 2025. PISanitizer: Preventing Prompt Injec- tion to Long-Context LLMs via Prompt Sanitization.arXiv preprint arXiv:2511.10720(2025)

  12. [12]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  13. [13]

    Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025. What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond. arXiv:2503.20589 [cs.SE] https:// arxiv.org/abs/2503.20589

  14. [14]

    LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. 2025. A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research(2025)

  15. [15]

    Horace He and Thinking Machines Lab. 2025. Defeating Nondeter- minism in LLM Inference. https://thinkingmachines.ai/blog/ defeating-nondeterminism-in-llm-inference/

  16. [16]

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Ma- honey, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation.Advances in Neural Information Processing Systems37 (2024), 1270–1303

  17. [17]

    Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al

  18. [18]

    Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads.arXiv preprint arXiv:2401.11181 (2024)

  19. [19]

    Simon Jegou and Maximilian Jeblick. 2026. KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.arXiv preprint arXiv:2601.07891(2026)

  20. [20]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

  21. [21]

    Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. GEAR: An efficient error reduction framework for KV cache compression in LLM inference. In Proc. NeurIPS, Vol. 262. 305–321

  22. [22]

    Jang-Hyun Kim, Dongyoon Han, and Sangdoo Yun. 2026. Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction.arXiv preprint arXiv:2601.17668(2026)

  23. [23]

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. 2025. Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416(2025)

  24. [24]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG] https: //arxiv.org/abs/2309.06180

  25. [25]

    Yuanyuan Lei and Ruihong Huang. 2025. Multi-document Sum- marization through Multi-document Event Relation Graph Reason- ing in LLMs: a case study in Framing Bias Mitigation. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Tahe...

  26. [26]

    Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. 2025. A Survey on Large Language Model Acceleration based on KV Cache Management. arXiv:2412.19442 [cs.AI] https://arxiv.org/abs/ 2412.19442

  27. [27]

    Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wu- long Liu, Yiwu Yao, Sinno Jialin Pan, and Mingxuan Yuan. 2025. Kv- tuner: Sensitivity-aware layer-wise mixed-precision kv cache quanti- zation for efficient and nearly lossless llm inference.arXiv preprint arXiv:2502.04420(2025)

  28. [28]

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen

  29. [29]

    Advances in Neural Information Processing Systems37 (2024), 22947– 22970

    Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems37 (2024), 22947– 22970

  30. [30]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. arXiv:2406.16858 [cs.CL] https://arxiv.org/abs/2406. 16858

  31. [31]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EA- GLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077 [cs.LG]https://arxiv.org/abs/2401.15077

  32. [32]

    Manlai Liang, JiaMing Zhang, Xiong Li, and Jinlong Li. 2025. LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important.arXiv preprint arXiv:2504.04704(2025)

  33. [33]

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2025. Qserve: W4a8kv4 quantization 14 and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems7 (2025)

  34. [34]

    Jingjing Liu, Silin Li, Zeming Liu, Zihao Cheng, Yuhang Guo, Yuan- fang Guo, Yunhong Wang, and Haifeng Wang. 2026. Towards multi- language repository-level code generation: From-scratch to guided tasks.Neurocomputing(2026), 133204

  35. [35]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang

  36. [36]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems(2023)

  37. [37]

    Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv:2306.03091 [cs.CL]https://arxiv.org/abs/2306.03091

  38. [38]

    Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. 2024. LongGen- Bench: Long-context Generation Benchmark. arXiv:2410.04199 [cs.CL] https://arxiv.org/abs/2410.04199

  39. [39]

    Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, and Alvin Che- ung. 2025. Speculative Decoding: Performance or Illusion?arXiv preprint arXiv:2601.11580(2025)

  40. [40]

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv:2510.09665 [cs.LG] https: //arxiv.org/abs/2510.09665

  41. [41]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, et al. 2024. Cachegen: Kv cache compression and stream- ing for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56

  42. [42]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. Kivi: A tuning- free asymmetric 2bit quantization for kv cache.arXiv preprint arXiv:2402.02750(2024)

  43. [43]

    LMCache. 2025. LMCache Agentic Traces. https://huggingface. co/datasets/sammshen/lmcache-agentic-traces

  44. [44]

    Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. 2024. RepoAgent: An LLM-Powered Open- Source Framework for Repository-level Code Documentation Gen- eration. arXiv:2402.16667 [cs.CL] https://arxiv.org/abs/2402. 16667

  45. [45]

    Mistral AI. 2025. Mistral Small 24B Instruct 2501. https: //huggingface.co/mistralai/Mistral-Small-24B-Instruct- 2501

  46. [46]

    NVIDIA Corporation. 2026. NemoClaw: Secure AI Agent Stack for OpenClaw. https://github.com/NVIDIA/NemoClaw. Accessed: 2026-04-01

  47. [47]

    OpenAI. 2026. Agents Guide. https://developers.openai.com/ api/docs/guides/agents. Accessed: 2026-04-01

  48. [48]

    Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Dong- han Yu, Jiawei Han, and Yelong Shen. 2024. Temperature-centric investigation of speculative decoding with knowledge distillation. In Findings of the Association for Computational Linguistics: EMNLP 2024. 13125–13137

  49. [49]

    Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent work- flows.arXiv preprint arXiv:2507.07400(2025)

  50. [50]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative LLM inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)

  51. [51]

    Joseph J Peper, Wenzhao Qiu, Ali Payani, and Lu Wang. 2025. Mdbench: A synthetic multi-document reasoning benchmark generated with knowledge guidance. InFindings of the Association for Computational Linguistics: ACL 2025. 25592–25621

  52. [52]

    Kimonas Provatas, Aris Karatzikos, Charalampos Koilakos, Michail Patsakis, Alexandros Tzanakakis, Akshatha Nayak, Ioannis Mouratidis, Evangelos Ioannis Avgoulas, and Ilias Georgakopoulos-Soares. 2026. Accelerating inference in genomic and proteomic foundation models via speculative decoding.bioRxiv(2026), 2026–01

  53. [53]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

  54. [54]

    RedHat AI. 2025. Llama-3.3-70B-Instruct-speculator.eagle3. https://huggingface.co/RedHatAI/Llama-3.3-70B-Instruct- speculator.eagle3

  55. [55]

    RedHat AI. 2025. Qwen3-32B-speculator.eagle3. https:// huggingface.co/RedHatAI/Qwen3-32B-speculator.eagle3

  56. [56]

    1994.An introduction to information theory

    Fazlollah M Reza. 1994.An introduction to information theory. Courier Corporation

  57. [57]

    Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, et al. 2024. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Gen- eration with Speculative Decoding.arXiv preprint arXiv:2408.11049 (2024)

  58. [58]

    Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. 2026. Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning. arXiv:2504.17192 [cs.CL] https://arxiv.org/ abs/2504.17192

  59. [59]

    Konrad Staniszewski and Adrian Łańcucki. 2025. KV Cache Trans- form Coding for Compact Storage in LLM Inference.arXiv preprint arXiv:2511.01815(2025)

  60. [60]

    Peter Steinberger. 2025. OpenClaw: Open-source autonomous AI agent. https://github.com/openclaw/openclaw. GitHub reposi- tory

  61. [61]

    Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, and Kehong Yuan. 2025. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383(2025)

  62. [62]

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2025. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Infer- ence. InProceedings of the 42nd International Conference on Machine Learning

  63. [63]

    Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, and Wenjie Zhang. 2025. HydraRAG: Structured Cross-Source Enhanced Large Language Model Reasoning. arXiv:2505.17464 [cs.CL] https://arxiv.org/abs/2505.17464

  64. [64]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-aware sparsity for efficient long- context llm inference.arXiv preprint arXiv:2406.10774(2024)

  65. [65]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  66. [66]

    Mahoney, Kurt Keutzer, and Amir Gholami

    Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Se- hoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache. InProceedings of the 42nd International Conference on Machine Learning

  67. [67]

    Yuhao Wu, Ming Shan Hee, Zhiqing Hu, and Roy Ka-Wei Lee. 2024. LongGenBench: Benchmarking Long-Form Generation in Long Con- text LLMs. arXiv:2409.02076 [cs.CL] https://arxiv.org/abs/2409. 02076 15

  68. [68]

    Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, and Minlan Yu. 2025. Shad- owServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching. arXiv:2509.16857 [cs.DC] https://arxiv.org/abs/2509. 16857

  69. [69]

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819(2024)

  70. [70]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453(2023)

  71. [71]

    Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, and Houqiang Li. 2025. DocR1: Evi- dence Page-Guided GRPO for Multi-Page Document Understanding. arXiv:2508.07313 [cs.CV]https://arxiv.org/abs/2508.07313

  72. [72]

    Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, and Bryan Catanzaro. 2025. From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models.arXiv preprint (2025)

  73. [73]

    Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2025. LLM. 265: Video Codecs are Secretly Tensor Codecs. InProceedings of the 58th IEEE/ACM Interna- tional Symposium on Microarchitecture. 445–460

  74. [74]

    Yichun Xu, Navjot K Khaira, and Tejinder Singh. 2026. KV Cache Optimization Strategies for Scalable and Efficient LLM Inference.arXiv preprint arXiv:2603.20397(2026)

  75. [75]

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. InFindings of the Association for Com- putational Linguistics: ACL 2024. 3258–3270

  76. [76]

    Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han

  77. [77]

    Lserve: Efficient long-sequence llm serving with unified sparse attention.Proceedings of Machine Learning and Systems7 (2025)

  78. [78]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems. 94–109

  79. [79]

    Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025. Turboquant: Online vector quantization with near-optimal distortion rate.arXiv preprint arXiv:2504.19874(2025)

  80. [80]

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan O Arik. [n. d.]. Chain of agents: Large language models collabo- rating on long-context tasks, 2024.URL https://arxiv. org/abs/2406.02818 3 ([n. d.])

Showing first 80 references.