VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Dongjoo Seo; Jiayi Yao; Junchen Jiang; Kuntai Du; Rui Zhang; Samuel Shen; Shan Lu; Shaoting Feng; Yuhan Liu; Yuyang Huang

arxiv: 2605.17613 · v1 · pith:G7JVWCF7new · submitted 2026-05-17 · 💻 cs.AR · cs.LG

Pith reviewed 2026-05-19 22:05 UTC · model grok-4.3

keywords KV cache compression · LLM inference · speculative decoding · lossless verification · throughput optimization · long context serving · cache swapping

The pith

VeriCache achieves identical outputs to full-KV-cache decoding at up to 4 times higher throughput by drafting with compressed caches and verifying in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to resolve the tension between fast but lossy KV cache compression and the need for exact outputs in LLM inference. Lossy methods like token dropping or quantization work for short generations but cause outputs to diverge over longer sequences, breaking tasks such as code generation. VeriCache keeps the full KV cache out of GPU memory and uses the compressed version only to draft candidate tokens, then swaps in the full cache for verification. The approach succeeds by running the drafting step in parallel with the swap, since drafting is limited by HBM bandwidth while swapping is limited by PCIe or network speed, and by using the similarity between compressed and full outputs to draft many tokens per swap. A reader would care because this removes the accuracy risk of compression without sacrificing the throughput gains needed for long-context serving.

Core claim

VeriCache uses the compressed KV cache to draft tokens then verifies those drafts against the full KV cache. It solves the resulting system challenge by parallelizing compressed-KV decoding with full-KV swapping, because the former is HBM-bandwidth-bound and the latter is PCIe- or network-bound, while the frequent similarity of compressed outputs to full outputs permits long drafting horizons that amortize each swap cost. The method applies uniformly to token-dropping and quantization compressors and composes with standard speculative decoding.
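The draft-then-verify idea in the core claim can be sketched as a minimal greedy loop. This is an illustration under stated assumptions, not the paper's implementation: `draft_next` and `full_next` are hypothetical stand-ins for next-token prediction with the compressed and full KV caches, and the verification is written as a plain loop where a real system would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a greedy next-token function over a token prefix.
NextToken = Callable[[Sequence[int]], int]


def draft_and_verify(
    prompt: Sequence[int],
    draft_next: NextToken,  # stands in for decoding with the compressed KV cache
    full_next: NextToken,   # stands in for decoding with the full KV cache
    max_new_tokens: int,
    draft_len: int,
) -> List[int]:
    """Greedy draft-then-verify; returns what full_next alone would produce."""
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        k = min(draft_len, target_len - len(out))
        # Draft k tokens with the cheap (lossy) predictor.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify each drafted position against the full predictor.
        for i in range(k):
            expected = full_next(out + draft[:i])
            if draft[i] != expected:
                # First mismatch: keep the matching prefix, substitute the
                # full predictor's token, and discard the rest of the draft.
                out.extend(draft[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched, so the whole draft is accepted.
            out.extend(draft)
    return out[len(prompt):]


# Toy predictors (assumptions for illustration only).
def toy_full(prefix: Sequence[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 7


def toy_lossy(prefix: Sequence[int]) -> int:
    # Agrees with toy_full except when the prefix length is a multiple of 5.
    return toy_full(prefix) if len(prefix) % 5 else (toy_full(prefix) + 1) % 7
```

Because every emitted token is either a drafted token that the full predictor confirmed or the full predictor's own token, the output matches full-only greedy decoding no matter how poor the drafts are; draft quality affects only how many tokens each round yields.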

What carries the argument

Parallel drafting on compressed KV cache overlapped with full-KV swap-in, enabled by differing bandwidth bottlenecks and output similarity for long draft sequences.
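A back-of-the-envelope timing model shows why the overlap matters. It assumes drafting and swapping contend for disjoint resources and can therefore run fully concurrently; the function names and every number below are hypothetical, not measurements from the paper.

```python
def round_time(draft_s: float, swap_s: float, verify_s: float, overlap: bool) -> float:
    """Wall-clock time of one draft/swap/verify round under a simple model."""
    # With overlap, the shorter of drafting and swapping is hidden behind the longer.
    front = max(draft_s, swap_s) if overlap else draft_s + swap_s
    return front + verify_s


def tokens_per_second(
    accepted_tokens: float,
    draft_s: float,
    swap_s: float,
    verify_s: float,
    overlap: bool,
) -> float:
    """Throughput of one round, given how many tokens that round yields."""
    return accepted_tokens / round_time(draft_s, swap_s, verify_s, overlap)


# Made-up timings: 0.25 s drafting, 0.25 s swap-in, 0.5 s verification.
serial = round_time(0.25, 0.25, 0.5, overlap=False)     # 1.0
overlapped = round_time(0.25, 0.25, 0.5, overlap=True)  # 0.75
```

In this model the swap is fully hidden only when drafting takes at least as long as the swap, which is one way to see why a long drafting horizon and the overlap work together.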

If this is right

The same framework works for both long-context decoding and remote prefix caching.
Any token-dropping or quantization compressor can be plugged in through the uniform interface.
Traditional speculative decoding can be layered on top for additional speedups.
Identical outputs are guaranteed regardless of how far the generation proceeds.
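The page does not show the paper's uniform compressor interface, so the following is only a guess at its general shape. `Compressor`, `KeepRecent`, `UniformQuantizer`, and `build_draft_cache` are hypothetical names, the KV cache is reduced to one float vector per token, and both compressors are toy stand-ins rather than any published method.

```python
from typing import List, Protocol

KV = List[List[float]]  # hypothetical stand-in: one float vector per cached token


class Compressor(Protocol):
    """Anything that maps a full KV cache to a smaller or coarser one."""

    def compress(self, kv: KV) -> KV: ...


class KeepRecent:
    """Toy token dropping: keep only the most recent `budget` tokens."""

    def __init__(self, budget: int) -> None:
        self.budget = budget

    def compress(self, kv: KV) -> KV:
        return kv[-self.budget:] if self.budget > 0 else []


class UniformQuantizer:
    """Toy quantization: snap every entry to a grid of spacing `step`."""

    def __init__(self, step: float) -> None:
        self.step = step

    def compress(self, kv: KV) -> KV:
        return [[round(x / self.step) * self.step for x in row] for row in kv]


def build_draft_cache(full_kv: KV, compressor: Compressor) -> KV:
    """The drafting path depends only on the interface, not on the method."""
    return compressor.compress(full_kv)


full_kv: KV = [[0.1, 0.9], [1.2, -0.4], [2.0, 0.5]]
dropped = build_draft_cache(full_kv, KeepRecent(budget=2))
quantized = build_draft_cache(full_kv, UniformQuantizer(step=0.5))
```

The design point is that the drafting code sees one method, `compress`, so swapping a token-dropping scheme for a quantizer requires no change on the verification side.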

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Serving systems could reduce reliance on large amounts of high-bandwidth memory by keeping only compressed caches resident.
Similar draft-and-verify patterns might help other lossy approximations inside model inference pipelines.
Measuring divergence rates across different model families and tasks would show how often the long-horizon assumption holds in practice.

Load-bearing premise

Compressed and full KV outputs stay similar enough over many tokens to let each full-cache swap be amortized by a long drafting horizon.

What would settle it

A long output sequence where the compressed cache produces tokens that diverge from the full cache within a few steps, forcing short drafts and eliminating the throughput advantage.
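To make the failure mode concrete, here is a small calculation of how many drafted tokens survive verification on average. It assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification and not something the page attributes to the paper; `expected_accepted` is a hypothetical helper.

```python
def expected_accepted(alpha: float, draft_len: int) -> float:
    """Expected count of drafted tokens accepted before the first mismatch."""
    # Under the independence assumption, P(at least i tokens accepted) = alpha**i,
    # and the expectation is the sum of those tail probabilities.
    return sum(alpha ** i for i in range(1, draft_len + 1))


# With alpha = 0.5, drafting longer barely helps: the expectation stays near 1.
short_draft = expected_accepted(0.5, 3)   # 0.875
long_draft = expected_accepted(0.5, 64)   # approaches 1.0
# With alpha = 1.0, every drafted token is kept.
perfect = expected_accepted(1.0, 8)       # 8.0
```

Under this model a low acceptance probability caps the useful draft length at roughly one token per swap, which is the regime where the throughput advantage would disappear.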

Figures

Figures reproduced from arXiv: 2605.17613 by Dongjoo Seo, Jiayi Yao, Junchen Jiang, Kuntai Du, Rui Zhang, Samuel Shen, Shan Lu, Shaoting Feng, Yuhan Liu, Yuyang Huang.

**Figure 2.** Code-generation failure from compressed KV.

**Figure 4.** Sequence-level KL, $\mathrm{KL}\big(p_{\text{full}}(x_{1:t}) \,\|\, p_{\text{lossy}}(x_{1:t})\big)$, grows roughly linearly in $t$ under KVzip 4× (left) and TurboQuant k4v3 (right), at temperature 0.5. The chain rule of KL divergence:

$$\mathrm{KL}_{1:T} \triangleq \mathrm{KL}\big(p_{\text{full}}(x_{1:T}) \,\|\, p_{\text{lossy}}(x_{1:T})\big) = \sum_{t=1}^{T} \mathbb{E}_{x_{<t} \sim p_{\text{full}}}\big[\mathrm{KL}_t\big]. \tag{2}$$

If per-step KL exceeds $\varepsilon > 0$, sequence-level KL grows linearly: $\mathrm{KL}_{1:T} \ge \varepsilon T$. Since $\mathrm{KL}_{1:T}$ equals $\mathbb{E}_{x \sim p_{\text{full}}} \log\big(p_{\text{full}}(x_{1:T}) / p_{\text{lossy}}(x_{1:T})\big)$, this means …

**Figure 5.** Overview of VeriCache. Tokens drafted with …

**Figure 7.** VeriCache’s two settings: long-context decod…

**Figure 8.** Acceptance rate (left) and ideal speedup (right) …

**Figure 9.** Acceptance length vs. draft length, comparing …

**Figure 10.** Composing VeriCache with Eagle. (Left) Ac…

**Figure 11.** Sustained decoding throughput on long …

**Figure 12.** End-to-end request latency vs. request rate on Pipeline 1 (top) and Pipeline 2 (bottom).

**Figure 13.** VeriCache’s speedup over Full KV: Pipeline 1, varying KV-cache budget and HBM/interconnect ratio

**Figure 14.** Quality (negative KL-divergence) vs. throughput on Pipeline 1 (top) and Pipeline 2 (bottom).

**Figure 15.** Additional token-dropping and quantization …

**Figure 16.** Quality vs. throughput: function-call accuracy (top), defense success rate (bottom).

**Figure 17.** Quality vs. throughput on Pipeline 1 long …
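The chain-rule identity quoted in the Figure 4 caption, equation (2), can be checked numerically on a toy model. The sketch below enumerates every sequence over a two-token vocabulary and compares the sequence-level KL with the sum of expected per-step KLs; `toy_full` and `toy_lossy` are made-up conditional distributions, not anything taken from the paper.

```python
import itertools
import math
from typing import Callable, Sequence, Tuple

Cond = Callable[[Tuple[int, ...]], Sequence[float]]  # prefix -> next-token probs


def seq_prob(cond: Cond, seq: Tuple[int, ...]) -> float:
    """Probability of a whole sequence as a product of conditionals."""
    p = 1.0
    for t, tok in enumerate(seq):
        p *= cond(seq[:t])[tok]
    return p


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_kl(p_full: Cond, p_lossy: Cond, vocab: int, T: int) -> float:
    """Left side of (2): KL between the two distributions over whole sequences."""
    total = 0.0
    for seq in itertools.product(range(vocab), repeat=T):
        pf = seq_prob(p_full, seq)
        if pf > 0:
            total += pf * math.log(pf / seq_prob(p_lossy, seq))
    return total


def summed_step_kl(p_full: Cond, p_lossy: Cond, vocab: int, T: int) -> float:
    """Right side of (2): sum over t of the expected per-step KL under p_full."""
    total = 0.0
    for t in range(T):
        for prefix in itertools.product(range(vocab), repeat=t):
            total += seq_prob(p_full, prefix) * kl(p_full(prefix), p_lossy(prefix))
    return total


def toy_full(prefix: Tuple[int, ...]) -> Sequence[float]:
    a = 0.7 if sum(prefix) % 2 == 0 else 0.4
    return (a, 1 - a)


def toy_lossy(prefix: Tuple[int, ...]) -> Sequence[float]:
    a = 0.6 if sum(prefix) % 2 == 0 else 0.5
    return (a, 1 - a)


lhs = sequence_kl(toy_full, toy_lossy, vocab=2, T=4)
rhs = summed_step_kl(toy_full, toy_lossy, vocab=2, T=4)
```

In this toy model every per-step KL is strictly positive, so the linear lower bound from the caption applies with $\varepsilon$ set to the smaller of the two per-step values.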

Original abstract

The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VeriCache gives a workable system for keeping lossy KV compression fast while forcing identical outputs via drafting plus verification and overlapped swaps, though the 4x claim rests on how long those drafts stay close in practice.

The paper builds directly on the known problem that token dropping and quantization drift further from full-KV outputs as generation lengthens and break code or tool tasks. Their fix is to run the compressed cache for drafting, then verify the drafted tokens against the full cache, while hiding the swap cost by running the two in parallel because one is HBM-bound and the other PCIe- or network-bound. They also supply a uniform interface so any compressor can plug in and show the approach works for remote prefix caching as well as on-device long-context decoding. It composes cleanly with ordinary speculative decoding too. That combination of ideas is the concrete advance. The experiments report up to 4x throughput with matching outputs, which is the kind of end-to-end number that matters for serving.

The soft spot is exactly the one the stress-test note flags. The whole speedup depends on the compressed drafts staying similar enough for a long enough horizon to amortize each full-KV swap. The abstract itself says lossy methods diverge more with length and fail on the very tasks that need accuracy. If acceptance rates drop after only a handful of tokens in code generation, the parallelization trick cannot hide the cost and throughput falls back toward baseline. The paper must have measured average draft lengths and swap frequency on those workloads; without seeing those numbers the 4x result is hard to judge.

This is aimed at systems people who run long-context models in production or build inference engines. Anyone who has tried KV compression and hit the accuracy wall will see the practical value. It deserves peer review because the core mechanism is grounded in real hardware constraints and the measurements, if they hold up under scrutiny, would be useful to the community.

Referee Report

2 major / 1 minor

Summary. The manuscript presents VeriCache, an inference framework that converts lossy KV cache compression methods into lossless LLM decoding. It uses the compressed KV cache to draft tokens via speculative-style decoding and verifies the drafts against the full KV cache to guarantee identical outputs. The key system insight is that compressed-KV decoding (HBM-bandwidth bound) can be overlapped with full-KV cache swapping (PCIe/network bound), and that output similarity often permits sufficiently long drafting horizons to amortize swap costs. The approach supports long-context decoding, remote prefix caching, a uniform interface for token-dropping and quantization compressors, and composition with conventional speculative decoding. Experiments are reported to deliver up to 4X throughput versus full-KV inference while producing identical outputs.

Significance. If the performance and correctness claims are substantiated, VeriCache would offer a practical way to retain the memory and bandwidth benefits of aggressive KV compression without sacrificing output fidelity, which is especially relevant for long-context serving and distributed prefix caching. The uniform compressor interface and explicit composition with existing speculative decoding are concrete engineering contributions that could be adopted broadly. The bandwidth-overlap insight is a systems-level strength that may generalize beyond the specific setting.

major comments (2)

[Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.
[Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.

minor comments (1)

[Design] The description of the uniform compressor interface could be expanded with a short pseudocode or API sketch to clarify how new compression methods are integrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of VeriCache's potential impact. We address each major comment point by point below and have revised the manuscript to provide the requested details and clarifications.

Referee: [Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.

Authors: We agree that the abstract correctly identifies the divergence problem as motivation for the work. VeriCache guarantees identical outputs via verification regardless of similarity; however, throughput gains depend on sufficiently long drafting horizons to amortize swaps. The full manuscript reports acceptance rates and horizon statistics across workloads, including code generation and tool calling. In the revision we will add a dedicated table and accompanying text in Section 5 that explicitly quantifies average acceptance rates and drafting horizons for these divergence-prone tasks under the evaluated compressors, allowing readers to assess amortization directly.

Revision: yes

Referee: [Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.

Authors: We acknowledge that the abstract's condensed summary omits these details. The full manuscript contains Section 5 with descriptions of models (Llama-2-7B/13B, Mistral-7B), workloads (long-context QA, code generation, tool calling), hardware (A100/H100 GPUs with PCIe/NVLink), methodology (end-to-end tokens/s, per-phase latency breakdowns), and drafting-horizon/acceptance statistics. In the revision we will expand Section 5 with additional tables and a new subsection on bandwidth-overlap measurements to make all parameters and statistics explicit and reproducible.

Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on independent systems observations and experiments

The paper describes a speculative-decoding-style framework that uses compressed KV for drafting and full KV for verification to guarantee identical outputs. Throughput gains are attributed to measured bandwidth differences (HBM-bound drafting overlapping PCIe-bound swaps) and empirical output similarity allowing amortization; these are external observations about hardware constraints and workload behavior, not self-definitions, fitted parameters presented as predictions, or results that reduce to the paper's own inputs by construction. No equations, uniqueness theorems, or self-citation chains appear in the provided text that would force the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about hardware bandwidth asymmetry and similarity of compressed versus full KV outputs; no free parameters or invented entities are introduced in the abstract.

axioms (2)

Domain assumption: Compressed KV cache produces output similar enough to full KV cache to support a long drafting horizon that amortizes full-KV swaps. This similarity assumption is required for the amortization argument in the abstract.
Domain assumption: Compressed-KV decoding is HBM-bandwidth-bound while full-KV swap is PCIe/network-bound, enabling effective overlap. This hardware assumption underpins the parallelization insight.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache... compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 13 internal anchors

[1]

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference.Proceedings of Machine Learning and Systems6 (2024), 114– 127

work page 2024
[2]

Sudhanshu Agrawal, Wonseok Jeon, and Mingu Lee. 2024. Adaedl: Early draft stopping for speculative decoding of large language models via an entropy-based lower bound on token acceptance probability. arXiv preprint arXiv:2410.18351(2024)

work page arXiv 2024
[3]

Amazon Web Services. 2025. Performance specifications for Amazon S3. https://docs.aws.amazon.com/AmazonS3/latest/ userguide/s3-files-performance.html. Accessed: 2026-04-16

work page 2025
[4]

Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, and Xi Chen

work page
[5]

Fastmtp: Accelerating llm inference with enhanced multi-token prediction.arXiv preprint arXiv:2509.18362, 2025

FastMTP: Accelerating LLM Inference with Enhanced Multi- Token Prediction. arXiv:2509.18362 [cs.LG] https://arxiv.org/ abs/2509.18362

work page arXiv
[6]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. 2024. Pyramidkv: Dynamic kv cache compression based on pyramidal information fun- neling.arXiv preprint arXiv:2406.02069(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Jinglin Chen, Qiwei Li, Zuchao Li, Baoyuan Qi, Liu Guoming, Haojun Ai, Hai Zhao, and Ping Wang. 2025. Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 18051–18062

work page 2025
[8]

2025.{IMPRESS}: An {Importance-Informed} {Multi-Tier} prefix {KV} storage system for large language model inference

Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025.{IMPRESS}: An {Importance-Informed} {Multi-Tier} prefix {KV} storage system for large language model inference. In23rd USENIX Conference on File and Storage Technologies (FAST 25)

work page 2025
[9]

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. 2025. Expected attention: Kv cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636(2025)

work page arXiv 2025
[10]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2024. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference.arXiv preprint arXiv:2407.11550(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Runpeng Geng, Yanting Wang, Chenlong Yin, Minhao Cheng, Ying Chen, and Jinyuan Jia. 2025. PISanitizer: Preventing Prompt Injec- tion to Long-Context LLMs via Prompt Sanitization.arXiv preprint arXiv:2511.10720(2025)

work page arXiv 2025
[12]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025. What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond. arXiv:2503.20589 [cs.SE] https:// arxiv.org/abs/2503.20589

work page arXiv 2025
[14]

LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. 2025. A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research(2025)

work page 2025
[15]

Horace He and Thinking Machines Lab. 2025. Defeating Nondeter- minism in LLM Inference. https://thinkingmachines.ai/blog/ defeating-nondeterminism-in-llm-inference/

work page 2025
[16]

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Ma- honey, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation.Advances in Neural Information Processing Systems37 (2024), 1270–1303

work page 2024
[17]

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al

work page
[18]

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads.arXiv preprint arXiv:2401.11181 (2024)

work page arXiv 2024
[19]

Simon Jegou and Maximilian Jeblick. 2026. KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.arXiv preprint arXiv:2601.07891(2026)

work page arXiv 2026
[20]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. 2024. GEAR: An efficient error reduction framework for KV cache compression in LLM inference. In Proc. NeurIPS, Vol. 262. 305–321

work page 2024
[22]

Jang-Hyun Kim, Dongyoon Han, and Sangdoo Yun. 2026. Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction.arXiv preprint arXiv:2601.17668(2026)

work page arXiv 2026
[23]

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. 2025. Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416(2025)

work page arXiv 2025
[24]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG] https: //arxiv.org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Yuanyuan Lei and Ruihong Huang. 2025. Multi-document Sum- marization through Multi-document Event Relation Graph Reason- ing in LLMs: a case study in Framing Bias Mitigation. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Tahe...

work page doi:10.18653/v1/2025.acl-long.1291 2025
[26]

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. 2025. A Survey on Large Language Model Acceleration based on KV Cache Management. arXiv:2412.19442 [cs.AI] https://arxiv.org/abs/ 2412.19442

work page arXiv 2025
[27]

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wu- long Liu, Yiwu Yao, Sinno Jialin Pan, and Mingxuan Yuan. 2025. Kv- tuner: Sensitivity-aware layer-wise mixed-precision kv cache quanti- zation for efficient and nearly lossless llm inference.arXiv preprint arXiv:2502.04420(2025)

work page arXiv 2025
[28]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen

work page
[29]

Advances in Neural Information Processing Systems37 (2024), 22947– 22970

Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems37 (2024), 22947– 22970

work page 2024
[30]

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. arXiv:2406.16858 [cs.CL] https://arxiv.org/abs/2406. 16858

work page arXiv 2024
[31]

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EA- GLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077 [cs.LG]https://arxiv.org/abs/2401.15077

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Manlai Liang, JiaMing Zhang, Xiong Li, and Jinlong Li. 2025. LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important.arXiv preprint arXiv:2504.04704(2025)

work page arXiv 2025
[33]

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2025. Qserve: W4a8kv4 quantization 14 and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems7 (2025)

work page 2025
[34]

Jingjing Liu, Silin Li, Zeming Liu, Zihao Cheng, Yuhang Guo, Yuan- fang Guo, Yunhong Wang, and Haifeng Wang. 2026. Towards multi- language repository-level code generation: From-scratch to guided tasks.Neurocomputing(2026), 133204

work page 2026
[35]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang

work page
[36]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems(2023)

work page 2023
[37]

Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv:2306.03091 [cs.CL]https://arxiv.org/abs/2306.03091

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. 2024. LongGen- Bench: Long-context Generation Benchmark. arXiv:2410.04199 [cs.CL] https://arxiv.org/abs/2410.04199

work page arXiv 2024
[39]

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, and Alvin Che- ung. 2025. Speculative Decoding: Performance or Illusion?arXiv preprint arXiv:2601.11580(2025)

work page arXiv 2025
[40]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv:2510.09665 [cs.LG] https: //arxiv.org/abs/2510.09665

work page arXiv 2025
[41]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, et al. 2024. Cachegen: Kv cache compression and stream- ing for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56

work page 2024
[42]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. Kivi: A tuning- free asymmetric 2bit quantization for kv cache.arXiv preprint arXiv:2402.02750(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

LMCache. 2025. LMCache Agentic Traces. https://huggingface. co/datasets/sammshen/lmcache-agentic-traces

work page 2025
[44]

Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. 2024. RepoAgent: An LLM-Powered Open- Source Framework for Repository-level Code Documentation Gen- eration. arXiv:2402.16667 [cs.CL] https://arxiv.org/abs/2402. 16667

work page arXiv 2024
[45]

Mistral AI. 2025. Mistral Small 24B Instruct 2501. https: //huggingface.co/mistralai/Mistral-Small-24B-Instruct- 2501

work page 2025
[46]

NVIDIA Corporation. 2026. NemoClaw: Secure AI Agent Stack for OpenClaw. https://github.com/NVIDIA/NemoClaw. Accessed: 2026-04-01

work page 2026
[47]

OpenAI. 2026. Agents Guide. https://developers.openai.com/ api/docs/guides/agents. Accessed: 2026-04-01

work page 2026
[48]

Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Dong- han Yu, Jiawei Han, and Yelong Shen. 2024. Temperature-centric investigation of speculative decoding with knowledge distillation. In Findings of the Association for Computational Linguistics: EMNLP 2024. 13125–13137

work page 2024
[49]

Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent work- flows.arXiv preprint arXiv:2507.07400(2025)

work page arXiv 2025
[50]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative LLM inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)

work page 2024
[51]

Joseph J Peper, Wenzhao Qiu, Ali Payani, and Lu Wang. 2025. Mdbench: A synthetic multi-document reasoning benchmark generated with knowledge guidance. InFindings of the Association for Computational Linguistics: ACL 2025. 25592–25621

work page 2025
[52]

Kimonas Provatas, Aris Karatzikos, Charalampos Koilakos, Michail Patsakis, Alexandros Tzanakakis, Akshatha Nayak, Ioannis Mouratidis, Evangelos Ioannis Avgoulas, and Ilias Georgakopoulos-Soares. 2026. Accelerating inference in genomic and proteomic foundation models via speculative decoding. bioRxiv (2026), 2026–01

[53]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage (2024)

[54]

RedHat AI. 2025. Llama-3.3-70B-Instruct-speculator.eagle3. https://huggingface.co/RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3

[55]

RedHat AI. 2025. Qwen3-32B-speculator.eagle3. https://huggingface.co/RedHatAI/Qwen3-32B-speculator.eagle3

[56]

Fazlollah M Reza. 1994. An introduction to information theory. Courier Corporation

[57]

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, et al. 2024. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. arXiv preprint arXiv:2408.11049 (2024)

[58]

Minju Seo, Jinheon Baek, Seongyun Lee, and Sung Ju Hwang. 2026. Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning. arXiv:2504.17192 [cs.CL] https://arxiv.org/abs/2504.17192

[59]

Konrad Staniszewski and Adrian Łańcucki. 2025. KV Cache Transform Coding for Compact Storage in LLM Inference. arXiv preprint arXiv:2511.01815 (2025)

[60]

Peter Steinberger. 2025. OpenClaw: Open-source autonomous AI agent. https://github.com/openclaw/openclaw. GitHub repository

[61]

Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, and Kehong Yuan. 2025. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383 (2025)

[62]

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2025. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. In Proceedings of the 42nd International Conference on Machine Learning

[63]

Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, and Wenjie Zhang. 2025. HydraRAG: Structured Cross-Source Enhanced Large Language Model Reasoning. arXiv:2505.17464 [cs.CL] https://arxiv.org/abs/2505.17464

[64]

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774 (2024)

[65]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

[66]

Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache. In Proceedings of the 42nd International Conference on Machine Learning

[67]

Yuhao Wu, Ming Shan Hee, Zhiqing Hu, and Roy Ka-Wei Lee. 2024. LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs. arXiv:2409.02076 [cs.CL] https://arxiv.org/abs/2409.02076

[68]

Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, and Minlan Yu. 2025. ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching. arXiv:2509.16857 [cs.DC] https://arxiv.org/abs/2509.16857

[69]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2024. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819 (2024)

[70]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)

[71]

Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, and Houqiang Li. 2025. DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding. arXiv:2508.07313 [cs.CV] https://arxiv.org/abs/2508.07313

[72]

Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, and Bryan Catanzaro. 2025. From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models. arXiv preprint (2025)

[73]

Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2025. LLM.265: Video Codecs are Secretly Tensor Codecs. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. 445–460

[74]

Yichun Xu, Navjot K Khaira, and Tejinder Singh. 2026. KV Cache Optimization Strategies for Scalable and Efficient LLM Inference. arXiv preprint arXiv:2603.20397 (2026)

[75]

Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. In Findings of the Association for Computational Linguistics: ACL 2024. 3258–3270

[76]

Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. 2025. Lserve: Efficient long-sequence llm serving with unified sparse attention. Proceedings of Machine Learning and Systems 7 (2025)

[78]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. In Proceedings of the twentieth European conference on computer systems. 94–109

[79]

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. 2025. Turboquant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874 (2025)

[80]

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan O Arik. [n. d.]. Chain of agents: Large language models collaborating on long-context tasks, 2024. URL https://arxiv.org/abs/2406.02818 3 ([n. d.])
