VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
Pith reviewed 2026-05-19 22:05 UTC · model grok-4.3
The pith
VeriCache achieves identical outputs to full-KV-cache decoding at up to 4 times higher throughput by drafting with compressed caches and verifying in parallel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriCache uses the compressed KV cache to draft tokens then verifies those drafts against the full KV cache. It solves the resulting system challenge by parallelizing compressed-KV decoding with full-KV swapping, because the former is HBM-bandwidth-bound and the latter is PCIe- or network-bound, while the frequent similarity of compressed outputs to full outputs permits long drafting horizons that amortize each swap cost. The method applies uniformly to token-dropping and quantization compressors and composes with standard speculative decoding.
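The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration assuming greedy decoding and toy next-token functions; `toy_full`, `toy_draft`, and `draft_and_verify` are hypothetical names introduced here, not the paper's implementation. In a real system the verify phase would be a single batched forward pass over the full KV cache rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(prompt: List[int], next_token: NextToken, n: int) -> List[int]:
    """Reference: plain greedy decoding with one model."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out


def draft_and_verify(prompt: List[int], full_next: NextToken,
                     draft_next: NextToken, horizon: int, n: int) -> List[int]:
    """Draft `horizon` tokens cheaply, then keep only what the full model confirms."""
    out = list(prompt)
    target_len = len(prompt) + n
    while len(out) < target_len:
        # Draft phase: the cheap (compressed-KV) model runs ahead on its own guesses.
        ctx = list(out)
        draft = []
        for _ in range(horizon):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # Verify phase: every emitted token comes from the full model, so the
        # output cannot differ from full decoding; drafts only decide how many
        # tokens one verification round yields.
        for guess in draft:
            truth = full_next(out)
            out.append(truth)
            if truth != guess or len(out) == target_len:
                break
        else:
            # All drafts accepted: the verification also yields one extra token.
            out.append(full_next(out))
    return out


def toy_full(ctx: List[int]) -> int:
    """Hypothetical stand-in for full-KV greedy decoding."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical stand-in for compressed-KV decoding: wrong every 7th position."""
    t = toy_full(ctx)
    return (t + 1) % 50 if len(ctx) % 7 == 0 else t
```

For any drafting horizon, `draft_and_verify([3, 1, 4], toy_full, toy_draft, horizon, 40)` returns the same sequence as `greedy_decode([3, 1, 4], toy_full, 40)`; the horizon affects only how much work is wasted on rejected drafts.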
What carries the argument
Parallel drafting on compressed KV cache overlapped with full-KV swap-in, enabled by differing bandwidth bottlenecks and output similarity for long draft sequences.
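A toy timing model makes the overlap argument concrete. It assumes that drafting and swap-in use disjoint resources and can run fully concurrently; the function name and the example durations are illustrative assumptions, not measurements from the paper.

```python
def round_time(draft_s: float, swap_s: float, verify_s: float, overlap: bool) -> float:
    """Wall-clock time of one draft/swap/verify round under a simple model.

    Without overlap, drafting and swap-in are serialized. With overlap, the
    shorter of the two is hidden behind the longer one.
    """
    before_verify = max(draft_s, swap_s) if overlap else draft_s + swap_s
    return before_verify + verify_s


# Illustrative numbers only: 0.75 s drafting, 1.0 s swap-in, 0.25 s verification.
serial = round_time(0.75, 1.0, 0.25, overlap=False)    # 2.0
overlapped = round_time(0.75, 1.0, 0.25, overlap=True)  # 1.25
```

Under this model the swap is fully hidden only when drafting takes at least as long as the swap, which is why a long drafting horizon matters.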
If this is right
- The same framework works for both long-context decoding and remote prefix caching.
- Any token-dropping or quantization compressor can be plugged in through the uniform interface.
- Traditional speculative decoding can be layered on top for additional speedups.
- Identical outputs are guaranteed regardless of how far the generation proceeds.
Where Pith is reading between the lines
- Serving systems could reduce reliance on large amounts of high-bandwidth memory by keeping only compressed caches resident.
- Similar draft-and-verify patterns might help other lossy approximations inside model inference pipelines.
- Measuring divergence rates across different model families and tasks would show how often the long-horizon assumption holds in practice.
Load-bearing premise
Compressed and full KV outputs stay similar enough over many tokens to let each full-cache swap be amortized by a long drafting horizon.
What would settle it
A long output sequence where the compressed cache produces tokens that diverge from the full cache within a few steps, forcing short drafts and eliminating the throughput advantage.
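To see why early divergence would remove the advantage, consider a simplified model in which each drafted token is accepted independently with probability p. That independence is an assumption made here for illustration, and the paper may model acceptance differently. Under it, a round with horizon k emits (1 - p^(k+1)) / (1 - p) tokens in expectation, counting the one token the verification itself always contributes.

```python
def expected_tokens_per_round(p: float, horizon: int) -> float:
    """Expected tokens emitted per verification round.

    Assumes i.i.d. per-token acceptance with probability `p` (a modeling
    assumption) and counts the correction or bonus token from verification.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    if p == 1.0:
        return float(horizon + 1)  # every draft accepted, plus one bonus token
    # Geometric series: 1 + p + p^2 + ... + p^horizon
    return (1.0 - p ** (horizon + 1)) / (1.0 - p)
```

With p = 0.5 the expectation is capped at 1 / (1 - p) = 2 tokens per round no matter how long the horizon is, so each full-KV swap would be amortized over at most two tokens. With p close to 1 the yield grows almost linearly with the horizon, which is the regime the throughput claim depends on.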
Original abstract
The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache requires addressing a key system challenge to work: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents VeriCache, an inference framework that converts lossy KV cache compression methods into lossless LLM decoding. It uses the compressed KV cache to draft tokens via speculative-style decoding and verifies the drafts against the full KV cache to guarantee identical outputs. The key system insight is that compressed-KV decoding (HBM-bandwidth bound) can be overlapped with full-KV cache swapping (PCIe/network bound), and that output similarity often permits sufficiently long drafting horizons to amortize swap costs. The approach supports long-context decoding, remote prefix caching, a uniform interface for token-dropping and quantization compressors, and composition with conventional speculative decoding. Experiments are reported to deliver up to 4X throughput versus full-KV inference while producing identical outputs.
Significance. If the performance and correctness claims are substantiated, VeriCache would offer a practical way to retain the memory and bandwidth benefits of aggressive KV compression without sacrificing output fidelity, which is especially relevant for long-context serving and distributed prefix caching. The uniform compressor interface and explicit composition with existing speculative decoding are concrete engineering contributions that could be adopted broadly. The bandwidth-overlap insight is a systems-level strength that may generalize beyond the specific setting.
major comments (2)
- [Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.
- [Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.
minor comments (1)
- [Design] The description of the uniform compressor interface could be expanded with a short pseudocode or API sketch to clarify how new compression methods are integrated.
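As an illustration of the kind of API sketch this comment asks for, the following shows one way a uniform compressor interface could look. Everything here is hypothetical: the `Compressor` protocol, the `KeepRecent` and `UniformQuantizer` classes, and the list-of-lists cache representation are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Protocol

# Toy stand-in for real KV tensors: one vector of floats per cached token.
KV = List[List[float]]


class Compressor(Protocol):
    """Hypothetical uniform interface: full cache in, draft cache out."""

    def compress(self, kv: KV) -> KV: ...


class KeepRecent:
    """Token dropping: keep only the most recent `budget` tokens."""

    def __init__(self, budget: int) -> None:
        self.budget = budget

    def compress(self, kv: KV) -> KV:
        return kv[-self.budget:] if self.budget > 0 else []


class UniformQuantizer:
    """Quantization: clamp to [-1, 1] and round to a grid of `levels` values.

    The output keeps float type for simplicity; a real implementation would
    store small integer codes to actually save memory.
    """

    def __init__(self, levels: int) -> None:
        if levels < 2:
            raise ValueError("levels must be at least 2")
        self.levels = levels

    def compress(self, kv: KV) -> KV:
        step = 2.0 / (self.levels - 1)
        return [[round(max(-1.0, min(1.0, x)) / step) * step for x in row]
                for row in kv]


def build_draft_cache(kv: KV, compressor: Compressor) -> KV:
    """The drafting side depends only on the interface, not the method."""
    return compressor.compress(kv)
```

With such an interface, `build_draft_cache(kv, KeepRecent(2))` and `build_draft_cache(kv, UniformQuantizer(5))` go through the same call path, which is the property the comment asks the authors to make explicit.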
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of VeriCache's potential impact. We address each major comment point by point below and have revised the manuscript to provide the requested details and clarifications.
- Referee: [Abstract] Abstract: the central performance claim ('up to 4X higher throughput ... while producing identical outputs') rests on the assumption that compressed-KV drafts remain sufficiently similar to full-KV outputs for long horizons, yet the abstract itself notes that lossy methods cause outputs to 'increasingly diverge' and produce 'catastrophic failures' in code generation and tool calling. No quantitative acceptance-rate or horizon-length data are supplied for these divergence-prone regimes, which directly determines whether swap amortization can occur and whether the 4X figure is achievable.
  Authors: We agree that the abstract correctly identifies the divergence problem as motivation for the work. VeriCache guarantees identical outputs via verification regardless of similarity; however, throughput gains depend on sufficiently long drafting horizons to amortize swaps. The full manuscript reports acceptance rates and horizon statistics across workloads, including code generation and tool calling. In the revision we will add a dedicated table and accompanying text in Section 5 that explicitly quantifies average acceptance rates and drafting horizons for these divergence-prone tasks under the evaluated compressors, allowing readers to assess amortization directly.
  Revision: yes
- Referee: [Experimental results] Experimental results (as summarized in the abstract): the reported throughput gains lack any description of the models, workloads, hardware configuration, measurement methodology, or drafting-horizon statistics. Without these details it is impossible to evaluate whether the parallelization insight actually hides the full-KV swap latency under realistic conditions.
  Authors: We acknowledge that the abstract's condensed summary omits these details. The full manuscript contains Section 5 with descriptions of models (Llama-2-7B/13B, Mistral-7B), workloads (long-context QA, code generation, tool calling), hardware (A100/H100 GPUs with PCIe/NVLink), methodology (end-to-end tokens/s, per-phase latency breakdowns), and drafting-horizon/acceptance statistics. In the revision we will expand Section 5 with additional tables and a new subsection on bandwidth-overlap measurements to make all parameters and statistics explicit and reproducible.
  Revision: yes
Circularity Check
No circularity; claims rest on independent systems observations and experiments
The paper describes a speculative-decoding-style framework that uses compressed KV for drafting and full KV for verification to guarantee identical outputs. Throughput gains are attributed to measured bandwidth differences (HBM-bound drafting overlapping PCIe-bound swaps) and empirical output similarity allowing amortization; these are external observations about hardware constraints and workload behavior, not self-definitions, fitted parameters presented as predictions, or results that reduce to the paper's own inputs by construction. No equations, uniqueness theorems, or self-citation chains appear in the provided text that would force the central claims.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Compressed KV cache produces output similar enough to full KV cache to support a long drafting horizon that amortizes full-KV swaps.
- domain assumption: Compressed-KV decoding is HBM-bandwidth-bound while full-KV swap is PCIe/network-bound, enabling effective overlap.
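The second assumption can be checked with back-of-the-envelope arithmetic. The size formula below is the standard one for a per-sequence KV cache (keys plus values, per layer, per KV head). The model shape and bandwidth figures in the example are round placeholder values chosen for illustration, not numbers from the paper or from any hardware datasheet.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of one sequence's KV cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move `num_bytes` over a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 32768 tokens, fp16.
full_bytes = kv_cache_bytes(32, 8, 128, 32768)  # 4 GiB
compressed_bytes = full_bytes // 4              # assumed 4x compression ratio

# Placeholder bandwidths (assumptions): host link vs. on-device memory.
LINK_BW = 32e9  # bytes/s
HBM_BW = 2e12   # bytes/s

swap_s = transfer_seconds(full_bytes, LINK_BW)                   # one full-KV swap-in
draft_step_s = transfer_seconds(compressed_bytes, HBM_BW)        # one compressed-KV read
steps_hidden_by_swap = swap_s / draft_step_s                     # drafts that fit in one swap
```

Because the swap uses the host link and each drafting step reads from device memory, the two do not compete for the same bandwidth in this model, and `steps_hidden_by_swap` gives a rough sense of how long the drafting horizon must be to cover one swap.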
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache... compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.