Efficient Remote KV Cache Reuse with GPU-native Video Codec
Pith reviewed 2026-05-16 05:18 UTC · model grok-4.3
The pith
GPU video codecs enable remote KV cache reuse for LLMs by compressing KV tensors into compact video formats, reducing TTFT by up to 3.51× while preserving lossless accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KVCodec achieves effective KV cache coding with two techniques: a codec-friendly tensor layout that compresses KV caches into highly compact video formats for fast transmission, and an efficient KV fetcher that orchestrates transmission, decoding, and restoration in a pipelined manner to eliminate resource contention, mask network fluctuations, and minimize TTFT. Compared to state-of-the-art methods, it reduces TTFT by up to 3.51× while preserving lossless accuracy.
What carries the argument
The codec-friendly tensor layout that rearranges KV cache tensors to match the input requirements of GPU video codecs for high-ratio compression with negligible overhead.
Load-bearing premise
KV cache tensors admit a layout that lets GPU video codecs deliver high compression ratios with no information loss for the downstream LLM computation.
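The premise hinges on an invertible mapping from floating-point tensors to codec-friendly byte planes. A minimal sketch of the kind of layout this requires (the function names, shapes, and byte-plane scheme are illustrative assumptions, not the paper's actual layout): split each fp16 value into two 8-bit planes, which a lossless codec could compress and reproduce bit-exactly.

```python
import numpy as np

def kv_to_planes(kv: np.ndarray) -> np.ndarray:
    """View an fp16 KV tensor as raw bytes and split it into two 8-bit
    planes (low/high byte on little-endian platforms) -- a frame-like
    layout a lossless codec can reproduce bit-exactly."""
    raw = kv.astype(np.float16).view(np.uint8)           # 2 bytes per value
    return raw.reshape(*kv.shape, 2).transpose(-1, *range(kv.ndim)).copy()

def planes_to_kv(planes: np.ndarray) -> np.ndarray:
    """Invert kv_to_planes: interleave the byte planes back into fp16."""
    raw = planes.transpose(*range(1, planes.ndim), 0).copy()
    return raw.view(np.float16).reshape(planes.shape[1:])

# The round trip must be bit-exact for the "lossless" premise to hold.
kv = np.random.randn(4, 8, 128, 64).astype(np.float16)  # layers, heads, seq, dim
assert np.array_equal(kv, planes_to_kv(kv_to_planes(kv)))
```

Any lossy stage in the codec pipeline (chroma subsampling, quantization) would break the final assertion, which is why the layout and the codec's lossless mode are jointly load-bearing.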
What would settle it
Measure TTFT and output-token identity for identical LLM queries over a 1 Gbps link using KVCodec versus prior compressed and uncompressed baselines; the claim fails if the measured speedup falls short of 3.51× or any output token differs.
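The acceptance test above can be stated as a single predicate over measured TTFTs and output token IDs. A sketch (the function and all numbers are hypothetical, not measurements from the paper):

```python
def settles_claim(ttft_baseline_s, ttft_kvcodec_s,
                  tokens_baseline, tokens_kvcodec,
                  claimed_speedup=3.51):
    """Hypothetical acceptance check for the two headline claims:
    the measured TTFT speedup reaches the claimed factor AND the
    output token IDs match exactly (the operational sense of
    'lossless accuracy')."""
    speedup = ttft_baseline_s / ttft_kvcodec_s
    exact_match = tokens_baseline == tokens_kvcodec
    return speedup >= claimed_speedup and exact_match

# Illustrative numbers only: a 4x speedup with identical tokens passes;
# a single differing token fails regardless of speedup.
assert settles_claim(1.6, 0.4, [101, 7, 42], [101, 7, 42])
assert not settles_claim(1.6, 0.4, [101, 7, 42], [101, 7, 41])
```

In practice the TTFT side would need repeated runs and error bars, since a one-shot ratio over a fluctuating 1 Gbps link is not a stable estimate.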
Original abstract
Remote KV cache reuse fetches KV cache for identical contexts from remote storage, avoiding recomputation, accelerating LLM inference. While it excels in high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the KV reuse benefits. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVCodec, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache in a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in an efficient pipelined manner, eliminating resource contention, masking network fluctuations, and achieving minimum time-to-first-token (TTFT). We prototype KVCodec on diverse GPUs from high- to low-end. Experiments reveal that it reduces TTFT by up to 3.51 times while maintaining lossless accuracy, compared to SOTA methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KVCodec for remote KV cache reuse in LLM inference, leveraging GPU-native video codecs via two techniques: a codec-friendly tensor layout that compresses KV caches into compact video format for fast transmission, and an efficient KV fetcher that pipelines transmission, decoding, and restoration to minimize TTFT. Experiments on diverse GPUs claim up to 3.51× TTFT reduction versus SOTA methods while maintaining lossless accuracy.
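The pipelined fetcher described above overlaps the three stages so network time hides behind decode and restore time. A minimal queue-and-thread sketch of that pattern (the stage functions stand in for network transfer, codec decode, and GPU restoration; none of this is KVCodec's actual implementation):

```python
import queue
import threading

def run_stage(fn, src, dst):
    """Pull items from src, apply this stage's work, push to dst.
    None is the shutdown sentinel, forwarded downstream."""
    while (item := src.get()) is not None:
        dst.put(fn(item))
    dst.put(None)

def pipelined_fetch(chunks, transmit, decode, restore):
    """Three-stage pipeline: while chunk i is being restored, chunk i+1
    can be decoding and chunk i+2 in flight, masking transfer latency.
    Bounded queues provide backpressure between stages."""
    q0, q1, q2, q3 = (queue.Queue(maxsize=2) for _ in range(4))
    threads = [threading.Thread(target=run_stage, args=a)
               for a in [(transmit, q0, q1), (decode, q1, q2), (restore, q2, q3)]]
    for t in threads:
        t.start()
    for c in chunks:
        q0.put(c)
    q0.put(None)
    results = []
    while (r := q3.get()) is not None:
        results.append(r)
    for t in threads:
        t.join()
    return results

# Toy stages standing in for network, codec, and restoration work.
out = pipelined_fetch([1, 2, 3], lambda x: x, lambda x: x * 10, lambda x: x + 1)
assert out == [11, 21, 31]
```

One thread per stage with FIFO queues preserves chunk order, so restored KV blocks arrive in the sequence the prefill expects.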
Significance. If the central claims hold, KVCodec could make remote KV reuse practical in bandwidth-limited networks by delivering substantial TTFT gains without accuracy loss, using widely available GPU video codecs and avoiding heavyweight decompression overheads.
Major comments (2)
- §3 (codec-friendly tensor layout): The claim that this layout enables high-ratio compression with zero accuracy impact via GPU video codecs is load-bearing for the TTFT and lossless-accuracy assertions, yet the manuscript supplies no quantitative validation, such as per-layer PSNR, exact-match rates on reconstructed tensors, attention-score deltas, or perplexity changes across model families and context lengths.
- §5 (evaluation): The headline 3.51× TTFT reduction and 'lossless accuracy' results are reported without error bars, detailed baseline implementations, a full experimental methodology, or hardware-specific configurations, making it impossible to assess reproducibility or whether the gains survive network variability.
Minor comments (2)
- Abstract: The phrase 'diverse GPUs from high- to low-end' is used without naming the specific models or memory capacities tested.
- §3 (notation): The paper should explicitly define how the KV tensor dimensions are reshaped into video-frame format (e.g., the channel, height, and width mappings) so that readers can reproduce the layout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and completeness.
Point-by-point responses
Referee: §3 (codec-friendly tensor layout): The claim that this layout enables high-ratio compression with zero accuracy impact via GPU video codecs is load-bearing for the TTFT and lossless-accuracy assertions, yet the manuscript supplies no quantitative validation, such as per-layer PSNR, exact-match rates on reconstructed tensors, attention-score deltas, or perplexity changes across model families and context lengths.
Authors: We agree that explicit quantitative validation for reconstruction fidelity is necessary to support the lossless claim. While end-to-end accuracy is preserved in our experiments, we did not report intermediate metrics. In the revised manuscript we will add per-layer PSNR values, exact-match rates on reconstructed KV tensors, attention-score deltas, and perplexity measurements across multiple model families and context lengths to substantiate the zero-accuracy-impact assertion. Revision: yes.
Referee: §5 (evaluation): The headline 3.51× TTFT reduction and 'lossless accuracy' results are reported without error bars, detailed baseline implementations, a full experimental methodology, or hardware-specific configurations, making it impossible to assess reproducibility or whether the gains survive network variability.
Authors: We acknowledge that the evaluation section lacks sufficient detail for full reproducibility. The manuscript summarizes results but omits error bars, baseline implementation specifics, and hardware/network configurations. In the revision we will expand §5 with error bars from repeated runs, detailed baseline descriptions, complete hardware setups for each tested GPU, and additional experiments or analysis addressing network variability. Revision: yes.
Circularity Check
No circularity; empirical TTFT gains measured on prototypes, not derived from self-referential equations or fits.
Full rationale
The paper describes an engineering system (KVCodec) whose headline claims rest on direct prototype measurements across GPU tiers rather than any derivation chain. The abstract and provided text contain no equations, fitted parameters, uniqueness theorems, or ansatzes that could reduce a 'prediction' to its own inputs. The two techniques (codec-friendly layout and pipelined fetcher) are presented as implementation choices whose effectiveness is validated by experiment, not by construction. No self-citations are used to justify core premises. This is the common case of a measurement-driven systems paper whose results remain externally falsifiable.