ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
Pith reviewed 2026-05-25 00:19 UTC · model grok-4.3
The pith
ObjectCache co-designs the object-storage protocol and the transfer schedule to deliver KV cache data in GPU consumption order, adding 5.6% latency over local DRAM for 64K contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ObjectCache co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. For 64K contexts, it adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2–1.8x compared with equal bandwidth sharing.
What carries the argument
The layerwise object-storage retrieval protocol and transfer scheduler that enforce consumption-order delivery from the object store to the inference engine.
If this is right
- KV cache capacity is no longer bounded by DRAM pool size, enabling larger shared prefixes without proportional hardware cost.
- Serving clusters can replace dedicated remote DRAM with commodity object storage while preserving competitive TTFT.
- The ordered scheduler improves TTFT under bandwidth contention compared with naive equal sharing.
- Layerwise overlap becomes feasible for contexts where compute time exceeds transfer time.
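The overlap condition in the last point can be illustrated with a toy timing model. This is a minimal sketch, not the paper's scheduler: the function names and the per-layer timings are hypothetical, and it assumes a single link that transfers layers back to back while each layer's compute waits only for its own KV data and for the previous layer to finish.

```python
from typing import Sequence


def layerwise_ttft(transfer_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Finish time when each layer's KV transfer overlaps earlier layers' compute.

    Transfers run back to back on one link. Layer i starts computing only
    after its KV data has arrived and layer i-1 has finished.
    """
    arrived = 0.0  # time at which the current layer's KV data is fully received
    done = 0.0     # time at which the previous layer's compute finished
    for t, c in zip(transfer_s, compute_s):
        arrived += t
        done = max(done, arrived) + c
    return done


def blocking_ttft(transfer_s: Sequence[float], compute_s: Sequence[float]) -> float:
    """Baseline: fetch the whole KV cache first, then compute every layer."""
    return sum(transfer_s) + sum(compute_s)


# Hypothetical per-layer timings in seconds (not measurements from the paper).
layers = 4
compute_bound = layerwise_ttft([0.010] * layers, [0.050] * layers)
transfer_bound = layerwise_ttft([0.010] * layers, [0.002] * layers)
no_overlap = blocking_ttft([0.010] * layers, [0.050] * layers)
```

In the compute-bound case only the first layer's transfer is exposed, so the total is close to pure compute time. In the transfer-bound case the link is the bottleneck and compute hides almost nothing, which matches the qualitative difference the page reports between 64K and 4K contexts.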
Where Pith is reading between the lines
- The same ordering principle could apply to other sequential data patterns in inference or training pipelines that use object backends.
- Performance gains would vary with network latency and object-store internals beyond the tested 100 Gbps RoCE setup.
- Future serving systems might treat object storage as a native cache tier rather than a last-resort fallback.
Load-bearing premise
The storage and network layers can sustain the required bandwidth and deliver data strictly in GPU consumption order without reordering overhead or contention that breaks the transfer-compute overlap.
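For a sense of scale behind this premise, the arithmetic below estimates how much KV data a 64K-token prefix implies and how long it would take to move at line rate. It is a rough sketch under stated assumptions: the page does not say which model was measured, so the shape used here (32 layers, 8 KV heads, head dimension 128, 2-byte elements, similar to Llama 3.1 8B) is an assumption, and protocol overhead on the 100 Gbps link is ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K and V stored per token, summed over all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed model shape (Llama-3.1-8B-like, fp16); not stated on this page.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
context_tokens = 64 * 1024
total_bytes = per_token * context_tokens

# 100 Gbps line rate converted to bytes per second, ignoring protocol overhead.
link_bytes_per_s = 100e9 / 8
transfer_s = total_bytes / link_bytes_per_s

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_bytes / 2**30:.0f} GiB for {context_tokens} tokens")
print(f"{transfer_s:.2f} s at line rate")
```

Under these assumptions the prefix occupies 8 GiB and needs roughly 0.69 s on the wire, so the overlap only works if compute over the same span takes at least about that long and the store sustains close to line rate.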
What would settle it
A test run with many concurrent long-context requests where measured TTFT exceeds the reported overheads because the object store cannot maintain exact delivery order or bandwidth drops below the level needed for overlap.
Original abstract
Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (e.g., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold. To preserve latency, current systems keep the KV cache in remote DRAM pools, increasing serving-cluster size and cost. In this paper, we explore a different approach: storing the KV cache in S3-compatible object storage so that capacity is no longer the constraint, while minimizing the impact on TTFT. We propose ObjectCache, which co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. We prototype ObjectCache on a 100 Gbps RoCE cluster with NIXL (an inference library that abstracts storage and memory), Ceph RGW (an Object Gateway for clusters), and DAOS (an open source storage system). For 64K contexts, common in today's systems, ObjectCache adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2–1.8x compared with equal bandwidth sharing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ObjectCache, a co-designed storage protocol and transfer scheduler for retrieving KV cache from S3-compatible object storage (Ceph RGW, DAOS) during LLM inference. The key idea is to emit KV cache bytes from the remote server in the exact layerwise order the GPU will consume them, enabling full overlap of network transfer with compute even for concurrent requests. On a 100 Gbps RoCE prototype, the system reports 5.6% added TTFT versus local DRAM for 64K contexts and 1.2–1.8× better TTFT under bandwidth caps than equal sharing; for 4K contexts the absolute overhead is 56–75 ms.
Significance. If the ordering guarantee and overlap hold under realistic contention, the result would allow KV cache capacity to be decoupled from expensive DRAM pools, materially lowering the cost of large-context serving while preserving acceptable latency. The direct prototype measurements (no fitted parameters) and explicit comparison to a layerwise local baseline are strengths.
major comments (3)
- [Prototype and transfer schedule description] The central performance claim (5.6% overhead for 64K contexts) rests on the assumption that the co-designed protocol delivers bytes strictly in GPU consumption order with negligible reordering or contention overhead. The manuscript provides no concrete description of the required server-side indexing, custom GET semantics, or client-side reassembly logic that would enforce this ordering on Ceph RGW or DAOS; standard object-storage range GETs do not supply such a guarantee. This directly affects whether the reported overlap is achievable.
- [Evaluation section] Quantitative results are presented without error bars, without stating the number of runs, without baseline implementation details (e.g., exact NIXL configuration or how the local layerwise baseline was realized), and without data-exclusion criteria. Because the 5.6% and 56–75 ms figures are the primary evidence for the overlap claim, these omissions make it impossible to judge statistical reliability or reproducibility.
- [Shared-bandwidth experiments] Under shared bandwidth the scheduler is claimed to reduce added TTFT by 1.2–1.8× versus equal sharing, yet no description is given of how the scheduler detects or reacts to contention, nor of the bandwidth cap values used in the experiment. This leaves the bandwidth-sharing result only partially supported.
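To make the first major comment concrete, the sketch below shows what a client could do with plain HTTP range GETs alone: issue requests in layer order against an object whose bytes are laid out layer by layer. This is not the paper's protocol. The layer-major layout, the function name, and the chunk size are hypothetical, and issuing requests in order does not by itself guarantee that the server returns bytes in that order, which is the gap the referee points to.

```python
from typing import Iterator, Tuple


def layer_ranges(num_layers: int, bytes_per_layer: int, chunk_bytes: int) -> Iterator[Tuple[int, str]]:
    """Yield (layer, HTTP Range header value) in the order the GPU consumes layers.

    Assumes a hypothetical layer-major object layout: layer 0's KV bytes come
    first, then layer 1's, and so on. HTTP byte ranges are inclusive.
    """
    for layer in range(num_layers):
        base = layer * bytes_per_layer
        for off in range(0, bytes_per_layer, chunk_bytes):
            end = min(off + chunk_bytes, bytes_per_layer) - 1
            yield layer, f"bytes={base + off}-{base + end}"


# Tiny illustrative sizes so the plan is easy to read.
plan = list(layer_ranges(num_layers=2, bytes_per_layer=10, chunk_bytes=4))
for layer, header in plan:
    print(layer, header)
```

Each header value could be sent as the `Range` header of a standard S3 GET. Anything stronger, such as server-side ordering across concurrent requests, would need the custom semantics the authors promise to document.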
minor comments (2)
- [Abstract] The abstract states results for “64K contexts” and “4K contexts” but does not define whether these are prompt lengths, total context lengths, or batch sizes; consistent terminology should be used throughout.
- [Figures and tables] Figure captions and table headers should explicitly state the hardware (100 Gbps RoCE, specific CPU/GPU models) and the exact workload parameters so that readers can interpret the numbers without returning to the text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We agree that additional implementation and methodological details are needed to strengthen the paper and will revise accordingly. We respond to each major comment below.
- Referee: [Prototype and transfer schedule description] The central performance claim (5.6% overhead for 64K contexts) rests on the assumption that the co-designed protocol delivers bytes strictly in GPU consumption order with negligible reordering or contention overhead. The manuscript provides no concrete description of the required server-side indexing, custom GET semantics, or client-side reassembly logic that would enforce this ordering on Ceph RGW or DAOS; standard object-storage range GETs do not supply such a guarantee. This directly affects whether the reported overlap is achievable.
  Authors: We agree the manuscript would benefit from greater detail on the protocol. The high-level co-design is described, but low-level server-side indexing, custom GET extensions, and client reassembly are not fully specified. In revision we will add Section 3.2 with concrete descriptions of the object metadata used for layer ordering, the extended range-GET semantics implemented for Ceph RGW and DAOS, and the NIXL client logic that queues requests to enforce consumption order. This will make explicit how the ordering guarantee is provided beyond standard range GETs.
  Revision: yes
- Referee: [Evaluation section] Quantitative results are presented without error bars, without stating the number of runs, without baseline implementation details (e.g., exact NIXL configuration or how the local layerwise baseline was realized), and without data-exclusion criteria. Because the 5.6% and 56–75 ms figures are the primary evidence for the overlap claim, these omissions make it impossible to judge statistical reliability or reproducibility.
  Authors: The referee correctly identifies these omissions. We will revise the evaluation section to report: error bars as standard deviation across 10 runs per data point; the precise NIXL version and configuration flags; a description of the local layerwise baseline (identical NIXL scheduler with local DRAM backend); and confirmation that no measurements were excluded. These changes will support reproducibility of the 5.6% and 56–75 ms results.
  Revision: yes
- Referee: [Shared-bandwidth experiments] Under shared bandwidth the scheduler is claimed to reduce added TTFT by 1.2–1.8× versus equal sharing, yet no description is given of how the scheduler detects or reacts to contention, nor of the bandwidth cap values used in the experiment. This leaves the bandwidth-sharing result only partially supported.
  Authors: We will expand Section 5.3 with the missing details. The scheduler detects contention via NIXL telemetry on per-request progress and instantaneous available bandwidth, then reacts by re-prioritizing layer chunks and issuing smaller requests. The bandwidth caps tested were 25 Gbps, 50 Gbps, and 75 Gbps on the 100 Gbps RoCE link. Pseudocode for the contention reaction logic will also be added.
  Revision: yes
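Why equal sharing can lose under a bandwidth cap can be sketched with a toy allocation model. This is not the authors' scheduler, whose contention logic is not described on this page. It swaps in a standard max-min fair (water-filling) allocation, and it assumes that each request has a zero-stall rate equal to its per-layer bytes divided by its per-layer compute time, which is one reading of the r*_i = s_i / c_i expression quoted further down; all numbers are hypothetical.

```python
from typing import List


def water_fill(demands: List[float], cap: float) -> List[float]:
    """Max-min fair allocation of `cap`; no request gets more than its demand."""
    alloc = [0.0] * len(demands)
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = cap
    while active:
        share = remaining / len(active)
        i = active[0]
        if demands[i] <= share:
            # Smallest demand fits: satisfy it and redistribute the leftover.
            alloc[i] = demands[i]
            remaining -= demands[i]
            active.pop(0)
        else:
            # Nobody left can be fully satisfied: split what remains evenly.
            for j in active:
                alloc[j] = share
            break
    return alloc


def stall_ms(layer_mb: float, compute_ms: float, rate_mb_per_ms: float) -> float:
    """Per-layer time the GPU waits for data at the given transfer rate."""
    return max(0.0, layer_mb / rate_mb_per_ms - compute_ms)


# Two hypothetical requests: (per-layer MB, per-layer compute ms).
requests = [(100.0, 50.0), (100.0, 10.0)]
cap = 10.0  # shared cap in MB/ms (numerically equal to GB/s)

zero_stall = [s / c for s, c in requests]  # assumed reading of r*_i = s_i / c_i
equal = [cap / len(requests)] * len(requests)
fair = water_fill(zero_stall, cap)

equal_stalls = [stall_ms(s, c, r) for (s, c), r in zip(requests, equal)]
fair_stalls = [stall_ms(s, c, r) for (s, c), r in zip(requests, fair)]
```

With these numbers, equal sharing hands the first request more bandwidth than it can use while the second stalls 10 ms per layer. Capping the first request at its zero-stall rate frees bandwidth for the second and cuts its stall to 2.5 ms per layer. The size of the gain in this toy depends entirely on the chosen inputs and is not comparable to the measured 1.2–1.8× figure.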
Circularity Check
No circularity; results are direct prototype measurements
The paper presents ObjectCache as a co-designed storage protocol and scheduler for KV cache retrieval from object storage, with all performance claims (5.6% latency overhead for 64K contexts, 1.2-1.8x improvement under bandwidth caps) derived from direct measurements on a physical 100 Gbps RoCE prototype using Ceph RGW and DAOS. No equations, fitted parameters, predictions, or derivation chains appear in the provided text; the central claim rests on empirical overlap of transfer and compute rather than any self-referential reduction or self-citation load-bearing step. This is a standard systems paper whose results are externally falsifiable via replication on the described hardware.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ObjectCache co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Stall-opt problem (Equation 5) and calibrated zero-stall rate r*_i = s_i / c_i
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.