ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

Aditya Dhakal; Dejan Milojicic; Gustavo Alonso; Yunming Xiao; Yu Zhu

arxiv: 2605.22850 · v1 · pith:JG4JPZIXnew · submitted 2026-05-16 · 💻 cs.DC · cs.AI

Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso

Pith reviewed 2026-05-25 00:19 UTC · model grok-4.3

keywords KV cache · object storage · LLM serving · prefix caching · distributed storage · transfer scheduling · inference optimization

The pith

ObjectCache co-designs the object-storage protocol and the transfer schedule to deliver KV cache data in GPU consumption order, adding only 5.6% latency over local DRAM for 64K contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large KV caches for shared prefixes in LLM serving often exceed GPU and local DRAM capacity, forcing systems to use expensive remote DRAM pools. The paper instead stores these caches in scalable S3-compatible object storage while aiming to keep time-to-first-token low. ObjectCache achieves low overhead by aligning the storage server's delivery order exactly with the sequence in which the GPU will consume the data during layerwise inference. This alignment lets data transfers overlap with compute across concurrent requests. The result is a system that decouples cache capacity from local memory size without large latency penalties for long contexts.

Core claim

ObjectCache co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. For 64K contexts, it adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2–1.8× compared with equal bandwidth sharing.
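To see why longer contexts hide more of the transfer cost, the following minimal Python sketch models TTFT under chunkwise and layerwise delivery. It is an illustration under simplifying assumptions, not the paper's implementation: per-layer transfer and compute times are fixed inputs, transfers run back-to-back on a single link, and the function names are hypothetical.

```python
from typing import Sequence


def chunkwise_ttft(transfer_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """All layers are fetched before prefill starts, so the costs add up."""
    return sum(transfer_ms) + sum(compute_ms)


def layerwise_ttft(transfer_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Layer i computes once its KV data has arrived and layer i-1 has finished."""
    arrived = 0.0  # time at which the current layer's KV data is fully delivered
    done = 0.0     # time at which the previous layer finished computing
    for t, c in zip(transfer_ms, compute_ms):
        arrived += t                   # transfers are serialized on the link
        done = max(done, arrived) + c  # wait (stall) if the data is late
    return done


if __name__ == "__main__":
    layers = 32
    # Hypothetical long-context case: compute per layer exceeds transfer per layer.
    print(chunkwise_ttft([1.0] * layers, [4.0] * layers))  # 160.0
    print(layerwise_ttft([1.0] * layers, [4.0] * layers))  # 129.0
    # Hypothetical short-context case: transfer dominates and appears as stall.
    print(chunkwise_ttft([1.0] * layers, [0.5] * layers))  # 48.0
    print(layerwise_ttft([1.0] * layers, [0.5] * layers))  # 32.5
```

In the first case only the first layer's transfer is exposed; in the second, every layer waits on the link, which mirrors the qualitative gap between the 64K and 4K results.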

What carries the argument

The layerwise object-storage retrieval protocol and transfer scheduler that enforce consumption-order delivery from the object store to the inference engine.

If this is right

KV cache capacity is no longer bounded by DRAM pool size, enabling larger shared prefixes without proportional hardware cost.
Serving clusters can replace dedicated remote DRAM with commodity object storage while preserving competitive TTFT.
Under shared bandwidth caps, the scheduler reduces added TTFT by 1.2–1.8× compared with naive equal sharing.
Layerwise overlap becomes feasible for contexts where compute time exceeds transfer time.
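The overlap condition in the last point can be checked with a back-of-the-envelope calculation. This sketch assumes a fixed per-layer payload and a measured per-layer compute time, both hypothetical inputs here, and computes the link rate at which one layer's KV data arrives within one layer of compute.

```python
def required_gbps(layer_bytes: int, layer_compute_ms: float) -> float:
    """Link rate (Gbit/s) needed to deliver one layer's payload within one layer of compute."""
    if layer_compute_ms <= 0:
        raise ValueError("layer_compute_ms must be positive")
    bits = layer_bytes * 8
    seconds = layer_compute_ms / 1000.0
    return bits / seconds / 1e9


def is_compute_bound(layer_bytes: int, layer_compute_ms: float, available_gbps: float) -> bool:
    """True when the available rate is enough to hide the transfer behind compute."""
    return required_gbps(layer_bytes, layer_compute_ms) <= available_gbps


if __name__ == "__main__":
    payload = 256 * 1024 * 1024  # hypothetical 256 MiB of KV data per layer
    for compute_ms in (40.0, 10.0):  # hypothetical per-layer compute times
        rate = required_gbps(payload, compute_ms)
        print(compute_ms, round(rate, 1), is_compute_bound(payload, compute_ms, 100.0))
```

Configurations whose required rate stays below the achievable transfer throughput are compute-bound; above it, the shortfall shows up as per-layer stall.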

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ordering principle could apply to other sequential data patterns in inference or training pipelines that use object backends.
Performance gains would vary with network latency and object-store internals beyond the tested 100 Gbps RoCE setup.
Future serving systems might treat object storage as a native cache tier rather than a last-resort fallback.

Load-bearing premise

The storage and network layers can sustain the required bandwidth and deliver data strictly in GPU consumption order without reordering overhead or contention that breaks the transfer-compute overlap.

What would settle it

A test run with many concurrent long-context requests where measured TTFT exceeds the reported overheads because the object store cannot maintain exact delivery order or bandwidth drops below the level needed for overlap.

Figures

Figures reproduced from arXiv: 2605.22850 by Aditya Dhakal, Dejan Milojicic, Gustavo Alonso, Yunming Xiao, Yu Zhu.

**Figure 1.** Long-context LLM tasks increasingly reuse long-lived prefixes, growing the aggregate KV cache footprint that a serving cluster must retain.

**Figure 2.** Per-layer KV payload for a 16-token chunk across recent open-weight LLM families. The 64 KB dashed line marks the grouped-query attention (GQA) baseline with 8 KV heads of 128 dimensions; multi-head latent attention (MLA) and smaller head counts push recent models below this threshold.
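The 64 KB baseline in Figure 2 can be reproduced with a short calculation. The sketch below assumes 2-byte elements (fp16 or bf16) and counts both keys and values; the function name and the element width are assumptions, chosen because they are consistent with the 64 KB figure stated in the caption.

```python
def kv_payload_bytes(tokens: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-layer KV payload: keys and values for every token and KV head."""
    keys_and_values = 2
    return tokens * kv_heads * head_dim * keys_and_values * dtype_bytes


if __name__ == "__main__":
    # GQA baseline from the caption: 16-token chunk, 8 KV heads of 128 dimensions.
    print(kv_payload_bytes(16, 8, 128))          # 65536 bytes
    print(kv_payload_bytes(16, 8, 128) // 1024)  # 64 KB
```

Fewer KV heads or a compressed latent representation shrink this product, which is why such models fall below the dashed line.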

**Figure 3.** Prefix reuse under fine and coarse storage granularities. Coarse chunks reduce index depth but lose branch points where requests can diverge, forcing otherwise reusable tokens to be recomputed.

**Figure 4.** Prefix-hash lookup cost is small relative to tokenization, even at 16-token granularity. Fine-grained indexing is therefore not the request critical-path bottleneck.
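The granularity trade-off behind Figures 3 and 4 can be illustrated with a chained prefix hash. This is a minimal sketch of a common prefix-caching scheme, not the paper's index: the hashing layout, the function names, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import List, Set


def chunk_hashes(tokens: List[int], g: int) -> List[str]:
    """Hash each full chunk of g tokens, chaining in the previous chunk's digest."""
    hashes: List[str] = []
    prev = b""
    for start in range(0, len(tokens) - g + 1, g):
        chunk = tokens[start:start + g]
        digest = hashlib.sha256(prev + ",".join(map(str, chunk)).encode()).digest()
        hashes.append(digest.hex())
        prev = digest  # chaining makes each hash identify the whole prefix
    return hashes


def reusable_tokens(tokens: List[int], cached: Set[str], g: int) -> int:
    """Count leading tokens whose chunk hashes are all present in the cache."""
    count = 0
    for h in chunk_hashes(tokens, g):
        if h not in cached:
            break
        count += g
    return count


if __name__ == "__main__":
    stored = list(range(300))                   # a previously served prompt
    request = list(range(200)) + [9999] * 100   # shares the first 200 tokens
    for g in (16, 256):
        cache = set(chunk_hashes(stored, g))
        print(g, reusable_tokens(request, cache, g))  # 16 -> 192, 256 -> 0
```

With 16-token chunks the request reuses 192 of the 200 shared tokens; with 256-token chunks the divergence falls inside the first chunk and nothing is reused.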

**Figure 5.** ObjectCache in a disaggregated cluster. Prefix KV caches are stored in an object-storage tier through an S3-compatible interface, decoupling prefill and decode workers from the machines that produced the cache. ObjectCache extends this interface so object storage can serve KV cache reuse with the granularity and layerwise delivery order expected by LLM serving systems.

**Figure 6.** ObjectCache system design. HTTP preserves S3-compatible control, while the storage server aggregates matched chunk ranges into layerwise KV payloads and delivers them to the serving node over RDMA.
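The aggregation step described in Figure 6 amounts to regrouping chunk-major data into layer-major payloads. The sketch below shows that regrouping in isolation; the in-memory layout (each chunk as a list of per-layer byte strings) and the function name are hypothetical and say nothing about how the prototype stores objects.

```python
from typing import Iterator, Sequence


def layerwise_payloads(chunks: Sequence[Sequence[bytes]]) -> Iterator[bytes]:
    """Yield one payload per layer, concatenating that layer's slice from every chunk.

    chunks[i][l] is assumed to hold the KV bytes of chunk i for layer l.
    """
    if not chunks:
        return
    num_layers = len(chunks[0])
    if any(len(chunk) != num_layers for chunk in chunks):
        raise ValueError("all chunks must cover the same number of layers")
    for layer in range(num_layers):
        # Layer 0 is emitted first because the GPU consumes it first.
        yield b"".join(chunk[layer] for chunk in chunks)


if __name__ == "__main__":
    matched = [[b"a0", b"a1"], [b"b0", b"b1"], [b"c0", b"c1"]]
    print(list(layerwise_payloads(matched)))  # [b'a0b0c0', b'a1b1c1']
```

Emitting a generator rather than a full list reflects the intent that the first layer can be sent before later layers are assembled.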

**Figure 7.** KV cache fetch scheduling. Chunkwise delivery serializes cache loading before prefill, while layerwise delivery exposes per-layer readiness so transfer can overlap compute when bandwidth is sufficient and otherwise appears as per-layer stall.

**Figure 8.** Raw object-storage interface baseline. The gray region marks throughput above the 100 Gbps link capacity; points in this region are limited by local storage and host execution rather than the network.

**Figure 9.** S3-compatible interface baseline. S3RDMA Direct preserves high large-object throughput, while S3TCP and S3RDMA Buffer expose protocol and staging bottlenecks.

**Figure 10.** Per-request latency breakdown of S3RDMA Direct. Storage is backend object I/O, Network is RDMA dataplane transfer, and Control Plane is S3 frontend request and metadata processing. For small objects, fixed control-plane work dominates the remaining latency after RDMA removes TCP data movement.

**Figure 11.** Server-side aggregation amortizes per-object overhead and achieves high speedups at small chunk granularities.

**Figure 12.** Figure 12 connects aggregation throughput to serving. The first heatmap reports measured per-layer compute time for Llama 3.1 8B, while the remaining heatmaps report the transfer throughput required for Llama, Granite [30], and DeepSeek [13] models. Configurations requiring less bandwidth than the ObjectCache layer throughput are compute-bound; configurations above that boundary suffer from added latency.

**Figure 13.** TTFT overhead for Llama 3.1 8B relative to the measured optimal local layerwise baseline for each workload configuration. CW denotes chunkwise delivery and LW denotes layerwise delivery; S3Agg-LW stays close to the local baseline except in the transfer-bound corner where layer delivery cannot be hidden by compute.

**Figure 14.** Sensitivity of S3-backed KV loading to bandwidth changes for Llama 3.1 8B. Each bar reports the relative TTFT increase when the same path and granularity are capped at 10 Gbps, using its 100 Gbps result as the baseline.

**Figure 15.** Sensitivity of layerwise TTFT to throttled transfer throughput with S3Agg-LW. Each panel normalizes TTFT increase relative to its best measured point. Dashed lines show the perfect-overlap bandwidth estimate; dash-dot lines show the calibrated scheduler target.

**Figure 16.** Bandwidth scheduling under shared transfer caps. For each workload, the left panel shows the bandwidth allocated by each policy and the right panel shows the resulting added TTFT. Workload-A uses an 80 Gbps cap; Workload-B and Workload-C use 50 Gbps caps.

Original abstract

Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold. To preserve latency, current systems keep the KV cache in remote DRAM pools, increasing serving-cluster size and cost. In this paper, we explore a different approach: storing the KV cache in S3-compatible object storage so that capacity is no longer the constraint, while minimizing the impact on TTFT. We propose ObjectCache, which co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, overlapping data transfer with compute across concurrent requests. We prototype ObjectCache on a 100 Gbps RoCE cluster with NIXL (an inference library that abstracts storage and memory), Ceph RGW (an Object Gateway for clusters), and DAOS (an open source storage system). For 64K contexts, common in today's systems, ObjectCache adds only 5.6% latency over local DRAM; for 4K contexts, where less compute is available to mask transfer, ObjectCache adds 56–75 ms over the optimal local layerwise baseline. Under shared bandwidth caps, our scheduler reduces added TTFT by 1.2–1.8× compared with equal bandwidth sharing.
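The comparison against equal bandwidth sharing can be illustrated with a small allocation sketch. This is not the paper's scheduler: it is a generic max-min (water-filling) allocator, and it assumes each request has a known zero-stall rate, meaning the rate at which one layer's payload arrives within one layer of compute. All function names and the example numbers are hypothetical.

```python
from typing import List


def equal_share(n: int, cap_gbps: float) -> List[float]:
    """Baseline: split the cap evenly regardless of what each request needs."""
    return [cap_gbps / n] * n


def water_fill(demands_gbps: List[float], cap_gbps: float) -> List[float]:
    """Max-min fair split: never give a request more than its zero-stall rate."""
    alloc = [0.0] * len(demands_gbps)
    remaining = cap_gbps
    active = sorted(range(len(demands_gbps)), key=lambda i: demands_gbps[i])
    while active:
        share = remaining / len(active)
        i = active[0]
        if demands_gbps[i] <= share:
            alloc[i] = demands_gbps[i]  # fully satisfied; release the surplus
            remaining -= demands_gbps[i]
            active.pop(0)
        else:
            for j in active:            # the rest split what is left evenly
                alloc[j] = share
            break
    return alloc


def stall_per_layer_ms(compute_ms: float, demand_gbps: float, alloc_gbps: float) -> float:
    """Exposed transfer time per layer when the allocation is below the demand."""
    return compute_ms * max(0.0, demand_gbps / alloc_gbps - 1.0)


if __name__ == "__main__":
    demands = [10.0, 20.0, 60.0]  # hypothetical zero-stall rates in Gbps
    cap = 50.0
    for name, alloc in (("equal", equal_share(3, cap)), ("water-fill", water_fill(demands, cap))):
        stalls = [stall_per_layer_ms(5.0, d, a) for d, a in zip(demands, alloc)]
        print(name, [round(a, 2) for a in alloc], [round(s, 2) for s in stalls])
```

Equal sharing hands the lightest request bandwidth it cannot use, while the demand-aware split redirects that surplus to the requests that would otherwise stall.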

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ObjectCache shows a workable path to object storage for KV caches via consumption-order scheduling, but the ordering guarantee and evaluation details need more evidence.

The main takeaway is that this work makes object storage practical for large KV caches in LLM serving by co-designing retrieval to match the GPU's layerwise consumption order, so transfers overlap with compute across requests. For 64K contexts it reports just 5.6% added latency over local DRAM, and the scheduler cuts the penalty under bandwidth sharing by 1.2–1.8× versus equal allocation. That combination of object backend plus ordered scheduler is not a standard extension of prior remote-memory work, and the prototype on Ceph RGW and DAOS with NIXL on 100 Gbps RoCE gives concrete numbers rather than simulation.

Referee Report

3 major / 2 minor

Summary. The paper proposes ObjectCache, a co-designed storage protocol and transfer scheduler for retrieving KV cache from S3-compatible object storage (Ceph RGW, DAOS) during LLM inference. The key idea is to emit KV cache bytes from the remote server in the exact layerwise order the GPU will consume them, so that network transfer overlaps with compute even across concurrent requests. On a 100 Gbps RoCE prototype, the system reports 5.6% added TTFT versus local DRAM for 64K contexts and a 1.2–1.8× reduction in added TTFT under shared bandwidth caps relative to equal sharing; for 4K contexts the absolute overhead is 56–75 ms.

Significance. If the ordering guarantee and overlap hold under realistic contention, the result would allow KV cache capacity to be decoupled from expensive DRAM pools, materially lowering the cost of large-context serving while preserving acceptable latency. The direct prototype measurements (no fitted parameters) and explicit comparison to a layerwise local baseline are strengths.

major comments (3)

[Prototype and transfer schedule description] The central performance claim (5.6% overhead for 64K contexts) rests on the assumption that the co-designed protocol delivers bytes strictly in GPU consumption order with negligible reordering or contention overhead. The manuscript provides no concrete description of the required server-side indexing, custom GET semantics, or client-side reassembly logic that would enforce this ordering on Ceph RGW or DAOS; standard object-storage range GETs do not supply such a guarantee. This directly affects whether the reported overlap is achievable.
[Evaluation section] Quantitative results are presented without error bars, without stating the number of runs, without baseline implementation details (e.g., exact NIXL configuration or how the local layerwise baseline was realized), and without data-exclusion criteria. Because the 5.6% and 56–75 ms figures are the primary evidence for the overlap claim, these omissions make it impossible to judge statistical reliability or reproducibility.
[Shared-bandwidth experiments] Under shared bandwidth the scheduler is claimed to reduce added TTFT by 1.2–1.8× versus equal sharing, yet no description is given of how the scheduler detects or reacts to contention, nor of the bandwidth cap values used in the experiment. This leaves the bandwidth-sharing result only partially supported.

minor comments (2)

[Abstract] The abstract states results for “64K contexts” and “4K contexts” but does not define whether these are prompt lengths, total context lengths, or batch sizes; consistent terminology should be used throughout.
[Figures and tables] Figure captions and table headers should explicitly state the hardware (100 Gbps RoCE, specific CPU/GPU models) and the exact workload parameters so that readers can interpret the numbers without returning to the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that additional implementation and methodological details are needed to strengthen the paper and will revise accordingly. We respond to each major comment below.

Referee: [Prototype and transfer schedule description] The central performance claim (5.6% overhead for 64K contexts) rests on the assumption that the co-designed protocol delivers bytes strictly in GPU consumption order with negligible reordering or contention overhead. The manuscript provides no concrete description of the required server-side indexing, custom GET semantics, or client-side reassembly logic that would enforce this ordering on Ceph RGW or DAOS; standard object-storage range GETs do not supply such a guarantee. This directly affects whether the reported overlap is achievable.

Authors: We agree the manuscript would benefit from greater detail on the protocol. The high-level co-design is described, but low-level server-side indexing, custom GET extensions, and client reassembly are not fully specified. In revision we will add Section 3.2 with concrete descriptions of the object metadata used for layer ordering, the extended range-GET semantics implemented for Ceph RGW and DAOS, and the NIXL client logic that queues requests to enforce consumption order. This will make explicit how the ordering guarantee is provided beyond standard range GETs. revision: yes
Referee: [Evaluation section] Quantitative results are presented without error bars, without stating the number of runs, without baseline implementation details (e.g., exact NIXL configuration or how the local layerwise baseline was realized), and without data-exclusion criteria. Because the 5.6% and 56–75 ms figures are the primary evidence for the overlap claim, these omissions make it impossible to judge statistical reliability or reproducibility.

Authors: The referee correctly identifies these omissions. We will revise the evaluation section to report: error bars as standard deviation across 10 runs per data point; the precise NIXL version and configuration flags; a description of the local layerwise baseline (identical NIXL scheduler with local DRAM backend); and confirmation that no measurements were excluded. These changes will support reproducibility of the 5.6% and 56–75 ms results. revision: yes
Referee: [Shared-bandwidth experiments] Under shared bandwidth the scheduler is claimed to reduce added TTFT by 1.2–1.8× versus equal sharing, yet no description is given of how the scheduler detects or reacts to contention, nor of the bandwidth cap values used in the experiment. This leaves the bandwidth-sharing result only partially supported.

Authors: We will expand Section 5.7 with the missing details. The scheduler detects contention via NIXL telemetry on per-request progress and instantaneous available bandwidth, then reacts by re-prioritizing layer chunks and issuing smaller requests. The bandwidth caps were 80 Gbps for Workload-A and 50 Gbps for Workload-B and Workload-C (Figure 16), on the 100 Gbps RoCE link. Pseudocode for the contention reaction logic will also be added. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct prototype measurements

The paper presents ObjectCache as a co-designed storage protocol and scheduler for KV cache retrieval from object storage, with all performance claims (5.6% latency overhead for 64K contexts, 1.2-1.8x improvement under bandwidth caps) derived from direct measurements on a physical 100 Gbps RoCE prototype using Ceph RGW and DAOS. No equations, fitted parameters, predictions, or derivation chains appear in the provided text; the central claim rests on empirical overlap of transfer and compute rather than any self-referential reduction or self-citation load-bearing step. This is a standard systems paper whose results are externally falsifiable via replication on the described hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on the empirical behavior of the implemented prototype; no free parameters, mathematical axioms, or new postulated entities are invoked in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

ObjectCache co-designs the storage protocol and transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Stall-opt problem (Equation 5) and calibrated zero-stall rate r*_i = s_i / c_i

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 4 internal anchors

[1]

Mohamad Abou Ali, Fadi Dornaika, and Jinan Charafeddine. 2025. Agentic AI: a comprehensive survey of architectures, applications, and future directions. Artificial Intelligence Review 59, 1 (2025), 11.

[2]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2025. Efficient llm inference via chunked prefills. ACM SIGOPS Operating Systems Review 59, 1 (2025), 9–16.

[3]

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 4895–4901.

[4]

Anthropic. 2026. Models & Pricing. https://platform.claude.com/docs/en/about-claude/pricing

[5]

Ceph. 2026. Ceph - a scalable distributed storage system. https://github.com/ceph/ceph

[6]

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed Abdelfattah, and Kai-Chiang Wu. 2025. Palu: KV-cache compression with low-rank projection. In International Conference on Learning Representations, Vol. 2025. 50222–50249.

[7]

Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance-Informed Multi-Tier prefix KV storage system for large language model inference. In 23rd USENIX Conference on File and Storage Technologies (FAST 25). 187–201.

[8]

Yukang Chen, Weihao Cui, Han Zhao, Ziyi Xu, Xiaoze Fan, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Bingsheng He, and Quan Chen. Towards High-Goodput LLM Serving with Prefill-decode Multiplexing. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2030–2047.

[10]

Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, and Haibo Chen. KUNSERVE: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving. In Proceedings of the 21st European Conference on Computer Systems (EuroSys ’26). Association for Computing Machinery, 1244–1260. doi:10.1145/3767295.3769348

[12]

Cloudian. 2025. Supercharging Vector Database Indexing: 8x Faster with Cloudian S3 RDMA, Milvus and NVIDIA. https://cloudian.com/blog/supercharging-vector-database-indexing-8x-faster-with-cloudian-s3-rdma-and-nvidia/

[13]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.

[15]

DAOS. 2026. DAOS. https://docs.daos.io/v2.6/

[16]

DeepSeek. 2026. DeepSeek R1 Distill Qwen 7B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

[17]

DeepSeek. 2026. Models & Pricing. https://api-docs.deepseek.com/quick_start/pricing

[18]

Dell Technologies. 2026. Accelerating AI Workloads with RDMA for S3-compatible storage: A Game-Changer with Dell ObjectScale. https://infohub.delltechnologies.com/en-uk/p/accelerating-ai-workloads-with-rdma-for-s3-compatible-storage-a-game-changer-with-dell-objectscale/

[19]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Rühle, Laks Lakshmanan, and Ahmed H Awadallah. 2024. Hybrid llm: Cost-efficient and quality-aware query routing. In International Conference on Learning Representations, Vol. 2024. 41348–41366.

[20]

Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, and Mao Yang. 2026. Bitdecoding: Unlocking tensor cores for long-context llms with low-bit kv cache. In 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1–13.

[21]

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining. 6491–6501.

[22]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2026. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. Advances in Neural Information Processing Systems 38 (2026), 113152–113188.

[23]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 135–153.

[24]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient large language model serving for multi-turn conversations with CachedAttention. In 2024 USENIX annual technical conference (USENIX ATC 24). 111–126.

[25]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. Model tells you what to discard: Adaptive kv cache compression for llms. In International Conference on Learning Representations, Vol. 2024. 22975–22988.

[26]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems 6 (2024), 325–338.

[27]

Google. 2026. Models & Pricing. https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing

[28]

Yingyi Hao, Ting Yao, Xingda Wei, Dingyan Zhang, Tianle Sun, Yiwen Zhang, Zhiyong Fu, Huatao Wu, and Rong Chen. 2026. Fast Cloud Storage for AI Jobs via Grouped I/O API with Transparent Read/Write Optimizations. In 24th USENIX Conference on File and Storage Technologies (FAST 26). 255–270.

[29]

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems 37 (2024), 1270–1303.

[30]

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, et al. Memserve: Context caching for disaggregated llm serving with elastic memory pool. arXiv preprint arXiv:2406.17565 (2024).

[32]

Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Chenxi Wang, Jiang Xu, Shuang Chen, Hao Feng, Sa Wang, Yungang Bao, et al. 2025. ShuffleInfer: Disaggregate LLM inference for mixed downstream workloads. ACM Transactions on Architecture and Code Optimization 22, 2 (2025), 1–24.

[33]

Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, et al. 2025. DEEPSERVE: Serverless Large Language Model Serving at Scale. In 2025 USENIX Annual Technical Conference (USENIX ATC 25). 57–72.

[34]

IBM. 2026. Granite 3.3 8B. https://huggingface.co/ibm-granite/granite-3.3-8b-instruct

[35]

Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912.

[36]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.

[38]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172.

[39]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37 (2024), 22947–22970.

[41]

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024).

[42]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al. Lmcache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv preprint arXiv:2510.09665 (2025).

[44]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference. 38–56.

[45]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36 (2023), 52342–52364.

[47]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: a tuning- free asymmetric 2bit quantization for KV cache. InProceedings of the 41st International Conference on Machine Learning. 32332–32344

work page 2024
[48]

LMCache. 2026. LMCache.https://github.com/lmcache/lmcache

work page 2026
[49]

Jiaqi Lou, Xinhao Kong, Jinghan Huang, Wei Bai, Nam Sung Kim, and Danyang Zhuo. 2024. Harmonic: Hardware-assisted{RDMA}per- formance isolation for public clouds. In21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 1479–1496

work page 2024
[50]

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving large language models over heterogeneous gpus and network via max-flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 586–602

work page 2025
[51]

Fanxu Meng, Pingzhi Tang, Zengwei Yao, Xing Sun, and Muhan Zhang. 2026. TransMLA: Migrating GQA models to MLA with full deepseek compatibility and speedup.Advances in Neural Information Processing Systems38 (2026), 81977–82019

work page 2026
[52]

Meta. 2026. Llama 3.1 8B.https://huggingface.co/meta-llama/Llama- 3.1-8B

work page 2026
[53]

MinIO. 2025. MinIO S3 over RDMA.https://blog.min.io/s3-over- rdma/

work page 2025
[54]

Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, and Fan Lai

work page
[55]

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration.arXiv preprint arXiv:2604.25080(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

NVIDIA. 2025. How to Unlock Accelerated AI Storage Performance With RDMA for S3-Compatible Storage.https://blogs.nvidia.com/ blog/s3-compatible-ai-storage/

work page 2025
[57]

NVIDIA. 2026. NVIDIA cuObject: GPUDirect Storage for Objects. https://docs.nvidia.com/gpudirect-storage/cuobject/index.html

work page 2026
[58]

NVIDIA. 2026. NVIDIA cuObject server v1.0.0 Release Notes. https://docs.nvidia.com/gpudirect-storage/cuobject/cuobject- server-release-notes/index.html

work page 2026
[59]

Nvidia. 2026. NVIDIA Inference Xfer Library (NIXL).https://github. com/ai-dynamo/nixl

work page 2026
[60]

OpenAI. 2026. Models & Pricing.https://developers.openai.com/api/ docs/pricing

work page 2026
[61]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

work page 2024
[62]

Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ram- jee, and Ashish Panwar. 2025. vattention: Dynamic memory manage- ment for serving llms without pagedattention. InProceedings of the 30th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating Systems, Volume 1. 1133–1150

work page 2025
[63]

Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yong- wei Wu, Weimin Zheng, and Mingxing Zhang. 2026. Prefill-as- a-Service: KVCache of Next-Generation Models Could Go Cross- Datacenter.arXiv preprint arXiv:2604.15039(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[64]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

work page 2024
[65]

2021.{ReDMArk}: Bypassing{RDMA}security mechanisms

Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. 2021.{ReDMArk}: Bypassing{RDMA}security mechanisms. In30th USENIX Security Symposium (USENIX Security 21). 4277–4292. 14 ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

work page 2021
[66]

Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[67]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning. PMLR, 31094–31116

work page 2023
[68]

Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. 2020. Securing{RDMA}for{High-Performance}datacen- ter storage systems. In12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 20)

work page 2020
[69]

Zhaoyuan Su, Zeyu Zhang, Tingfeng Lan, Zirui Wang, Juncheng Yang, and Yue Cheng. 2026. MorphServe: Efficient and Workload- Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing. InProceedings of Machine Learning and Systems

work page 2026
[70]

Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. [n. d.]. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Infer- ence. InForty-second International Conference on Machine Learning

work page
[71]

Shuwen Sun, Isaac Khor, Ji-Yong Shin, and Peter Desnoyers. 2025. A Fast, Efficient, and Strongly-Consistent Object Store. InProceedings of the 2025 ACM Symposium on Cloud Computing. 708–721

work page 2025
[72]

Shin-Yeh Tsai, Mathias Payer, and Yiying Zhang. 2019. Pythia: remote oracles for the masses. In28th USENIX Security Symposium (USENIX Security 19). 693–710

work page 2019
[73]

UCX. 2026. Unified Communication X.https://github.com/openucx/ ucx

work page 2026
[74]

VAST Data. 2025. S3 over RDMA: Scaling the KV Cache Data Plane.https://www.vastdata.com/blog/s3-over-rdma-scaling-the- kv-cache-data-plane

work page 2025
[75]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. At- tention is all you need.Advances in neural information processing systems30 (2017)

work page 2017
[76]

vLLM. 2026. vLLM.https://github.com/vllm-project/vllm

work page 2026
[77]

Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. {KVCache}Cache in the Wild: Characterizing and Optimizing {KVCache}Cache at a Large Cloud Provider. In2025 USENIX Annual Technical Conference (USENIX ATC 25). 465–482

work page 2025
[78]

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al. 2025. Burstgpt: A real-world workload dataset to optimize llm serving sys- tems. InProceedings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V. 2. 5831–5841

work page 2025
[79]

2026.{ServeGen}: Workload Char- acterization and Generation of Large Language Model Serving in Pro- duction

Yuxing Xiang, Xue Li, Kun Qian, Yan Zhang, Wenyuan Yu, Ennan Zhai, Xin Jin, and Jingren Zhou. 2026.{ServeGen}: Workload Char- acterization and Generation of Large Language Model Serving in Pro- duction. In23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26). 1845–1859

work page 2026
[80]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with atten- tion sinks. InInternational Conference on Learning Representations, Vol. 2024. 21875–21895

work page 2024
