Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Qin Ruoyu, Li Zheming, He Weiran, Cui Jialei, Tang Heyi, Ren Feng, Ma Teng, Cai Shangming, Zhang Yineng, Zhang Mingxing, Wu Yongwei, Zheng Weimin, Xu Xinran · 2025 · DOI 10.1145/3773772

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

cs.DC · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

cs.DC · 2026-05-03 · unverdicted · novelty 6.0 · 3 refs

SplitZip introduces a fast lossless KV cache compressor for disaggregated LLM inference that achieves 613 GB/s compression throughput on BF16 tensors and up to 1.32x end-to-end speedup.

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

cs.PF · 2026-06-02 · unverdicted · novelty 5.0

NetKV is a network-aware O(|D|) greedy scheduler for decode instance selection that reduces mean TTFT by up to 21.2% versus round-robin and 17.6% versus cache+load baselines in 64-GPU fat-tree simulations.

citing papers explorer

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

cs.DC · 2026-05-04 · unverdicted · none · ref 39 · 2 links

SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

cs.DC · 2026-05-03 · unverdicted · none · ref 23 · 3 links

SplitZip introduces a fast lossless KV cache compressor for disaggregated LLM inference that achieves 613 GB/s compression throughput on BF16 tensors and up to 1.32x end-to-end speedup.
