Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu · 2025

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

cs.AR · 2026-04-09 · conditional · novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

cs.DC · 2026-05-16 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

citing papers explorer

Showing 2 of 2 citing papers.

A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators cs.AR · 2026-04-09 · conditional · none · ref 61
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources cs.DC · 2026-05-16 · unverdicted · none · ref 23
GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.

Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot

fields

years

verdicts

representative citing papers

citing papers explorer