CacheGen: KV cache compression and streaming for fast large language model serving, 2023

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang · 2024 · arXiv 2310.07240

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving

cs.AR · 2026-05-20 · unverdicted · novelty 6.0

CacheTune delivers 3.72x-4.86x TTFT speedup and 3.93x-6.21x throughput in long-context LLM serving via frequency-guided selective KV recomputation and hardware-aware I/O overlap while keeping output quality near full recompute.

Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

cs.DC · 2026-06-29 · unverdicted · novelty 5.0 · 2 refs

Organizes the heterogeneous LLM prefill-decode design space along four axes and extracts three boundary decisions with guidance on precision, KV representation, and ownership.

citing papers explorer

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving · cs.AI · 2026-06-04 · unverdicted · none · ref 45
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving · cs.AR · 2026-05-20 · unverdicted · none · ref 26
Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving · cs.DC · 2026-06-29 · unverdicted · none · ref 40 · 2 links
