QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
CacheGen: KV cache compression and streaming for fast large language model serving, 2023
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
CacheTune delivers 3.72x-4.86x TTFT speedup and 3.93x-6.21x throughput in long-context LLM serving via frequency-guided selective KV recomputation and hardware-aware I/O overlap while keeping output quality near full recompute.
Organizes the heterogeneous LLM prefill-decode design space along four axes and extracts three boundary decisions with guidance on precision, KV representation, and ownership.
citing papers explorer
-
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
-
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
CacheTune delivers 3.72x-4.86x TTFT speedup and 3.93x-6.21x throughput in long-context LLM serving via frequency-guided selective KV recomputation and hardware-aware I/O overlap while keeping output quality near full recompute.
-
Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving
Organizes the heterogeneous LLM prefill-decode design space along four axes and extracts three boundary decisions with guidance on precision, KV representation, and ownership.