From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Representative citing papers
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.
Citing papers
- QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
QCFuse speeds up RAG inference in LLMs by 40% through query-focused cache fusion and selective token recomputation while preserving accuracy.
- Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.