From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxin Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jin Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, Cong Jiang · 2026 · arXiv 2601.12904

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference

cs.DB · 2026-03-30 · conditional · novelty 7.0

QCFuse speeds up RAG inference in LLMs by 40% through query-focused cache fusion and selective token recomputation while preserving accuracy.

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 64% lower P90 TTFT than homogeneous baselines in a 1T-parameter case study.
