QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Haoyang Li; Jiabao Jin; Jianxin Yan; Jia Zhu; Kui Ren; Lei Chen; Peng Cheng; Wangze Ni; Xuemin Lin; Zhenxin Li; Zhitao Shen

arxiv: 2606.05875 · v1 · pith:FPDXDU6Inew · submitted 2026-06-04 · 💻 cs.AI · cs.DB

Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

Pith reviewed 2026-06-28 01:45 UTC · model grok-4.3

classification 💻 cs.AI cs.DB

keywords RAG serving · KV cache fusion · query-aware selection · prefill optimization · LLM inference · cache reuse · compressed view

The pith

QCFuse achieves full-prefill quality in RAG serving by fusing KV caches with a compressed-view query-aware selector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RAG improves LLM answers by grounding them in external evidence but makes the prefill stage expensive due to long retrieved contexts. QCFuse addresses this by reusing precomputed KV caches for chunks while selectively recomputing only necessary tokens under the current prompt. It does so with a selector that probes user queries against compact per-chunk anchors and profiles only critical layers to decide what to recompute. This design avoids stalling the layer-wise pipeline that full-view selectors would cause. Evaluations show the approach matches full-prefill quality while delivering 1.7x average prefill speedup over full prefill and 1.5x over the prior strongest baseline.

Core claim

QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline, by using chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection.

What carries the argument

The compressed-view query-aware selector that uses chunk-anchor query probing and critical-layer profiling to identify relevant evidence and recomputation tokens.
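
To make the selector idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the user query has already been run against the chunk anchors to yield one query vector per profiled layer, and it scores context tokens by softmax attention summed over a few critical layers. The names `recompute_mask`, `query_by_layer`, `keys_by_layer`, and `budget` are hypothetical, and the scoring rule is a stand-in for whatever the paper actually uses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recompute_mask(query_by_layer, keys_by_layer, critical_layers, budget):
    """Return a boolean mask over context tokens marking which to recompute.

    query_by_layer[l]: query vector at layer l (assumed to come from probing
                       the query against compact per-chunk anchors)
    keys_by_layer[l]:  cached key vectors at layer l, one per context token
    critical_layers:   the few layers inspected instead of all layers
    budget:            fraction of context tokens to recompute, in (0, 1]
    """
    n_tokens = len(keys_by_layer[critical_layers[0]])
    scores = [0.0] * n_tokens
    for layer in critical_layers:
        q = query_by_layer[layer]
        scale = 1.0 / math.sqrt(len(q))
        attn = softmax([dot(q, k) * scale for k in keys_by_layer[layer]])
        for i, a in enumerate(attn):
            scores[i] += a  # accumulate relevance across critical layers only
    k = max(1, math.ceil(budget * n_tokens))
    ranked = sorted(range(n_tokens), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(n_tokens)]


# Toy inputs: four context tokens, two profiled layers, 2-d vectors.
toy_keys = [[0.0, 1.0], [5.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
toy_query = {3: [1.0, 0.0], 7: [1.0, 0.0]}
toy_cache = {3: toy_keys, 7: toy_keys}
mask = recompute_mask(toy_query, toy_cache, critical_layers=[3, 7], budget=0.25)
print(mask)
```

In this toy setup only the second token aligns strongly with the query, so it is the one selected at a 25% budget. A real selector would work on per-head tensors rather than Python lists.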

If this is right

QCFuse matches full-prefill quality on RAG tasks.
It delivers 1.7x prefill-time speedup over full prefill at matched quality.
It delivers 1.5x prefill-time speedup over ProphetKV at matched quality.
The speedups hold across four open-weight LLMs and six datasets.
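
As a rough sanity check on what these ratios mean in wall-clock terms, the Python sketch below converts each reported speedup into a share of the baseline's prefill time. The helper name `time_fraction` is an illustrative choice. The last calculation assumes both averages were taken over comparable configurations, which this summary does not confirm, so the implied ProphetKV figure is only indicative.

```python
def time_fraction(speedup: float) -> float:
    """Fraction of the baseline's prefill time that remains after a speedup."""
    return 1.0 / speedup


SPEEDUP_VS_FULL_PREFILL = 1.7  # reported average, QCFuse vs. full prefill
SPEEDUP_VS_PROPHETKV = 1.5     # reported average, QCFuse vs. ProphetKV

# QCFuse prefill time as a share of each baseline's prefill time.
share_of_full = time_fraction(SPEEDUP_VS_FULL_PREFILL)
share_of_prophetkv = time_fraction(SPEEDUP_VS_PROPHETKV)

# Assumption: if both averages held on the same workload, ProphetKV's own
# speedup over full prefill would be the ratio of the two reported numbers.
implied_prophetkv_vs_full = SPEEDUP_VS_FULL_PREFILL / SPEEDUP_VS_PROPHETKV

print(f"QCFuse uses {share_of_full:.0%} of full-prefill time")
print(f"QCFuse uses {share_of_prophetkv:.0%} of ProphetKV prefill time")
print(f"implied ProphetKV speedup over full prefill: {implied_prophetkv_vs_full:.2f}x")
```

Under those assumptions, QCFuse's prefill takes roughly 59% of the full-prefill time and 67% of ProphetKV's, which would put ProphetKV at about 1.13x over full prefill. Averages of ratios do not compose exactly, so treat the last number as a ballpark.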

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compressed probing idea might reduce recomputation needs in non-RAG settings that reuse past KV states.
If the critical-layer choice generalizes, it could cut visibility requirements in other layer-pipelined inference systems.
Lower prefill latency could make longer retrieved contexts practical in production RAG without extra hardware.

Load-bearing premise

Chunk-anchor query probing combined with critical-layer profiling can reliably identify relevant evidence and necessary recomputation tokens without full context or all-layer visibility.

What would settle it

Measure whether answer quality on a held-out dataset or model drops below full-prefill levels when QCFuse is applied at its reported recomputation budget.

Figures

Figures reproduced from arXiv: 2606.05875 by Haoyang Li, Jiabao Jin, Jianxin Yan, Jia Zhu, Kui Ren, Lei Chen, Peng Cheng, Wangze Ni, Xuemin Lin, Zhenxin Li, Zhitao Shen.

**Figure 2.** Cache-fusion workflow for RAG.

**Figure 3.** Full prefill computes all token states, whereas selec…

**Figure 4.** QCFuse forms query-aware recomputation masks using compact anchor tokens and a few critical layers.

**Figure 5.** Full-view profiling reveals compression opportuni…

**Figure 6.** KVzip@10% provides a compact operating point near the full-context reference while using only a small anchor cache.

**Figure 7.** Single-layer evidence localization peaks in model-dependent middle layers rather than final layers.

**Figure 8.** Top-3 profiled layers capture most of the Re…

**Figure 9.** QCFuse shortens the pre-fusion selection path and…

**Figure 10.** QCFuse reaches full-prefill-level quality at lower TTFT. Each panel shows one model–task pair.

**Figure 11.** QCFuse preserves quality while reducing TTFT by…

**Figure 12.** QCFuse maintains high Normalized-SM as RULER contexts grow. The sweep varies chunk count and chunk size at…

**Figure 13.** QCFuse is less sensitive to cache-loading band…

**Figure 14.** QCFuse sustains lower TTFT at higher request throughput across models.

**Figure 15.** A small chunk-anchor set captures most of the…

**Figure 16.** The profiled Top-1 critical layer gives the highest…

read the original abstract

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
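
The fusion step the abstract describes (reuse cached KV entries, recompute only selected tokens) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not SGLang code or the paper's implementation: `fuse_layer` and the `recompute` callback are hypothetical names, and KV entries are represented as opaque Python objects rather than tensors.

```python
def fuse_layer(cached_entries, mask, recompute):
    """Build one layer's fused KV list from cached and recomputed entries.

    cached_entries: per-token KV entries loaded from precomputed chunk caches
    mask:           booleans, True where the token must be recomputed
    recompute:      callable taking a token index and returning a fresh entry
                    computed under the current prompt (hypothetical hook)
    Returns the fused entries and the number of tokens that were recomputed.
    """
    if len(cached_entries) != len(mask):
        raise ValueError("mask must have one flag per cached token")
    fused = []
    n_recomputed = 0
    for index, (entry, needs_recompute) in enumerate(zip(cached_entries, mask)):
        if needs_recompute:
            fused.append(recompute(index))
            n_recomputed += 1
        else:
            fused.append(entry)  # reuse the cached entry as is
    return fused, n_recomputed


# Toy run with string placeholders standing in for KV tensors.
cached = ["c0", "c1", "c2", "c3"]
fused, count = fuse_layer(cached, [False, True, False, True], lambda i: f"r{i}")
print(fused, count)
```

The point of the sketch is the cost model: work scales with the number of `True` flags in the mask, so the selector's job is to keep that count small without dropping tokens the answer depends on.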

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QCFuse gives a workable compressed-view selector for RAG cache fusion that claims full-prefill quality at 1.7x speedup, but the experiments need checking to confirm the selector actually holds quality without full context.

The main takeaway is that this paper presents QCFuse as a way to break the quality-efficiency deadlock in RAG KV cache fusion. It uses chunk-anchor query probing to let the query condition on small per-chunk anchors and critical-layer profiling to decide what to recompute, all without waiting for the full context or every layer. That design keeps the layer-wise pipeline from stalling while still trying to match full prefill quality.

What stands out as new is the combination of those two techniques to address the specific problem the authors lay out: query-agnostic methods miss relevant chunks and full-view query-aware ones slow everything down. The implementation in SGLang and the reported numbers—matching quality with 1.7x over full prefill and 1.5x over ProphetKV across four models and six datasets—show they took the serving constraints seriously.

The soft spot is exactly the one the stress-test flags. The whole claim rests on the anchors and critical-layer signals being reliable enough to pick the right evidence and recompute tokens. If those signals fall short on some queries or datasets, the pipeline runs but quality drops below the asserted full-prefill level. The abstract gives no detail on how they measured that reliability, what the failure cases look like, or whether the gains hold after accounting for variance across runs.

Because the full experimental section was not available, it is hard to judge the strength of the evidence or whether the baselines were handled fairly. This is a practical systems paper aimed at people building or tuning RAG serving stacks. Readers working on inference optimizations will find the selector design useful even if they end up tweaking it.

It deserves peer review so the experiments can be examined directly. The core idea is clear enough and the problem is real, but the quality claim needs the full data to stand.

Referee Report

2 major / 1 minor

Summary. The paper presents QCFuse, a compressed-view query-aware selector for RAG cache fusion that employs chunk-anchor query probing to condition on per-chunk anchors and critical-layer profiling to select recomputation tokens. It claims this reaches full-prefill quality while delivering 1.7x average prefill-time speedup over full prefill and 1.5x over ProphetKV across four open-weight LLMs and six datasets, implemented in SGLang.

Significance. If the quality-matching claim holds under the compressed-view constraints, the work would meaningfully advance efficient RAG serving by enabling layer-wise cache fusion without pipeline stalls, directly addressing the quality-efficiency tradeoff noted in prior selectors.

major comments (2)

[Evaluation / §4] The central claim of full-prefill-level quality rests on the assumption that chunk-anchor probing plus critical-layer profiling suffices to identify relevant evidence and recomputation tokens. The manuscript should include a direct comparison (e.g., in the evaluation section) of selector accuracy against a full-view baseline on queries where broad context is required, with quantitative metrics such as evidence recall or end-to-end answer quality delta.
[Experiments] Table reporting speedups at matched quality (presumably Table X) does not appear to include per-dataset variance, statistical significance, or controls for post-hoc threshold tuning; without these, the 1.7x and 1.5x averages cannot be assessed as robust across the six datasets.
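
The selector-level metrics the first comment asks for could be computed along the following lines. This Python sketch is only one plausible reading of "evidence recall" and "answer quality delta". The function names and the toy token indices are hypothetical and do not come from the paper.

```python
def evidence_recall(selected_tokens, reference_tokens):
    """Share of reference (full-view) evidence tokens that the selector also picked."""
    reference = set(reference_tokens)
    if not reference:
        return 1.0  # nothing to find, so nothing was missed
    return len(reference & set(selected_tokens)) / len(reference)


def quality_delta(selector_score, full_prefill_score):
    """Signed end-to-end gap; negative means the selector run scored lower."""
    return selector_score - full_prefill_score


# Hypothetical token indices for one query, chosen only to exercise the code.
compressed_view_pick = [3, 7, 8, 21]
full_view_pick = [3, 7, 9, 21, 30]
recall = evidence_recall(compressed_view_pick, full_view_pick)
print(f"evidence recall: {recall:.2f}")
```

Reporting recall per query, rather than only a dataset average, would show whether misses cluster on the broad-context queries the referee is worried about.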

minor comments (1)

[Method] Notation for 'chunk-anchor' and 'critical-layer' should be defined with a small example in the method section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of QCFuse. We address each major comment below and commit to revisions that improve the evaluation rigor without altering the core claims.

Referee: [Evaluation / §4] The central claim of full-prefill-level quality rests on the assumption that chunk-anchor probing plus critical-layer profiling suffices to identify relevant evidence and recomputation tokens. The manuscript should include a direct comparison (e.g., in the evaluation section) of selector accuracy against a full-view baseline on queries where broad context is required, with quantitative metrics such as evidence recall or end-to-end answer quality delta.

Authors: We agree that an explicit selector-level comparison would further substantiate the compressed-view design. While our end-to-end results already show QCFuse matching full-prefill answer quality (which serves as the ultimate validation of evidence selection), we will add a new analysis in §4. This will report evidence recall for QCFuse versus a full-view oracle on a curated subset of queries requiring broad context, using the same six datasets. The addition will be limited to post-hoc analysis on existing traces to avoid new experiments.

revision: yes

Referee: [Experiments] Table reporting speedups at matched quality (presumably Table X) does not appear to include per-dataset variance, statistical significance, or controls for post-hoc threshold tuning; without these, the 1.7x and 1.5x averages cannot be assessed as robust across the six datasets.

Authors: We acknowledge the table lacks these details. In the revised manuscript we will expand the table (and its caption) to report per-dataset means with standard deviations across three random seeds, include paired t-test p-values against baselines, and explicitly state that all thresholds were selected via 5-fold cross-validation on a held-out portion of each dataset rather than post-hoc on test data. These changes will be made to the existing results without new runs.

revision: yes
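
For readers who want to see what the promised statistics involve, the Python sketch below computes a per-dataset mean with sample standard deviation and a paired t statistic using only the standard library. The latency values are synthetic placeholders, not measurements from the paper, and the helper names are illustrative. Turning the t statistic into a p-value needs the Student t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Synthetic placeholder latencies (ms) for three seeds; NOT results from the paper.
baseline_ms = [2.0, 4.0, 6.0]
method_ms = [1.0, 2.0, 3.0]

mean_ms, std_ms = mean_and_std(method_ms)
t_stat, dof = paired_t_statistic(baseline_ms, method_ms)
print(f"method: {mean_ms:.1f} ± {std_ms:.1f} ms, t = {t_stat:.3f}, df = {dof}")
```

With only three seeds there are two degrees of freedom, so any such test has little power. That is worth keeping in mind when reading the revised table.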

Circularity Check

0 steps flagged

No circularity; empirical systems paper with external baselines

The paper is an empirical systems contribution that implements QCFuse in SGLang and reports direct speed/quality measurements against full prefill and ProphetKV on six datasets and four LLMs. No mathematical derivation chain, fitted-parameter-as-prediction, or self-citation load-bearing step is present; all quality claims rest on external experimental comparison rather than reduction to the method's own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details on free parameters, axioms, or invented entities are available from the abstract; the contribution appears to be an engineering optimization without new theoretical constructs.

Reference graph

Works this paper leans on

78 extracted references · 20 canonical work pages · 1 internal anchor

[1]

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28

2025
[2]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

2024
[3]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 4895–4901. https://doi.org/1...

work page doi:10.18653/v1/2023.emnlp- 2023
[4]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv:2308.14508 [cs.CL] https://arxiv.org/abs/2308.14508

Pith/arXiv arXiv 2024
[5]

Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, Yue Gong, Adila Krisnadhi, and Raul Castro Fernandez. 2025. Pneuma: Leveraging llms for tabular data representation and retrieval in an end-to-end system. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28

2025
[6]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069 [cs.CL] https://arxiv.org/abs/2406.02069

Pith/arXiv arXiv 2024
[7]

Cheng Chen, Chenzhe Jin, Yunan Zhang, Sasha Podolsky, Chun Wu, Szu-Po Wang, Eric Hanson, Zhou Sun, Robert Walzer, and Jianguo Wang. 2024. SingleStore-V: An Integrated Vector Database System in SingleStore. Proc. VLDB Endow. 17, 12 (2024), 3772–3785. https://doi.org/10.14778/3685800.3685805

work page doi:10.14778/3685800.3685805 2024
[8]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, Vol. 35. 16344–16359

2022
[9]

Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, et al. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. In Companion of the 2025 International Conference on Management of Data. 364–377. https://doi.org/10.1145/3722212.3724428

work page doi:10.1145/3722212.3724428 2025
[10]

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. 2025. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution. https://doi.org/10.48550/arXiv.2510.00636 arXiv:2510.00636 [cs.AI]

work page doi:10.48550/arxiv.2510.00636 2025
[11]

Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. 2024. A Simple and Effective 𝐿2 Norm-Based Strategy for KV Cache Compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA, 18476–18499. https://doi.org/10.18653/v1/2024.emnlp-main.1027

work page doi:10.18653/v1/2024.emnlp-main.1027 2024
[12]

Amr Elmeleegy, Nick Comly, and Sharan Chetlur. 2024. Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill. NVIDIA Technical Blog. https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/ Accessed: 2026-05-28

2024
[13]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. 2024. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv:2407.11550 [cs.CL] https://arxiv.org/abs/2407.11550

Pith/arXiv arXiv 2024
[14]

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28. https://doi.org/10.1145/3725394

2025
[15]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 https://arxiv.org/abs/2312.10997

Pith/arXiv arXiv 2023
[16]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2023. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801 (2023)

Pith/arXiv arXiv 2023
[18]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems 6 (2024), 325–338

2024
[19]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

Pith/arXiv arXiv 2024
[20]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938
[22]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. arXiv:2011.01060 [cs.CL] https://arxiv.org/abs/2011.01060

Pith/arXiv arXiv 2020
[23]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654 [cs.CL] https://arxiv.org/abs/2404.06654

Pith/arXiv arXiv 2024
[24]

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. 2024. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. arXiv preprint arXiv:2410.15332 (2024)

arXiv 2024
[25]

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 3543–3556. https://doi.org/10.18653/v1/N19-1357

work page doi:10.18653/v1/n19-1357 2019
[26]

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th annual meeting of the association for computational linguistics. 3651–3657

2019
[27]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

Pith/arXiv arXiv 2023
[28]

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. 2024. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems 37 (2024), 52481–52515

2024
[30]

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736 [cs.CL] https://arxiv.org/abs/2310.05736

arXiv 2023
[31]

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, B... https://doi.org/10.18653/v1/2024.acl-long.91

work page doi:10.18653/v1/2024.acl-long.91 2024
[33]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2024. Ragcache: Efficient knowledge caching for retrieval-augmented generation. ACM Transactions on Computer Systems (2024)

2024
[34]

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 6769–6781. https://doi.org/10.18653/v1/2020....

work page doi:10.18653/v1/2020.emnlp-main.550 2020
[35]

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. 2025. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. arXiv preprint arXiv:2505.23416 (2025)

arXiv 2025
[36]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Tr...

work page doi:10.1162/tacl_a_00276 2019
[37]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 61...

work page doi:10.1145/3600006.3613165 2023
[38]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

2020
[39]

Guoliang Li, Xuanhe Zhou, and Xinyang Zhao. 2024. LLM for Data Management. Proc. VLDB Endow. 17, 12 (2024), 4213–4216. https://doi.org/10.14778/3685800.3685838

work page doi:10.14778/3685800 2024
[40]

Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. Hotprefix: Hotness-aware kv cache scheduling for efficient prefix sharing in llm inference systems. Proceedings of the ACM on Management of Data 3, 4 (2025), 1–27. https://doi.org/10.1145/3749168

work page doi:10.1145/3749168 2025
[41]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems 37 (2024), 22947–22970

2024
[42]

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. 2024. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. arXiv:2405.04532 [cs.CL] https://arxiv.org/abs/2405.04532

arXiv 2024
[43]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics 12 (2024), 157–173

2024
[44]

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv:2510.09665 [cs.LG] https://arxiv.org/abs/2510.09665

arXiv 2025
[45]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. arXiv:2310.07240 [cs.NI] https://arxiv.org/abs/2310.07240

arXiv 2024
[46]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. In Advances in Neural Information Processing Systems, Vol. 36. https://papers.nips.cc/paper_files/paper/2023/ha...

2023
[47]

Samuel Madden, Michael J. Cafarella, Michael J. Franklin, and Tim Kraska. 2024. Databases Unbound: Querying All of the World’s Bytes with AI. Proc. VLDB Endow. 17, 12 (2024), 4546–4554. https://doi.org/10.14778/3685800.3685916

work page doi:10.14778/3685800.3685916 2024
[48]

Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations. https://openreview.net/forum?id=R8sQPpGCv0

2022
[49]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264

work page doi:10.18653/v1/d16-1264 2016
[50]

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, 464–468. https://doi.org/10.18653...

work page doi:10.18653/v1/n18-2074 2018
[51]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150 https://arxiv.org/abs/1911.02150

Pith/arXiv arXiv 2019
[52]

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013 (2025)

Pith/arXiv arXiv 2025
[53]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864 [cs.CL] https://arxiv.org/abs/2104.09864

Pith/arXiv arXiv 2023
[54]

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 235. PMLR, 47901–47911. https://proceedings.mlr.press/v235/tang24l.html

2024
[55]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025
[56]

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics. 4593–4601

2019
[57]

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP. Association for Computational Linguistics, Vancouver, Canada, 191–200. https://doi.org/10.18653/v1/W17-2623

work page doi:10.18653/v1/w17-2623 2017
[58]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop Questions via Single-hop Question Composition. arXiv:2108.00573 [cs.CL] https://arxiv.org/abs/2108.00573

arXiv
[60]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

2017
[61]

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284 (2019)

Pith/arXiv arXiv 2019
[62]

Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxin Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jingqi Tang, Yaochen Han, Zhiyuan Ai, et al. 2026. From prefix cache to fusion rag cache: Accelerating llm inference in retrieval-augmented generation. Proceedings of the ACM on Management of Data 4, 1 (SIGMOD) (2026), 1–28

[63]

Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xiangyu Zou, Wen Xia, Wentao Zhang, Chong Qiu, and Pengfei Wang. 2026. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation. ArXiv abs/2602.02579 (2026)

[64]

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. 2024. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. Advances in Neural Information Processing Systems 37 (2024), 119638–119661

[65]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. In International Conference on Learning Representations. https://openreview.net/forum?id=NG7sS51zVF

[66]

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. arXiv:2310.04408 [cs.CL] https://arxiv.org/abs/2310.04408

[67]

Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Bangkok, Thailand, 3258–3270. https://doi.org/10.18653/v1/2024.findings-acl.195

[70]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600 [cs.CL] https://arxiv.org/abs/1809.09600

[71]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems. 94–109

[72]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’22). USENIX Association, Carlsbad, CA, USA, 521–538

[73]

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. 2025. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 23078–23097

[74]

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. 2025. Pqcache: Product quantization-based kvcache for long context llm inference. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–30. https://doi.org/10.1145/3725338

[75]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36 (2023), 34661–34710

[76]

Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs. Proc. VLDB Endow. 17, 12 (2024), 4481–4484. https://doi.org/10.14778/3685800.3685905

[77]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2023. SGLang: Efficient Execution of Structured Language Model Programs. https://doi.org/10.48550/arXiv.2312.07104 arXiv:2312.07104 [cs.AI]

[78]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, Santa Clara, CA, USA, 193–210

