HYPIC: Accelerating Hybrid-Attention LLM Serving with Position-Independent Caching

Junhao Hu; Juntong Wu; Minghao Li; Weihang Chen; Xiaoxu Chen; Yang Liu; Yifei Liu

arxiv: 2607.01299 · v1 · pith:BFXK4QWPnew · submitted 2026-07-01 · 💻 cs.DC

Yifei Liu, Juntong Wu, Yang Liu, Junhao Hu, Minghao Li, Xiaoxu Chen, Weihang Chen

Pith reviewed 2026-07-03 18:50 UTC · model grok-4.3

classification 💻 cs.DC

keywords hybrid attention · position-independent caching · LLM serving · linear attention · segment caching · prefill optimization · RAG serving · KV cache reuse

The pith

Hypic lets hybrid-attention LLMs reuse cached segments across requests by composing linear-layer states with a transition operator and fixing full-attention boundaries with small seam recomputes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that position-independent caching can be extended to hybrid-attention models even though their linear layers hold recurrent states instead of per-token KV pairs. This matters for RAG and agentic workloads where prompts are pieced together from independent segments and the prefill phase dominates latency. Hypic supplies the missing operator for constant-time state composition on linear layers and a boundary-seam fix for the remaining full-attention layers, turning long cold requests into parallelizable work. A reader would care because the result is faster first-token delivery and higher peak throughput while accuracy stays close to full recompute.

Core claim

Hypic is the first serving system for hybrid-attention LLMs that supports position-independent caching. For linear-attention layers it caches the segment-cumulative transition operator together with each segment's zero-start end-state, allowing near-exact constant-time composition of independently stored segments. For the remaining full-attention layers it recomputes only a small seam window at each segment boundary to recover cross-segment lookback accuracy. Segment-level self-containment is further used to parallelize cache-miss prefill across instances.

What carries the argument

The segment-cumulative transition operator, the algebraic primitive that enables constant-time state composition of independently cached segments in linear-attention layers.
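
To make the composition property concrete, here is a sketch for a generic linear recurrence. The symbols $A_t$, $B_t$, and $S_{\text{prev}}$ are assumptions introduced for illustration; the paper's own composition law (its Equation (6)) is not reproduced on this page and may differ in form.

```latex
% Assumed generic recurrence over a segment C of n tokens
S_t = A_t\, S_{t-1} + B_t, \qquad t = 1, \dots, n

% Segment-cumulative transition and zero-start end-state
T_C = A_n A_{n-1} \cdots A_1, \qquad
S_{C|0} = \sum_{t=1}^{n} \bigl(A_n \cdots A_{t+1}\bigr) B_t

% Unrolling the recurrence from an arbitrary incoming state
S_{\text{end}} = T_C\, S_{\text{prev}} + S_{C|0}

% Two independently cached segments compose by applying the rule twice
S_{C_1 C_2} = T_{C_2}\bigl(T_{C_1} S_{\text{prev}} + S_{C_1|0}\bigr) + S_{C_2|0}
```

Under this form the composed state depends on the earlier context only through $S_{\text{prev}}$, so each segment's pair can be computed once from a zero start and reused wherever the segment appears.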

If this is right

Time-to-first-token drops 2.45x on average across tested workloads.
Peak throughput rises by up to 2.0x while accuracy remains within 3.3 points of full recompute.
Long cold requests become accelerable by parallel prefill across instances.
The approach applies to four different hybrid-attention models and five workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same seam-window idea might reduce recompute cost in other mixed-attention or state-space models that lack per-token states.
Segment self-containment could be leveraged for dynamic load balancing across serving clusters when requests share many segments.
The transition operator might allow incremental updates when a cached segment is later extended rather than replaced.

Load-bearing premise

Recomputing only a small seam window at each segment boundary restores enough cross-segment lookback accuracy in full-attention layers without per-token hidden states from the linear layers.

What would settle it

Measure accuracy on a multi-segment prompt workload when the seam window size is forced to zero versus the paper's chosen window size, and compare both to full recompute.

Figures

Figures reproduced from arXiv: 2607.01299 by Junhao Hu, Juntong Wu, Minghao Li, Weihang Chen, Xiaoxu Chen, Yang Liu, Yifei Liu.

**Figure 1.** Existing PIC methods reuse per-token KV cache in full-attention models via splice and correction (left); on hybrid stacks, both primitives fail because linear-attention layers expose only a per-request recurrent state, with no per-token handle (right). Adjacent paper text (Section 1, Introduction): Large language model (LLM) serving is shifting from single-turn chat toward retrieval-augmented question answering [11, 15, 34, 47], multi-d…

**Figure 2.** Memory-access footprint of correction. (a) Full-attention stack: every token's prefix state is in the KV cache, so correction can read it directly. (b) Hybrid stack: linear layers retain only the per-request recurrent state, leaving non-final tokens' prefix states uncached. Adjacent paper text: C1: Thenum District -> Chrysan Company. C2: Derek lives in Thenum District. C3: answer with company name only. Query: Which company do…

**Figure 3.** [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** (a) Cache-miss prefill under PDC and existing PIC vs. (b) parallel execution enabled by segment self-containment. Adjacent paper text: However, this migration assumes a prerequisite that does not hold in a hybrid stack. As shown in…

**Figure 6.** Linear-attention state composition with cached transitions. Each segment caches the tuple $(T_C, S_{C|0})$ at first prefill; at reuse time Hypic composes the prefix end-state and the cached tuples via Equation (6). Adjacent paper text: …segment-cumulative transition operator, a quantity computed as a transient intermediate at every recurrence step yet never persisted by current serving systems. To address this, Hypic caches not only t…
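
The caching and composition steps named in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the paper's kernel: it assumes a scalar decay gate `g_t` per token (so the transition operator collapses to a product of scalars), whereas the hybrid models evaluated may use matrix-valued transitions. The helper names (`prefill`, `cache_segment`, `compose`) and the toy token values are hypothetical. Exact fractions are used so the equality check is not blurred by floating-point error.

```python
from fractions import Fraction

def outer(k, v):
    """Outer product k v^T as a nested list."""
    return [[ki * vj for vj in v] for ki in k]

def axpy(g, state, update):
    """Return g * state + update, elementwise."""
    return [[g * s + u for s, u in zip(srow, urow)]
            for srow, urow in zip(state, update)]

def zeros(d_k, d_v):
    return [[Fraction(0)] * d_v for _ in range(d_k)]

def prefill(tokens, state):
    """Run S_t = g_t * S_{t-1} + k_t v_t^T from the given start state."""
    for g, k, v in tokens:
        state = axpy(g, state, outer(k, v))
    return state

def cache_segment(tokens, d_k, d_v):
    """Compute (T_C, S_C0) once, independently of any preceding context."""
    t_c = Fraction(1)
    for g, _, _ in tokens:
        t_c *= g  # assumed scalar gates: T_C is their product
    return t_c, prefill(tokens, zeros(d_k, d_v))

def compose(state, cached):
    """Advance a running state across a cached segment in one step."""
    t_c, s_c0 = cached
    return axpy(t_c, state, s_c0)

# Hypothetical toy segments: (gate, key, value) per token.
seg_a = [(Fraction(1, 2), [1, 2], [3, 4]), (Fraction(3, 4), [0, 1], [1, 1])]
seg_b = [(Fraction(2, 3), [1, 0], [2, 5])]

# Reference: one sequential pass over the concatenated prompt.
reference = prefill(seg_a + seg_b, zeros(2, 2))

# Reuse: compose independently cached segments.
running = zeros(2, 2)
for cached in (cache_segment(seg_a, 2, 2), cache_segment(seg_b, 2, 2)):
    running = compose(running, cached)
```

The cost of `compose` depends on the state size but not on the segment length, which is the sense in which composition is constant-time per segment.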

**Figure 7.** Seam window across adjacent segments $(C_1, C_2)$: the last $w$ tokens of $C_1$ and the first $w$ tokens of $C_2$ are excluded from each segment's cached state and recomputed jointly at splice time. Adjacent paper text: At cache time, Hypic stores the zero-start end-state $S_{C|0}$ in the public pool. At reuse time, Hypic replaces each $S_{C_i|0}$ in Equation (6) with $R(p_i)\,S_{C_i|0}$ before the prefix $T$-products act, where $p_i$ is segment $C_i$'s global star…
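
As a rough illustration of the seam-window bookkeeping described in this caption, the sketch below splits a multi-segment prompt into reused interior ranges and recomputed seam ranges. The function name `seam_layout` is hypothetical, and two details are assumptions not confirmed by the page: the outer ends of the first and last segments are treated as interior, and every segment is at least `2 * w` tokens long.

```python
def seam_layout(segment_lengths, w):
    """Return (interior, seams) as half-open [start, end) global token ranges."""
    n = len(segment_lengths)
    interior, seams = [], []
    start = 0
    for i, length in enumerate(segment_lengths):
        if length < 2 * w:
            raise ValueError("segment shorter than two seam halves")
        # Drop the first w tokens unless this is the first segment,
        # and the last w tokens unless this is the last segment.
        lo = start + (w if i > 0 else 0)
        hi = start + length - (w if i < n - 1 else 0)
        interior.append((lo, hi))
        if i < n - 1:
            boundary = start + length
            seams.append((boundary - w, boundary + w))
        start += length
    return interior, seams

# Three 1k-token segments with w = 8.
interior, seams = seam_layout([1000, 1000, 1000], w=8)
# interior == [(0, 992), (1008, 1992), (2008, 3000)]
# seams    == [(992, 1008), (1992, 2008)]
recomputed = sum(end - begin for begin, end in seams)  # 32 of 3000 tokens
```

With these inputs only 32 of 3000 tokens fall inside seam windows, which shows why a small `w` keeps the recompute cost low.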

**Figure 8.** Seam-window handling at linear-attention layers. Each segment caches $(T_C, S_{C|0})$ over interior tokens only; at splice time Hypic recomputes the seam window's own $T$ and $S$ on the fly and inserts them into the composition law, jointly advancing the running state and forwarding per-token outputs to the layer above. Adjacent paper text: …key from start $a$ to start $b$ reduces to one left-multiplication, Equation (8): $K_b = R(b - a)\,K_a$. These inter…
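
The relocation identity quoted here, K_b = R(b − a) K_a, can be checked numerically. The sketch below assumes that R is a rotary-style position rotation acting on one two-channel pair with a fixed per-position angle; that reading is an assumption based on the notation, and `THETA`, `raw_key`, and the positions are hypothetical values.

```python
import math

def rotate(pair, steps, theta):
    """Apply the 2-D rotation R(steps) with per-step angle theta to (x, y)."""
    angle = steps * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = pair
    return (c * x - s * y, s * x + c * y)

THETA = 0.01           # hypothetical per-position angle for one channel pair
raw_key = (0.3, -1.2)  # hypothetical un-rotated key pair
a, b = 128, 4096       # old and new segment start positions

k_a = rotate(raw_key, a, THETA)          # key as cached at start a
k_b_direct = rotate(raw_key, b, THETA)   # key recomputed from scratch at start b
k_b_shifted = rotate(k_a, b - a, THETA)  # cached key moved with one R(b - a)

matches = all(math.isclose(p, q, abs_tol=1e-9)
              for p, q in zip(k_b_direct, k_b_shifted))
```

Because rotations compose additively in the angle, moving a cached key needs only the offset `b - a`, never the un-rotated key.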

**Figure 9.** Accelerate long cold requests with segment parallelism. The Hypic Router probes hit status for each segment (Seg 1, 3 hit; Seg 2, 4, 5 miss), LPT-dispatches the miss segments across the worker pool (Seg 2 and 4 to Worker 2; Seg 5 to Worker 3), and designates Worker 1 as the combine node, which pulls cache from peers and assembles the running state. Adjacent paper text: …to a prefill worker pool; each worker prefills its segmen…
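
The dispatch step in this caption can be illustrated with a standard longest-processing-time-first (LPT) greedy assignment. This is a sketch of the general heuristic, not Hypic's router: the function name `lpt_dispatch`, the use of segment length as the cost estimate, and the segment lengths themselves are all assumptions.

```python
import heapq

def lpt_dispatch(miss_lengths, n_workers):
    """Assign miss segments to workers, longest first, always to the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    plan = {worker: [] for worker in range(n_workers)}
    # Sort by descending length; break ties by name for determinism.
    ordered = sorted(miss_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for segment, length in ordered:
        load, worker = heapq.heappop(heap)
        plan[worker].append(segment)
        heapq.heappush(heap, (load + length, worker))
    return plan

# Hypothetical token counts for the three miss segments in the figure.
plan = lpt_dispatch({"seg2": 1500, "seg4": 1000, "seg5": 3000}, n_workers=2)
# plan == {0: ["seg5"], 1: ["seg2", "seg4"]}
```

With these made-up lengths the long segment gets a worker to itself while the two shorter ones share the other, mirroring the grouping shown in the figure.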

**Figure 10.** Accuracy–TTFT Pareto across four models and four datasets. Recovered plot labels: panels Ring-mini (TP=1), Ring-flash (TP=4), Qwen3.5-35B (TP=2), Qwen3.5-122B (TP=4); axes TTFT p50 (s), Throughput (tokens/s/GPU), Request rate (req/s).

**Figure 11.** P50 TTFT and per-GPU token throughput at various QPS on the Prod-RAG trace. Recovered plot labels: panels (a) Segment length, (b) Number of Segments; axis TTFT p50 (s); series Full Recompute, HYPIC.

**Figure 12.** Linear-attention composition scaling: accuracy and TTFT against (a) per-segment length at a fixed segment count of 4, and (b) segment count at a fixed per-segment length of 1k tokens. Adjacent paper text: Recompute grows from 0.141 s to 0.624 s as the prompt becomes 4× longer, while Hypic grows only from 0.103 s to 0.127 s, a speedup that rises from 1.37× at 4k tokens to 4.91× at 16k tokens. We next vary the number of retriev…
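
The quoted speedups follow directly from the four latencies in the passage above. The growth ratios on the second line are derived here and are not stated in the source.

```latex
% Speedup of Hypic over full recompute at 4k and 16k tokens
\frac{0.141\ \text{s}}{0.103\ \text{s}} \approx 1.37, \qquad
\frac{0.624\ \text{s}}{0.127\ \text{s}} \approx 4.91

% Latency growth from 4k to 16k tokens (derived, not quoted)
\frac{0.624}{0.141} \approx 4.43 \ \text{(recompute)}, \qquad
\frac{0.127}{0.103} \approx 1.23 \ \text{(Hypic)}
```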

**Figure 13.** Task accuracy and TTFT against window width $w$ per segment boundary. Adjacent paper text: …compute per-segment end-states and transitions independently, compose the running state via Equation (6), and compare it against Full Recompute…

**Figure 14.** Segment parallelism TTFT breakdown into dispatch forward (parallel per-segment prefill), comm (cross-node KV pull), and combine forward (state composition and seam recompute) as we sweep prefill worker count $n$. Adjacent paper text: …from 8 to 32 raises TTFT by 76 ms while ROUGE-L varies within 0.15 points. Thus, $w=8$ suffices as the default. 6.5 Segment parallelism for cache-miss prefill. Segment parallelism TTFT breakdown. Her…

Original abstract

In retrieval-augmented generation (RAG) and agentic LLM serving, prompts are assembled from independent segments into long contexts, making the prefill stage dominate the per-request computation cost. To address this cost, two directions have emerged in parallel: position-independent caching (PIC) admits KV reuse for non-contiguous segments shared across different requests, while hybrid-attention models reduce computation complexity by replacing most full-attention layers with linear attention. However, they cannot coexist: applying PIC to hybrid-attention models breaks down because per-token KV-cache reuse primitives do not transfer to the per-request recurrent state. In this work, we present Hypic, the first serving system for hybrid-attention LLMs with position-independent caching. For linear-attention layers, we identify the segment-cumulative transition operator as the missing algebraic primitive, and cache it alongside each segment's zero-start end-state, enabling near-exact and constant-time state composition of independently cached segments. For the remaining full-attention layers, existing PIC methods also fail as linear layers do not expose the per-token hidden states for selective recomputation. We show that the most significant attention deviation concentrates at segment boundaries, so recomputing only a small seam window at each boundary suffices to restore cross-segment lookback. Finally, Hypic exploits segment-level self-containment to parallelize cache-miss prefill across instances, turning long cold requests, a major tail-latency contributor under both prefix caching and prior PIC, into an accelerable workload. Evaluated across four hybrid-attention models and five workloads, Hypic reduces time-to-first-token (TTFT) by 2.45x on average and improves peak throughput by up to 2.0x over existing systems, while staying within 3.3 points of full-recompute accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hypic supplies the first concrete primitives to combine position-independent caching with hybrid attention, but the accuracy claim hinges on an untested assumption about error localization at segment boundaries.

The paper's core contribution is the segment-cumulative transition operator that lets linear-attention layers compose cached states across non-contiguous segments in constant time, plus the seam-window recomputation that tries to patch the full-attention layers without needing their per-token hidden states.

It does a clean job of stating the incompatibility: standard PIC primitives break on the recurrent state of linear layers, and full-attention layers lose the selective recompute path once linear layers are in the mix. The parallel cache-miss prefill is a practical addition for the tail-latency case that both prefix caching and earlier PIC suffer from.

The soft spot is the accuracy argument. The abstract asserts that deviation concentrates at boundaries, so a small fixed seam window restores cross-segment lookback, yet it supplies no measurement of how far the deviation actually propagates through residuals or heads, nor how the window size was chosen relative to depth or segment length. If the localization does not hold, the 3.3-point accuracy delta becomes unreliable. The reported 2.45x average TTFT reduction and up-to-2.0x peak-throughput gain also come without any description of workloads, baselines, or variance, so they cannot be taken at face value from the abstract alone.

This is for systems people already working on LLM serving for RAG or agent workloads. A reader who needs to support hybrid models with shared segment caches will find the algebraic primitive and the seam heuristic worth examining. The work is concrete enough to deserve referee time even if the evaluation section needs tightening.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript presents HYPIC, a serving system for hybrid-attention LLMs supporting position-independent caching. For linear-attention layers it caches the segment-cumulative transition operator together with each segment's zero-start end-state to enable constant-time composition of independently cached segments. For full-attention layers it recomputes only a small seam window at each segment boundary on the premise that attention deviation concentrates there. Segment self-containment is exploited to parallelize cache-miss prefill. Across four hybrid-attention models and five workloads the system is reported to reduce TTFT by 2.45× on average, improve peak throughput by up to 2.0×, and stay within 3.3 points of full-recompute accuracy.

Significance. If the accuracy and performance numbers hold, the work is significant for RAG and agentic serving workloads that assemble long contexts from non-contiguous segments. The algebraic identification of the segment-cumulative transition operator as a reusable primitive cleanly extends PIC to linear-attention layers; the parallelization of cold requests directly targets tail latency. These contributions are concrete and could influence practical serving-system design.

major comments (2)

[Full-attention handling description and accuracy evaluation] The accuracy claim (within 3.3 points of full recompute) rests on the assertion that attention deviation in full-attention layers concentrates at segment boundaries so that a fixed seam window suffices. No quantitative bound on deviation propagation through residual connections, layer stacking, or multi-head interactions is supplied, nor is there a sensitivity study relating seam size to model depth or segment length. This is load-bearing for the central accuracy guarantee.
[Evaluation section] The reported 2.45× TTFT and 2.0× throughput figures are presented without workload definitions, baseline implementation details, measurement methodology, or statistical significance tests. These omissions prevent verification that the speedups are reproducible and not the result of post-hoc tuning or unstated workload characteristics.

minor comments (1)

[Linear-attention caching subsection] Notation for the transition operator and zero-start end-state would benefit from an explicit equation or pseudocode block to make the constant-time composition claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [Full-attention handling description and accuracy evaluation] The accuracy claim (within 3.3 points of full recompute) rests on the assertion that attention deviation in full-attention layers concentrates at segment boundaries so that a fixed seam window suffices. No quantitative bound on deviation propagation through residual connections, layer stacking, or multi-head interactions is supplied, nor is there a sensitivity study relating seam size to model depth or segment length. This is load-bearing for the central accuracy guarantee.

Authors: We agree that the manuscript would be strengthened by a quantitative characterization of deviation propagation. The current work relies on empirical measurements across four hybrid-attention models showing that a fixed seam window keeps accuracy within 3.3 points of full recompute. In the revision we will add a sensitivity study that varies seam-window size against model depth and segment length, reports per-layer deviation statistics, and discusses observed empirical bounds on propagation through residuals and stacking. revision: yes
Referee: [Evaluation section] The reported 2.45× TTFT and 2.0× throughput figures are presented without workload definitions, baseline implementation details, measurement methodology, or statistical significance tests. These omissions prevent verification that the speedups are reproducible and not the result of post-hoc tuning or unstated workload characteristics.

Authors: We acknowledge that additional detail is needed for reproducibility. The manuscript already names the four models and five workloads, but the revision will expand the evaluation section with explicit workload definitions (segment counts, lengths, and composition), baseline system configurations and code references, complete measurement methodology (hardware, software stack, timing methodology), and statistical reporting (means and standard deviations over repeated runs). revision: yes
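
The statistical reporting promised in this response could take roughly the following shape, using only Python's standard `statistics` module. The function name `summarize` and the sample latencies are hypothetical placeholders, not measurements from the paper.

```python
import statistics

def summarize(samples):
    """Report mean, sample standard deviation, and median (p50) of repeated runs."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample (n - 1) standard deviation
        "p50": statistics.median(samples),
    }

# Hypothetical TTFT measurements in seconds from five repeated runs.
ttft_runs = [0.120, 0.131, 0.127, 0.125, 0.129]
report = summarize(ttft_runs)
```

Reporting the spread alongside the mean lets a reader judge whether a speedup exceeds run-to-run noise.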

Circularity Check

0 steps flagged

No circularity; empirical measurements independent of inputs

The paper is a systems contribution presenting an implementation (Hypic) and its measured performance on four models and five workloads. No equations, algebraic derivations, fitted parameters, or predictions appear in the abstract or description. TTFT and throughput gains are reported as direct experimental outcomes, and the accuracy delta is an empirical comparison to full-recompute rather than a quantity obtained by construction from any input or self-citation. The seam-window claim is presented as an observed property enabling the design, not as a self-defined or fitted result. The derivation chain therefore contains no load-bearing steps that reduce to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems paper whose central claims are empirical performance numbers; no mathematical axioms, free parameters, or invented physical entities are invoked in the abstract.

Reference graph

Works this paper leans on

63 extracted references · 17 canonical work pages · 6 internal anchors

[1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 117–134

2024
[2]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhid- ian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al
[3]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics

LongBench: A Bilingual, Multitask Benchmark for Long Con- text Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 3119–3137
[4]

Ziyi Cao, Qingsi Si, Jingbin Zhang, and Bingquan Liu. 2026. Sparse Attention Across Multiple-Context KV Cache. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 30165–30173. doi:10. 1609/aaai.v40i36.40266

2026
[5]

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. 2025. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention.arXiv preprint arXiv:2506.13585

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, and Ulf Schlichtmann. 2026. KV Packet: Recomputation-Free Context- Independent KV Caching for LLMs. arXiv:2604.13226 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Du- ality. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, 10041–10071

2024
[8]

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A Large-Scale Multi-Document Summariza- tion Dataset and Abstractive Hierarchical Model. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1074–1084

2019
[9]

In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandel- wal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. InProceedings of Machine Learning and Systems, Vol. 6

2024
[10]

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer
[11]

InProceedings of the 2nd Workshop on New Frontiers in Summarization (EMNLP-IJCNLP 2019 Workshop)

SAMSum Corpus: A Human-Annotated Dialogue Dataset for Abstractive Summarization. InProceedings of the 2nd Workshop on New Frontiers in Summarization (EMNLP-IJCNLP 2019 Workshop). 70–79

2019
[12]

Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, and Marina Lipshteyn. 2016. RDMA over Commodity Ethernet at Scale. InProceedings of the ACM SIGCOMM Conference. 202–215. doi:10.1145/2934872.2934908

work page doi:10.1145/2934872.2934908 2016
[13]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a Multi-hop QA Dataset for Compre- hensive Evaluation of Reasoning Steps. InProceedings of the 28th International Conference on Computational Linguistics. 6609–6625

2020
[14]

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie
[15]

InProceedings of the 42nd International Confer- ence on Machine Learning (Proceedings of Machine Learning Research, Vol

EPIC: Efficient Position-Independent Caching for Serving Large Language Models. InProceedings of the 42nd International Confer- ence on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 24391–24402
[16]

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient Attentions for Long Document Summarization. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics. 1419–1436

2021
[17]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. Swe-bench: Can language models resolve real-world github issues?. InInternational Conference on Learning Representations, Vol. 2024. 54107–54157

2024
[18]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer
[19]

InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1. 1601–1611
[20]

Fu, Christo- pher Ré, and Azalia Mirhoseini

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christo- pher Ré, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. InProceedings of the 41st Interna- tional Conference on Machine Learning (ICML ’24)

2024
[21]

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Trans- formers with Linear Attention. InProceedings of the 37th International Conference on Machine Learning (ICML)

2020
[22]

Kimi Team. 2025. Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.26692 [cs.CL]https://arxiv.org/abs/2510. 26692

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica
[24]

InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23)

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23). ACM, 611–626
[25]

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al . 2024. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out. 74–81

2004
[27]

Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. 2026. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serv- ing. InProceedings of the 24th USENIX Conference on File and Storage Technologies (FAST ’26). USENIX Association, 83–99

2026
[28]

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang
[29]

InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 6588–6601. doi:10.18653/v1/2025.emnlp-main.334

work page doi:10.18653/v1/2025.emnlp-main.334 2025
[30]

Dongyang Ma, Yan Wang, and Tian Lan. 2025. Block-Attention for Effi- cient Prefilling. InThe Thirteenth International Conference on Learning Representations

2025
[31]

2024.NCCL: NVIDIA Collective Communications Library

NVIDIA. 2024.NCCL: NVIDIA Collective Communications Library. https://github.com/NVIDIA/nccl

2024
[32]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA ’24). IEEE, 118–132

2024
[33]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Minxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation – A KVCache-Centric Architecture for Serving LLM Chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST ’25). USENIX Association

2025
[34]

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. 2024. Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models. arXiv:2401.04658 [cs.CL]

work page arXiv 2024
[35]

Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. Qwen Technical Blog.https://qwen.ai/blog?id=qwen3.5 13 Liu et al

2026
[36]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang
[37]

InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

SQuAD: 100,000+ Questions for Machine Comprehension of Text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2383–2392

2016
[38]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2024. RoFormer: Enhanced Transformer with Rotary Position Embedding.Neurocomputing568 (2024), 127063

2024
[39]

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive Net- work: A Successor to Transformer for Large Language Models. arXiv:2307.08621 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, et al . 2025. Every attention matters: An efficient hybrid architecture for long-context reasoning.arXiv preprint arXiv:2510.19338(2025)

work page arXiv 2025
[41]

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Ques- tion Composition.Transactions of the Association for Computational Linguistics10 (2022), 539–554

2022
[42]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. InAdvances in Neural Information Processing Systems, Vol. 30

2017
[43]

Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, and Congfeng Jiang. 2026. From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval- Augmented Generation.Proceedings of the ACM on Management of Data4, 1 (2026). doi:10.1145/3786655

work page doi:10.1145/3786655 2026
[44]

Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Xiaolong Bai, Yizhou Shan, Wei Zhang, Lan Wang, Ying Xiong, Yong Zhang, and Zhenan Fan. 2025. MEPIC: Memory Efficient Position Independent Caching for LLM Serving. arXiv:2512.16822 [cs.LG]

work page arXiv 2025
[45]

Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xi- angyu Zou, Wen Xia, Wentao Zhang, Chongyang Qiu, and Pengfei Wang. 2026. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation. arXiv:2602.02579 [cs.AI]

work page arXiv 2026
[46]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InThe Twelfth International Conference on Learning Representa- tions

2024
[47]

Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. 2025. CacheClip: Accelerating RAG with Effective KV Cache Reuse. arXiv:2510.10129 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, and Deyu Zhang. 2025. KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse. arXiv:2503.16525 [cs.LG]

work page arXiv 2025
[49]

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. 2025. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. InAdvances in Neural Information Processing Systems, Vol. 38

2025
[50]

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 2025. Gated Delta Networks: Improving Mamba2 with Delta Rule. InThe Thirteenth International Conference on Learning Representations

2025
[51]

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. 2024. Gated Linear Attention Transformers with Hardware- Efficient Training. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, 56501–56523

2024
[52]

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim
[53]

InAdvances in Neural Information Processing Systems, Vol

Parallelizing Linear Transformers with the Delta Rule over Se- quence Length. InAdvances in Neural Information Processing Systems, Vol. 37
[54]

2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism

Songlin Yang and Yu Zhang. 2024.FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/fla-org/flash-linear-attention

2024
[55]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369–2380

[56]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys ’25). ACM. doi:10.1145/3689031.3696098

[57]

Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, and Yiran Chen. 2025. KVCOMM: Online Cross-context KV Cache Communication for Efficient LLM-based Multi-agent Systems. In Advances in Neural Information Processing Systems

[58]

Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 11608–11620

[59]

Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, and Jie Tang. 2024. LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 22600–22632

[60]

Shiju Zhao, Junhao Hu, Jiaqi Zheng, and Guihai Chen. 2026. You Need an Encoder for Native Position-Independent Caching. arXiv:2602.01519 [cs.CL] https://arxiv.org/abs/2602.01519

[61]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. In Advances in Neural Information Processing Systems

[62]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 193–210

[63]

Yuechi Zhou, Yi Su, Jianxin Zhang, Juntao Li, Qingrong Xia, Zhefeng Wang, Xinyu Duan, and Baoxing Huai. 2025. A3: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving. arXiv:2511.17560 [cs.CL]

