
arxiv: 2607.01299 · v1 · pith:BFXK4QWPnew · submitted 2026-07-01 · 💻 cs.DC

HYPIC: Accelerating Hybrid-Attention LLM Serving with Position-Independent Caching

Pith reviewed 2026-07-03 18:50 UTC · model grok-4.3

classification 💻 cs.DC
keywords hybrid attention · position-independent caching · LLM serving · linear attention · segment caching · prefill optimization · RAG serving · KV cache reuse

The pith

Hypic lets hybrid-attention LLMs reuse cached segments across requests by composing linear-layer states with a transition operator and fixing full-attention boundaries with small seam recomputes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that position-independent caching can be extended to hybrid-attention models even though their linear layers hold recurrent states instead of per-token KV pairs. This matters for RAG and agentic workloads where prompts are pieced together from independent segments and the prefill phase dominates latency. Hypic supplies the missing operator for constant-time state composition on linear layers and a boundary-seam fix for the remaining full-attention layers, turning long cold requests into parallelizable work. A reader would care because the result is faster first-token delivery and higher peak throughput while accuracy stays close to full recompute.

Core claim

Hypic is the first serving system for hybrid-attention LLMs that supports position-independent caching. For linear-attention layers it caches the segment-cumulative transition operator together with each segment's zero-start end-state, allowing near-exact constant-time composition of independently stored segments. For the remaining full-attention layers it recomputes only a small seam window at each segment boundary to recover cross-segment lookback accuracy. Segment-level self-containment is further used to parallelize cache-miss prefill across instances.

What carries the argument

The segment-cumulative transition operator, the algebraic primitive that enables constant-time state composition of independently cached segments in linear-attention layers.
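
The composition this primitive enables can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation. It assumes a generic linear recurrence S_t = A_t S_{t-1} + k_t v_t^T with a per-token transition matrix A_t; this page does not reproduce the paper's Equation (6), so the recurrence form, the function names, and the toy dimensions are all assumptions, and positional rotation is ignored. Under that assumption, a segment's end-state from any starting state S_in is T_C S_in + S_{C|0}, where T_C is the product of the segment's per-token transitions and S_{C|0} is its end-state from a zero start.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def outer(k, v):
    return [[ki * vj for vj in v] for ki in k]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def prefill(tokens, state):
    """Assumed recurrence: S_t = A_t S_{t-1} + k_t v_t^T, run token by token."""
    for a_t, k_t, v_t in tokens:
        state = matadd(matmul(a_t, state), outer(k_t, v_t))
    return state

def cache_segment(tokens, d_k, d_v):
    """Return (T_C, S_{C|0}): cumulative transition and zero-start end-state."""
    transition = identity(d_k)
    for a_t, _, _ in tokens:
        transition = matmul(a_t, transition)  # later tokens multiply on the left
    return transition, prefill(tokens, zeros(d_k, d_v))

def compose(state, cached):
    """Advance a running state across one cached segment: T_C S + S_{C|0}."""
    transition, end_state = cached
    return matadd(matmul(transition, state), end_state)

def random_tokens(n, d_k, d_v, rng):
    # Hypothetical toy inputs; real layers derive A_t, k_t, v_t from activations.
    return [([[rng.uniform(-0.5, 0.5) for _ in range(d_k)] for _ in range(d_k)],
             [rng.uniform(-1, 1) for _ in range(d_k)],
             [rng.uniform(-1, 1) for _ in range(d_v)]) for _ in range(n)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
D_K, D_V = 3, 2
seg1 = random_tokens(5, D_K, D_V, rng)
seg2 = random_tokens(7, D_K, D_V, rng)
cache1 = cache_segment(seg1, D_K, D_V)
cache2 = cache_segment(seg2, D_K, D_V)

# Compose cached segments in order and compare with a full recompute.
state = zeros(D_K, D_V)
for cached in (cache1, cache2):
    state = compose(state, cached)
composed_error = max_abs_diff(prefill(seg1 + seg2, zeros(D_K, D_V)), state)

# Reuse the same cached tuples in the opposite order.
state = zeros(D_K, D_V)
for cached in (cache2, cache1):
    state = compose(state, cached)
reordered_error = max_abs_diff(prefill(seg2 + seg1, zeros(D_K, D_V)), state)
```

In this sketch the cost of `compose` depends only on the state dimensions, not on the number of tokens in the segment, which is the sense in which composition is constant-time per segment.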

If this is right

  • Time-to-first-token (TTFT) falls by 2.45x on average relative to existing systems across the tested workloads.
  • Peak throughput rises by up to 2.0x over existing systems, while accuracy stays within 3.3 points of full recompute.
  • Long cold requests become accelerable by parallel prefill across instances.
  • The approach is evaluated on four hybrid-attention models and five workloads.
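
The parallel-prefill point can be made concrete with a scheduling sketch. The Figure 9 caption on this page says the router "LPT-dispatches" cache-miss segments across a worker pool; the snippet below shows plain Longest-Processing-Time-first assignment under the assumption that a segment's token count is a usable proxy for its prefill cost. The function name, the cost proxy, and the token counts are hypothetical and not taken from the paper.

```python
import heapq

def lpt_dispatch(miss_segments: dict[str, int], num_workers: int) -> dict[int, list[str]]:
    """Assign each cache-miss segment to the currently least-loaded worker,
    visiting segments from most to least expensive (LPT heuristic)."""
    heap = [(0, worker) for worker in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    assignment: dict[int, list[str]] = {w: [] for w in range(num_workers)}
    ordered = sorted(miss_segments.items(), key=lambda kv: (-kv[1], kv[0]))
    for segment, cost in ordered:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(segment)
        heapq.heappush(heap, (load + cost, worker))
    return assignment

# Hypothetical token counts for three missing segments.
misses = {"seg2": 1000, "seg4": 800, "seg5": 2000}
plan = lpt_dispatch(misses, num_workers=2)
makespan = max(sum(misses[s] for s in segs) for segs in plan.values())
print(plan)      # {0: ['seg5'], 1: ['seg2', 'seg4']}
print(makespan)  # 2000
```

With these made-up lengths the longest segment gets a worker to itself and the two shorter ones share the other, so the slowest worker handles 2000 tokens instead of the 3800 a single worker would process.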

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same seam-window idea might reduce recompute cost in other mixed-attention or state-space models that lack per-token states.
  • Segment self-containment could be leveraged for dynamic load balancing across serving clusters when requests share many segments.
  • The transition operator might allow incremental updates when a cached segment is later extended rather than replaced.

Load-bearing premise

Recomputing only a small seam window at each segment boundary restores enough cross-segment lookback accuracy in full-attention layers without per-token hidden states from the linear layers.
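
To make the size of this premise concrete, the sketch below computes which token positions a seam window would cover. It follows the Figure 7 caption on this page (the last w tokens of the left segment and the first w tokens of the right segment are recomputed jointly) and uses w=8, which the Figure 14 excerpt reports as the default. The function name, the half-open range convention, and the clamping for short segments are assumptions made for illustration.

```python
def seam_windows(segment_lengths: list[int], w: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) global token ranges, one per boundary,
    covering the last w tokens on the left and the first w on the right."""
    windows = []
    boundary = 0
    for left_len, right_len in zip(segment_lengths, segment_lengths[1:]):
        boundary += left_len
        # Clamp so a window never reaches past a segment shorter than w.
        # (Windows of neighbouring boundaries may still overlap in that case.)
        start = boundary - min(w, left_len)
        end = boundary + min(w, right_len)
        windows.append((start, end))
    return windows

lengths = [1000, 1000, 1000]  # hypothetical segment lengths in tokens
windows = seam_windows(lengths, w=8)
recomputed = sum(end - start for start, end in windows)
print(windows)                     # [(992, 1008), (1992, 2008)]
print(recomputed / sum(lengths))   # fraction of tokens recomputed
```

For three 1,000-token segments this recomputes 32 of 3,000 tokens; setting w=0 yields empty windows, which corresponds to the no-seam condition in the test proposed below.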

What would settle it

Measure accuracy on a multi-segment prompt workload when the seam window size is forced to zero versus the paper's chosen window size, and compare both to full recompute.

Figures

Figures reproduced from arXiv: 2607.01299 by Junhao Hu, Juntong Wu, Minghao Li, Weihang Chen, Xiaoxu Chen, Yang Liu, Yifei Liu.

Figure 1
Existing PIC methods reuse per-token KV cache in full-attention models via splice and correction (left); on hybrid stacks, both primitives fail because linear-attention layers expose only a per-request recurrent state, with no per-token handle (right).
Adjacent text: 1 Introduction. Large language model (LLM) serving is shifting from single-turn chat toward retrieval-augmented question answering [11, 15, 34, 47], multi-d…
Figure 2
Memory-access footprint of correction. (a) Full-attention stack: every token’s prefix state is in the KV cache, so correction can read it directly. (b) Hybrid stack: linear layers retain only the per-request recurrent state, leaving non-final tokens’ prefix states uncached.
Adjacent text: C1: Thenum District -> Chrysan Company. C2: Derek lives in Thenum District. C3: answer with company name only. Query: Which company do…
Figure 3
Figure 4
(a) Cache-miss prefill under PDC and existing PIC vs. (b) parallel execution enabled by segment self-containment.
Adjacent text: However, this migration assumes a prerequisite that does not hold in a hybrid stack. As shown in…
Figure 6
Linear-attention state composition with cached transitions. Each segment caches the tuple (T_C, S_{C|0}) at first prefill; at reuse time Hypic composes the prefix end-state and the cached tuples via Equation (6).
Adjacent text: segment-cumulative transition operator, a quantity computed as a transient intermediate at every recurrence step yet never persisted by current serving systems. To address this, Hypic caches not only t…
Figure 7
Seam window across adjacent segments (C_1, C_2): the last w tokens of C_1 and the first w tokens of C_2 are excluded from each segment’s cached state and recomputed jointly at splice time.
Adjacent text: At cache time, Hypic stores the zero-start end-state S_{C|0} in the public pool. At reuse time, Hypic replaces each S_{C_i|0} in Equation (6) with R(p_i) S_{C_i|0} before the prefix T-products act, where p_i is segment C_i’s global star…
Figure 8
Seam-window handling at linear-attention layers. Each segment caches (T_C, S_{C|0}) over interior tokens only; at splice time Hypic recomputes the seam window’s own T and S on the fly and inserts them into the composition law, jointly advancing the running state and forwarding per-token outputs to the layer above.
Adjacent text: key from start a to start b reduces to one left-multiplication: K_b = R(b − a) K_a. (8) These inter…
Figure 9
Accelerate long cold requests with segment parallelism. The Hypic Router probes hit status for each segment (Seg 1, 3 hit; Seg 2, 4, 5 miss), LPT-dispatches the miss segments across the worker pool (Seg 2 and 4 to Worker 2; Seg 5 to Worker 3), and designates Worker 1 as the combine node, which pulls cache from peers and assembles the running state.
Adjacent text: to a prefill worker pool; each worker prefills its segmen…
Figure 10
Accuracy–TTFT Pareto across four models and four datasets.
Plot labels captured after the caption: panels Ring-mini (TP=1), Ring-flash (TP=4), Qwen3.5-35B (TP=2), Qwen3.5-122B (TP=4); y-axes TTFT p50 (s) and Throughput (tokens/s/GPU); x-axis Request rate (req/s).
Figure 11
P50 TTFT and per-GPU token throughput at various QPS on the Prod-RAG trace.
Plot labels captured after the caption: panels (a) Segment length and (b) Number of Segments; y-axis TTFT p50 (s); series Full Recompute and HYPIC.
Figure 12
Linear-attention composition scaling: accuracy and TTFT against (a) per-segment length at a fixed segment count of 4, and (b) segment count at a fixed per-segment length of 1k tokens.
Adjacent text: Recompute grows from 0.141 s to 0.624 s as the prompt becomes 4× longer, while Hypic grows only from 0.103 s to 0.127 s, a speedup that rises from 1.37× at 4k tokens to 4.91× at 16k tokens. We next vary the number of retriev…
Figure 14
Segment parallelism TTFT breakdown into dispatch forward (parallel per-segment prefill), comm (cross-node KV pull), and combine forward (state composition and seam recompute) as we sweep prefill worker count n.
Adjacent text: from 8 to 32 raises TTFT by 76 ms while ROUGE-L varies within 0.15 points. Thus, w=8 suffices as the default. 6.5 Segment parallelism for cache-miss prefill. Segment parallelism TTFT breakdown. Her…
Figure 13
Task accuracy and TTFT against window width w per segment boundary.
Adjacent text: compute per-segment end-states and transitions independently, compose the running state via Equation (6), and compare it against Full Recompute
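
The Figure 8 excerpt above states that moving a cached key from start position a to start position b reduces to one left-multiplication, K_b = R(b − a) K_a. The sketch below checks that identity numerically for a single two-dimensional rotary pair. It is a simplified illustration: the single fixed frequency `theta`, the function name, and the sample values are assumptions, whereas real rotary embeddings apply many such rotations at different frequencies.

```python
import math

def rotate(pair: tuple[float, float], position: int, theta: float = 0.01) -> tuple[float, float]:
    """Apply a 2-D rotation R(position) with angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = pair
    return (c * x - s * y, s * x + c * y)

key = (0.3, -1.2)   # hypothetical un-rotated key components
a, b = 40, 1500     # cached start position and reuse start position

k_a = rotate(key, a)                # key as cached at position a
k_b_direct = rotate(key, b)         # key recomputed from scratch at position b
k_b_shifted = rotate(k_a, b - a)    # cached key re-rotated by the offset only

rotation_error = max(abs(p - q) for p, q in zip(k_b_direct, k_b_shifted))
```

Because rotations compose additively in the angle, the re-rotated cached key matches the recomputed one up to floating-point error, so the cached value never needs the original un-rotated key.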
Original abstract

In retrieval-augmented generation (RAG) and agentic LLM serving, prompts are assembled from independent segments into long contexts, making the prefill stage dominate the per-request computation cost. To reduce this cost, two directions have emerged in parallel: position-independent caching (PIC) admits KV reuse for non-contiguous segments shared across different requests, while hybrid-attention models reduce computational complexity by replacing most full-attention layers with linear attention. However, they cannot coexist: applying PIC to hybrid-attention models breaks down because per-token KV-cache reuse primitives do not transfer to the per-request recurrent state. In this work, we present Hypic, the first serving system for hybrid-attention LLMs with position-independent caching. For linear-attention layers, we identify the segment-cumulative transition operator as the missing algebraic primitive, and cache it alongside each segment's zero-start end-state, enabling near-exact and constant-time state composition of independently cached segments. For the remaining full-attention layers, existing PIC methods also fail as linear layers do not expose the per-token hidden states for selective recomputation. We show that the most significant attention deviation concentrates at segment boundaries, so recomputing only a small seam window at each boundary suffices to restore cross-segment lookback. Finally, Hypic exploits segment-level self-containment to parallelize cache-miss prefill across instances, turning long cold requests (a major tail-latency contributor under both prefix caching and prior PIC) into an accelerable workload. Evaluated across four hybrid-attention models and five workloads, Hypic reduces time-to-first-token (TTFT) by 2.45x on average and improves peak throughput by up to 2.0x over existing systems, while staying within 3.3 points of full-recompute accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents HYPIC, a serving system for hybrid-attention LLMs supporting position-independent caching. For linear-attention layers it caches the segment-cumulative transition operator together with each segment's zero-start end-state to enable constant-time composition of independently cached segments. For full-attention layers it recomputes only a small seam window at each segment boundary on the premise that attention deviation concentrates there. Segment self-containment is exploited to parallelize cache-miss prefill. Across four hybrid-attention models and five workloads the system is reported to reduce TTFT by 2.45× on average, improve peak throughput by up to 2.0×, and stay within 3.3 points of full-recompute accuracy.

Significance. If the accuracy and performance numbers hold, the work is significant for RAG and agentic serving workloads that assemble long contexts from non-contiguous segments. The algebraic identification of the segment-cumulative transition operator as a reusable primitive cleanly extends PIC to linear-attention layers; the parallelization of cold requests directly targets tail latency. These contributions are concrete and could influence practical serving-system design.

major comments (2)
  1. [Full-attention handling description and accuracy evaluation] The accuracy claim (within 3.3 points of full recompute) rests on the assertion that attention deviation in full-attention layers concentrates at segment boundaries so that a fixed seam window suffices. No quantitative bound on deviation propagation through residual connections, layer stacking, or multi-head interactions is supplied, nor is there a sensitivity study relating seam size to model depth or segment length. This is load-bearing for the central accuracy guarantee.
  2. [Evaluation section] The reported 2.45× TTFT and 2.0× throughput figures are presented without workload definitions, baseline implementation details, measurement methodology, or statistical significance tests. These omissions prevent verification that the speedups are reproducible and not the result of post-hoc tuning or unstated workload characteristics.
minor comments (1)
  1. [Linear-attention caching subsection] Notation for the transition operator and zero-start end-state would benefit from an explicit equation or pseudocode block to make the constant-time composition claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Full-attention handling description and accuracy evaluation] The accuracy claim (within 3.3 points of full recompute) rests on the assertion that attention deviation in full-attention layers concentrates at segment boundaries so that a fixed seam window suffices. No quantitative bound on deviation propagation through residual connections, layer stacking, or multi-head interactions is supplied, nor is there a sensitivity study relating seam size to model depth or segment length. This is load-bearing for the central accuracy guarantee.

    Authors: We agree that the manuscript would be strengthened by a quantitative characterization of deviation propagation. The current work relies on empirical measurements across four hybrid-attention models showing that a fixed seam window keeps accuracy within 3.3 points of full recompute. In the revision we will add a sensitivity study that varies seam-window size against model depth and segment length, reports per-layer deviation statistics, and discusses observed empirical bounds on propagation through residuals and stacking. revision: yes

  2. Referee: [Evaluation section] The reported 2.45× TTFT and 2.0× throughput figures are presented without workload definitions, baseline implementation details, measurement methodology, or statistical significance tests. These omissions prevent verification that the speedups are reproducible and not the result of post-hoc tuning or unstated workload characteristics.

    Authors: We acknowledge that additional detail is needed for reproducibility. The manuscript already names the four models and five workloads, but the revision will expand the evaluation section with explicit workload definitions (segment counts, lengths, and composition), baseline system configurations and code references, complete measurement methodology (hardware, software stack, timing methodology), and statistical reporting (means and standard deviations over repeated runs). revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements independent of inputs

Full rationale

The paper is a systems contribution presenting an implementation (Hypic) and its measured performance on four models and five workloads. No equations, algebraic derivations, fitted parameters, or predictions appear in the abstract or description. TTFT and throughput gains are reported as direct experimental outcomes, and the accuracy delta is an empirical comparison to full-recompute rather than a quantity obtained by construction from any input or self-citation. The seam-window claim is presented as an observed property enabling the design, not as a self-defined or fitted result. The derivation chain therefore contains no load-bearing steps that reduce to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems paper whose central claims are empirical performance numbers; no mathematical axioms, free parameters, or invented physical entities are invoked in the abstract.

pith-pipeline@v0.9.1-grok · 5883 in / 1152 out tokens · 30353 ms · 2026-07-03T18:50:46.641099+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 17 canonical work pages · 6 internal anchors

  1. [1]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 117–134

  2. [2]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al

  3. [3]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 3119–3137

  4. [4]

    Ziyi Cao, Qingsi Si, Jingbin Zhang, and Bingquan Liu. 2026. Sparse Attention Across Multiple-Context KV Cache. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 30165–30173. doi:10.1609/aaai.v40i36.40266

  5. [5]

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. 2025. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585

  6. [6]

    Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, and Ulf Schlichtmann. 2026. KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs. arXiv:2604.13226 [cs.CL]

  7. [7]

    Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, 10041–10071

  8. [8]

    Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1074–1084

  9. [9]

    In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Proceedings of Machine Learning and Systems, Vol. 6

  10. [10]

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer

  11. [11]

    SAMSum Corpus: A Human-Annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization (EMNLP-IJCNLP 2019 Workshop). 70–79

  12. [12]

    Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitendra Padhye, and Marina Lipshteyn. 2016. RDMA over Commodity Ethernet at Scale. In Proceedings of the ACM SIGCOMM Conference. 202–215. doi:10.1145/2934872.2934908

  13. [13]

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Proceedings of the 28th International Conference on Computational Linguistics. 6609–6625

  14. [14]

    Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie

  15. [15]

    EPIC: Efficient Position-Independent Caching for Serving Large Language Models. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 24391–24402

  16. [16]

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. 2021. Efficient Attentions for Long Document Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics. 1419–1436

  17. [17]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, Vol. 2024. 54107–54157

  18. [18]

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer

  19. [19]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1. 1601–1611

  20. [20]

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. In Proceedings of the 41st International Conference on Machine Learning (ICML ’24)

  21. [21]

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the 37th International Conference on Machine Learning (ICML)

  22. [22]

    Kimi Team. 2025. Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.26692 [cs.CL] https://arxiv.org/abs/2510.26692

  23. [23]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica

  24. [24]

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23). ACM, 611–626

  25. [25]

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887 (2024)

  26. [26]

    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. 74–81

  27. [27]

    Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. 2026. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving. In Proceedings of the 24th USENIX Conference on File and Storage Technologies (FAST ’26). USENIX Association, 83–99

  28. [28]

    Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang

  29. [29]

    TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 6588–6601. doi:10.18653/v1/2025.emnlp-main.334

  30. [30]

    Dongyang Ma, Yan Wang, and Tian Lan. 2025. Block-Attention for Efficient Prefilling. In The Thirteenth International Conference on Learning Representations

  31. [31]

    NVIDIA. 2024. NCCL: NVIDIA Collective Communications Library. https://github.com/NVIDIA/nccl

  32. [32]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA ’24). IEEE, 118–132

  33. [33]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Minxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation – A KVCache-Centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST ’25). USENIX Association

  34. [34]

    Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. 2024. Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models. arXiv:2401.04658 [cs.CL]

  35. [35]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. Qwen Technical Blog. https://qwen.ai/blog?id=qwen3.5

  36. [36]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang

  37. [37]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2383–2392

  38. [38]

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2024. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing 568 (2024), 127063

  39. [39]

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621 [cs.CL]

  40. [40]

    Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, et al. 2025. Every attention matters: An efficient hybrid architecture for long-context reasoning. arXiv preprint arXiv:2510.19338 (2025)

  41. [41]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 10 (2022), 539–554

  42. [42]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, Vol. 30

  43. [43]

    Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, and Congfeng Jiang. 2026. From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation. Proceedings of the ACM on Management of Data 4, 1 (2026). doi:10.1145/3786655

  44. [44]

    Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Xiaolong Bai, Yizhou Shan, Wei Zhang, Lan Wang, Ying Xiong, Yong Zhang, and Zhenan Fan. 2025. MEPIC: Memory Efficient Position Independent Caching for LLM Serving. arXiv:2512.16822 [cs.LG]

  45. [45]

    Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xiangyu Zou, Wen Xia, Wentao Zhang, Chongyang Qiu, and Pengfei Wang. 2026. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation. arXiv:2602.02579 [cs.AI]

  46. [46]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations

  47. [47]

    Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. 2025. CacheClip: Accelerating RAG with Effective KV Cache Reuse. arXiv:2510.10129 [cs.LG]

  48. [48]

    Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, and Deyu Zhang. 2025. KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse. arXiv:2503.16525 [cs.LG]

  49. [49]

    Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. 2025. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. In Advances in Neural Information Processing Systems, Vol. 38

  50. [50]

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. 2025. Gated Delta Networks: Improving Mamba2 with Delta Rule. In The Thirteenth International Conference on Learning Representations

  51. [51]

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. 2024. Gated Linear Attention Transformers with Hardware-Efficient Training. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, 56501–56523

  52. [52]

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim

  53. [53]

    Parallelizing Linear Transformers with the Delta Rule over Sequence Length. In Advances in Neural Information Processing Systems, Vol. 37

  54. [54]

    Songlin Yang and Yu Zhang. 2024. FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/fla-org/flash-linear-attention

  55. [55]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2369–2380

  56. [56]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys ’25). ACM. doi:10.1145/3689031.3696098

  57. [57]

    Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, and Yiran Chen. 2025. KVCOMM: Online Cross-context KV Cache Communication for Efficient LLM-based Multi-agent Systems. In Advances in Neural Information Processing Systems

  58. [58]

    Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 11608–11620

  59. [59]

    Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, and Jie Tang. 2024. LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 22600–22632

  60. [60]

    Shiju Zhao, Junhao Hu, Jiaqi Zheng, and Guihai Chen. 2026. You Need an Encoder for Native Position-Independent Caching. arXiv:2602.01519 [cs.CL] https://arxiv.org/abs/2602.01519

  61. [61]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. In Advances in Neural Information Processing Systems

  62. [62]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’24). USENIX Association, 193–210

  63. [63]

    Yuechi Zhou, Yi Su, Jianxin Zhang, Juntao Li, Qingrong Xia, Zhefeng Wang, Xinyu Duan, and Baoxing Huai. 2025. A3: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving. arXiv:2511.17560 [cs.CL]