
arxiv: 2510.10129 · v2 · pith:KNUHEYPFnew · submitted 2025-10-11 · 💻 cs.LG · cs.AI

CacheClip: Accelerating RAG with Effective KV Cache Reuse

Pith reviewed 2026-05-22 12:32 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords RAG · KV cache reuse · attention distribution similarity · token selection · TTFT reduction · inter-chunk attention · auxiliary model · prefill acceleration

The pith

Small auxiliary LLMs identify critical tokens via last-layer attention similarity to enable selective KV cache recomputation in RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that small auxiliary models produce last-layer attention distributions close enough to those of the primary LLM to pick out the tokens whose KV states must be recomputed when stitching retrieved chunks together. This selection restores the inter-chunk attention that prefix caching and simple precomputation normally lose, while shared prefixes and a sliding-window grouping keep local coherence. The recomputation fraction is left as a tunable parameter so that deployments can trade speed for quality. If the similarity holds, RAG systems gain both lower time-to-first-token and generation quality that approaches full attention on cross-chunk reasoning tasks.

Core claim

CacheClip demonstrates that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs, enabling efficient identification of tokens critical for restoring inter-chunk attention; this supports auxiliary-model-guided token selection for selective KV cache recomputation, combined with shared prefixes, sliding-window grouping, and CPU-GPU hybrid execution, so that RAG inference can be accelerated while retaining up to 85.2 percent of full-attention performance on NIAH and 91.1 percent on LongBench at 20 percent recomputation.

What carries the argument

Auxiliary-model-guided token selection that uses last-layer attention similarity to decide which tokens require KV cache recomputation across chunk boundaries.
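The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `attention_scores` is a hypothetical per-token importance score (for example, last-layer attention mass taken from the auxiliary model), and `select_tokens_for_recompute` is a name introduced here for illustration.

```python
import math


def select_tokens_for_recompute(attention_scores, recompute_ratio):
    """Return the sorted positions of the highest-scoring tokens.

    attention_scores: one score per context token (hypothetical input,
        assumed to come from the auxiliary model's last-layer attention).
    recompute_ratio: fraction of tokens whose KV states are recomputed.
    """
    if not 0.0 <= recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be in [0, 1]")
    budget = math.ceil(recompute_ratio * len(attention_scores))
    # Rank positions by score, highest first; ties go to the earlier token.
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: (-attention_scores[i], i))
    return sorted(ranked[:budget])


scores = [0.02, 0.30, 0.05, 0.01, 0.25, 0.03, 0.04, 0.10, 0.15, 0.05]
print(select_tokens_for_recompute(scores, 0.2))  # -> [1, 4]
```

Tokens outside the returned set would keep their precomputed KV states; only the selected positions would be recomputed with cross-chunk context.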

If this is right

  • Adjusting the recomputation ratio lets users control the speed-quality trade-off without retraining.
  • Shared prefixes remove repeated attention sinks that would otherwise waste compute on every chunk.
  • Sliding-window grouping preserves local token coherence while only partial KV states are updated.
  • Offloading the auxiliary model to CPU avoids extra GPU memory or compute cost during prefill.
  • The method outperforms prior reuse techniques such as CacheBlend and APE on both NIAH and LongBench at equal recomputation budgets.
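The sliding-window grouping mentioned in the list above is only described at a high level here, so the following is one possible reading rather than the paper's algorithm: score overlapping windows by their summed attention and recompute whole windows, so that selected tokens stay with their neighbors. The function name, the greedy strategy, and the `window`, `stride`, and `budget` parameters are all assumptions made for illustration.

```python
def select_windows(scores, window, stride, budget):
    """Greedily pick high-scoring windows until about `budget` tokens are covered.

    The result can exceed `budget` by up to `window - 1` tokens, because
    windows are always added whole.
    """
    n = len(scores)
    starts = range(0, max(n - window, 0) + 1, stride)
    # Each candidate is (summed score, start position).
    candidates = [(sum(scores[s:s + window]), s) for s in starts]
    candidates.sort(key=lambda c: (-c[0], c[1]))
    chosen = set()
    for _, start in candidates:
        if len(chosen) >= budget:
            break
        chosen.update(range(start, min(start + window, n)))
    return sorted(chosen)


window_scores = [0.02, 0.30, 0.05, 0.01, 0.25, 0.03, 0.04, 0.10, 0.15, 0.05]
print(select_windows(window_scores, window=2, stride=1, budget=4))
# -> [0, 1, 2, 4, 5]
```

Compared with picking isolated top-scoring tokens, this variant trades a slightly larger recomputation set for contiguous runs of updated KV states.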

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-similarity signal might let future systems decide on the fly whether a retrieved chunk needs any recomputation at all.
  • Because the auxiliary runs on CPU, the technique could be deployed on hardware where GPU memory is the main constraint.
  • If attention similarity proves stable across model families, CacheClip-style selection could become a standard prefill optimization for any long-context retrieval pipeline.
  • Testing the approach on even smaller auxiliaries or distilled versions would directly measure how far the similarity assumption can be pushed.

Load-bearing premise

The last-layer attention maps of the small auxiliary model remain similar enough to the primary model's maps across different primary models, tasks, and chunk boundaries to select the right tokens for recomputation.

What would settle it

On a cross-chunk reasoning benchmark, run both the auxiliary and primary models on the same input chunks and measure whether their last-layer attention rankings differ enough that the selected recomputation set produces quality more than 10 points below full attention.
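One way to quantify how far the two rankings differ is top-k overlap: the share of the primary model's k highest-scoring tokens that the auxiliary model also places in its top k. The sketch below assumes per-token scores are already available from both models; the function names and the toy score lists are hypothetical.

```python
def top_k_indices(scores, k):
    """Positions of the k largest scores; ties go to the earlier token."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


def top_k_overlap(aux_scores, primary_scores, k):
    """Fraction of the primary model's top-k tokens also in the auxiliary top-k."""
    if len(aux_scores) != len(primary_scores):
        raise ValueError("score lists must cover the same tokens")
    if not 0 < k <= len(aux_scores):
        raise ValueError("k must be between 1 and the number of tokens")
    shared = top_k_indices(aux_scores, k) & top_k_indices(primary_scores, k)
    return len(shared) / k


aux = [0.10, 0.40, 0.05, 0.30, 0.15]      # toy auxiliary-model scores
primary = [0.05, 0.35, 0.20, 0.30, 0.10]  # toy primary-model scores
print(top_k_overlap(aux, primary, 2))  # -> 1.0
print(round(top_k_overlap(aux, primary, 3), 3))  # -> 0.667
```

An overlap that falls sharply as k approaches the recomputation budget would suggest the auxiliary model is selecting different tokens than the primary model would.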

Figures

Figures reproduced from arXiv: 2510.10129 by Bin Yang, Jun Zeng, Qiuyu Leng, Zhenhua Wu.

Figure 1
Figure 1: Illustration of CacheClip. However, this approach is overly strict for RAG scenarios where retrieved chunks vary across queries. Prefix caching typically only benefits the first few chunks, while RAG systems usually involve ten or more retrieved chunks, making it insufficient for meaningful TTFT acceleration. A more straightforward approach is direct precomputation: independently precomputing KV cache for …
Figure 2
Figure 2: Comparison of Attention Similarity between Qwen2.5-0.5B and Qwen2.5-7B on 500 Sequences per Input
Figure 3
Figure 3: Workflow of CacheClip. 3 Methodology of CacheClip. Building on the key insights discussed in the previous sections, we design CacheClip with two guiding principles drawn from prior studies. APE [16] demonstrates that using the original local KV cache can still yield reasonable outputs, even without modification. Meanwhile, H2O [31] and CacheBlend [17] observe that only a small subset of tokens significantly …
Figure 4
Figure 4: Performance Comparison across RULER Test Cases (Qwen2.5-14B, Input Length = 8192).
Figure 5
Figure 5: Impact of Recompute Ratio on RULER Performance (Input Length = 8192).
Figure 6
Figure 6: Impact of Recompute Ratio on RULER-multivalue Performance (Input Length = 8192).
Figure 7
Figure 7: Performance Comparison across LongBench Test Cases (Qwen2.5-14B).
Original abstract

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33× in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CacheClip, a framework to accelerate RAG inference by mitigating TTFT bottlenecks through KV cache reuse. Its central approach relies on small auxiliary LLMs to identify tokens for selective recomputation by exploiting the similarity of their last-layer attention distributions to those of the primary model, combined with shared prefixes to reduce attention sinks, a sliding-window grouping strategy, and a CPU-GPU hybrid execution model. At a 20% recomputation ratio, it reports retaining 85.2% of full-attention performance on NIAH and 91.1% on LongBench while achieving up to 3.33× prefill speedup and outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH and by 4.5 and 4.2 points on LongBench.

Significance. If the auxiliary-to-primary attention similarity assumption proves robust, CacheClip would offer a practical, adjustable engineering solution to the efficiency-quality trade-off in RAG systems, with concrete reported speedups and retention figures that could influence deployment practices for long-context retrieval-augmented tasks. The hybrid CPU offload and adjustable recomp% parameter are pragmatic strengths that distinguish it from prior prefix-caching or full precomputation baselines.

major comments (2)
  1. [Abstract] Abstract (key insight paragraph): The claim that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs is presented as the enabling observation for token selection, yet no quantitative metrics (token overlap, KL divergence, or correlation values) or cross-boundary validation are supplied. This assumption is load-bearing for the selective recomputation that supports the 85.2% NIAH and 91.1% LongBench retention figures at 20% recomp%; divergence on cross-chunk reasoning would directly weaken those results.
  2. [Experiments] Experiments (reported results): The retention and speedup numbers are given without error bars, per-dataset breakdowns, or ablations testing attention-similarity transfer across model families, chunk sizes, or query types. This omission makes it impossible to evaluate whether the 16.1-point NIAH gain over CacheBlend generalizes or depends on specific conditions where the auxiliary-primary similarity holds.
minor comments (2)
  1. Clarify the exact auxiliary and primary model pairs, chunk sizes, and query distributions used for the NIAH and LongBench numbers so readers can reproduce the attention-similarity premise.
  2. The sliding-window grouping strategy and shared-prefix mechanism are described at a high level; a diagram or pseudocode would improve clarity on how local coherence is preserved during partial KV updates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will revise the manuscript to address the concerns raised, particularly by adding quantitative support for our key assumption and enhancing the experimental reporting.

Point-by-point responses
  1. Referee: [Abstract] Abstract (key insight paragraph): The claim that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs is presented as the enabling observation for token selection, yet no quantitative metrics (token overlap, KL divergence, or correlation values) or cross-boundary validation are supplied. This assumption is load-bearing for the selective recomputation that supports the 85.2% NIAH and 91.1% LongBench retention figures at 20% recomp%; divergence on cross-chunk reasoning would directly weaken those results.

    Authors: We agree that explicit quantitative metrics would better substantiate the core assumption. While the manuscript presents the similarity observation as the foundation for auxiliary-guided selection, the initial version omitted direct metrics such as KL divergence or token overlap. In the revised manuscript we will add a dedicated analysis section reporting average KL divergence, Pearson correlation, and top-k token overlap between last-layer attention distributions of the auxiliary and primary models across representative RAG inputs. We will also include cross-chunk boundary validation to confirm robustness on inter-chunk reasoning tasks, directly supporting the reported retention figures at 20% recomputation. revision: yes

  2. Referee: [Experiments] Experiments (reported results): The retention and speedup numbers are given without error bars, per-dataset breakdowns, or ablations testing attention-similarity transfer across model families, chunk sizes, or query types. This omission makes it impossible to evaluate whether the 16.1-point NIAH gain over CacheBlend generalizes or depends on specific conditions where the auxiliary-primary similarity holds.

    Authors: We acknowledge that greater statistical detail and ablation coverage would improve evaluation of generalizability. The current results reflect single-run point estimates under our experimental configuration. In revision we will add error bars derived from multiple random seeds for the primary metrics, provide per-dataset breakdowns within LongBench and NIAH, and include targeted ablations examining attention-similarity transfer across model families and chunk sizes. These additions will clarify the conditions under which the reported gains over CacheBlend and APE hold. revision: partial
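The similarity metrics named in the first response (KL divergence and Pearson correlation between attention distributions) can be computed as in the sketch below. It is a minimal, standard-library illustration on toy data: the raw attention weights are invented, and the `eps` smoothing term is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) (an assumption here)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


primary_attn = normalize([1.0, 4.0, 0.5, 3.0, 1.5])  # toy primary-model weights
aux_attn = normalize([0.5, 3.5, 2.0, 3.0, 1.0])      # toy auxiliary-model weights
print(kl_divergence(primary_attn, aux_attn))
print(pearson(primary_attn, aux_attn))
```

Lower KL divergence and higher correlation would both point toward the auxiliary model being a usable proxy; neither metric alone shows that the selected token sets match, which is why a top-k overlap measure is listed alongside them.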

Circularity Check

0 steps flagged

No significant circularity; empirical engineering approach with independent experimental validation.

Full rationale

The paper presents CacheClip as a practical framework relying on an empirical observation that small auxiliary LLMs show similar last-layer attention distributions to primary LLMs. This is stated as a key insight without derivation from equations or reduction to fitted parameters by construction. Techniques like auxiliary-guided token selection, shared prefixes, sliding-window grouping, and CPU-GPU hybrid design are described as engineering choices, with performance claims supported by reported experiments on NIAH and LongBench rather than self-referential identities. No load-bearing steps invoke self-citations for uniqueness theorems or smuggle ansatzes; the central claim does not reduce to its inputs and remains falsifiable via external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on one central domain assumption about attention similarity and one tunable parameter; no new entities are postulated.

free parameters (1)
  • recomp%
    User-adjustable fraction of tokens selected for recomputation; example value 20% used in reported results.
axioms (1)
  • domain assumption Small auxiliary LLMs exhibit similar last-layer attention distributions to the primary LLM on the same inputs
    This is the key insight that justifies using the auxiliary model to select critical tokens.

pith-pipeline@v0.9.0 · 5883 in / 1231 out tokens · 37346 ms · 2026-05-22T12:32:25.850043+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    Large language models in healthcare and medical domain: A review

    Zabir Al Nazi and Wei Peng. Large language models in healthcare and medical domain: A review. In Informatics, volume 11, page 57. MDPI, 2024

  2. [2]

    Large language models in finance: A survey

    Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. Large language models in finance: A survey. In Proceedings of the fourth ACM international conference on AI in finance, pages 374–382, 2023

  3. [3]

    Advancements in the application of large language models in urban studies: A systematic review

    Junhao Xia, Yao Tong, and Ying Long. Advancements in the application of large language models in urban studies: A systematic review. Cities, 165:106142, 2025

  4. [4]

    Llm4sr: A survey on large language models for scientific research

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research. arXiv preprint arXiv:2501.04306, 2025

  5. [5]

    Dated data: Tracing knowledge cutoffs in large language models

    Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. Dated data: Tracing knowledge cutoffs in large language models. arXiv preprint arXiv:2403.12958, 2024

  6. [6]

    Large language models struggle to learn long-tail knowledge

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International conference on machine learning, pages 15696–15707. PMLR, 2023

  7. [7]

    Knowledge boundary of large language models: A survey

    Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, and Yang Deng. Knowledge boundary of large language models: A survey. arXiv preprint arXiv:2412.12472, 2024

  8. [8]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

  9. [9]

    Reducing hallucination in structured outputs via retrieval-augmented generation

    Patrice Béchard and Orlando Marquez Ayala. Reducing hallucination in structured outputs via retrieval-augmented generation. arXiv preprint arXiv:2404.08189, 2024

  10. [10]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  11. [11]

    On the computational complexity of self-attention

    Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. On the computational complexity of self-attention. In International conference on algorithmic learning theory, pages 597–619. PMLR, 2023

  12. [12]

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383, 2025

  13. [13]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  14. [14]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023

  15. [15]

    Prompt cache: Modular attention reuse for low-latency inference

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024

  16. [16]

    Ape: Faster and longer context-augmented generation via adaptive parallel encoding

    Xinyu Yang, Tianqi Chen, and Beidi Chen. Ape: Faster and longer context-augmented generation via adaptive parallel encoding. In ICLR 2025, 2025. 12 APREPRINT- OCTOBER14, 2025

  17. [17]

    Cacheblend: Fast large language model serving with cached knowledge fusion

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024

  18. [18]

    The power of noise: Redefining retrieval for rag systems

    Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 719–729, 2024

  19. [19]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

  20. [20]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  21. [21]

    Towards understanding systems trade-offs in retrieval-augmented generation model inference

    Michael Shen, Muhammad Umar, Kiwan Maeng, G Edward Suh, and Udit Gupta. Towards understanding systems trade-offs in retrieval-augmented generation model inference. arXiv preprint arXiv:2412.11854, 2024

  22. [22]

    Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation

    Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation. Proceedings of the ACM on Management of Data, 3(3):1–28, 2025

  23. [23]

    Ragcache: Efficient knowledge caching for retrieval-augmented generation

    Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation. arXiv preprint arXiv:2404.12457, 2024

  24. [24]

    Llm×mapreduce: Simplified long-sequence processing using large language models

    Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, et al. Llm×mapreduce: Simplified long-sequence processing using large language models. arXiv preprint arXiv:2410.09342, 2024

  25. [25]

    Kvlink: Accelerating large language models via efficient kv cache reuse

    Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. Kvlink: Accelerating large language models via efficient kv cache reuse. arXiv preprint arXiv:2502.16002, 2025

  26. [26]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  27. [27]

    Block-attention for efficient rag

    East Sun, Yan Wang, and Lan Tian. Block-attention for efficient rag. arXiv preprint arXiv:2409.15355, 2024

  28. [28]

    Turborag: Accelerating retrieval-augmented generation with precomputed kv caches for chunked text

    Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. Turborag: Accelerating retrieval-augmented generation with precomputed kv caches for chunked text. arXiv preprint arXiv:2410.07590, 2024

  29. [29]

    Attention entropy is a key factor: An analysis of parallel context encoding with full-attention-based pre-trained language models

    Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, and Dong Yu. Attention entropy is a key factor: An analysis of parallel context encoding with full-attention-based pre-trained language models. arXiv preprint arXiv:2412.16545, 2024

  30. [30]

    Scatterbrain: Unifying sparse and low-rank attention

    Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426, 2021

  31. [31]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  32. [32]

    What does bert learn about the structure of language?

    Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does bert learn about the structure of language? In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, 2019

  33. [33]

    Analyzing the Structure of Attention in a Transformer Language Model

    Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284, 2019

  34. [34]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

  35. [35]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024

  36. [36]

    On the importance of local information in transformer based models

    Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, and Mitesh M Khapra. On the importance of local information in transformer based models. arXiv preprint arXiv:2008.05828, 2020

  37. [37]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  38. [38]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.

  39. [39]

    Accelerate artificial intelligence (ai) workloads with intel advanced matrix extensions (intel amx), December 2022

    Intel. Accelerate artificial intelligence (ai) workloads with intel advanced matrix extensions (intel amx), December 2022

  40. [40]

    Sparamx: Accelerating compressed llms token generation on amx-powered cpus

    Ahmed F AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J Pablo Muñoz, Vui Seng Chua, Nilesh Jain, and Mohamed S Abdelfattah. Sparamx: Accelerating compressed llms token generation on amx-powered cpus. arXiv preprint arXiv:2502.12444, 2025

  41. [41]

    Smollm2: When smol goes big – data-centric training of a small language model, 2025

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Wer...

  42. [42]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  43. [43]

    Needle In A Haystack - Pressure Testing LLMs

    Gregory Kamradt. Needle In A Haystack - Pressure Testing LLMs. https://github.com/gkamradt/LLMTestNeedleInAHaystack/tree/main, 2023

  44. [44]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  45. [45]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024. A Detailed Evaluation Results Table 2: Performance of APE on RUL...