CacheClip: Accelerating RAG with Effective KV Cache Reuse
Pith reviewed 2026-05-22 12:32 UTC · model grok-4.3
The pith
Small auxiliary LLMs identify critical tokens via last-layer attention similarity to enable selective KV cache recomputation in RAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CacheClip claims that small auxiliary LLMs exhibit last-layer attention distributions similar to those of primary LLMs, which makes it cheap to identify the tokens critical for restoring inter-chunk attention. That observation supports auxiliary-model-guided token selection for selective KV cache recomputation. Combined with shared prefixes, sliding-window grouping, and CPU-GPU hybrid execution, this accelerates RAG inference while retaining up to 85.2 percent of full-attention performance on NIAH and 91.1 percent on LongBench at 20 percent recomputation.
What carries the argument
Auxiliary-model-guided token selection that uses last-layer attention similarity to decide which tokens require KV cache recomputation across chunk boundaries.
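As a rough illustration of this selection step, the sketch below ranks context tokens by a per-token score and keeps the top fraction for recomputation. The function name `select_recompute_tokens`, the way scores are aggregated (one last-layer attention score per token from the auxiliary model), and the sample scores are all assumptions made for illustration, not details taken from the paper.

```python
import math


def select_recompute_tokens(aux_scores, recomp_ratio):
    """Return sorted indices of the highest-scoring tokens.

    aux_scores: one non-negative score per context token. Assumed here to be
        the auxiliary model's last-layer attention mass received by that token;
        how heads and query positions are aggregated is an assumption.
    recomp_ratio: fraction of tokens to recompute, in [0, 1].
    """
    if not 0.0 <= recomp_ratio <= 1.0:
        raise ValueError("recomp_ratio must be in [0, 1]")
    budget = math.ceil(recomp_ratio * len(aux_scores))
    # Rank token positions by score, highest first; ties keep original order.
    ranked = sorted(range(len(aux_scores)), key=lambda i: -aux_scores[i])
    return sorted(ranked[:budget])


# Hypothetical scores for a 10-token context.
scores = [0.02, 0.30, 0.05, 0.01, 0.25, 0.03, 0.04, 0.10, 0.15, 0.05]
print(select_recompute_tokens(scores, recomp_ratio=0.2))  # prints [1, 4]
```

Only the returned positions would have their KV entries recomputed by the primary model; every other position would keep its cached entry.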
If this is right
- Adjusting the recomputation ratio lets users control the speed-quality trade-off without retraining.
- Shared prefixes remove repeated attention sinks that would otherwise waste compute on every chunk.
- Sliding-window grouping preserves local token coherence while only partial KV states are updated.
- Offloading the auxiliary model to CPU avoids extra GPU memory or compute cost during prefill.
- The method outperforms prior reuse techniques such as CacheBlend and APE on both NIAH and LongBench at equal recomputation budgets.
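One way to picture the sliding-window grouping mentioned above is to score contiguous windows of tokens rather than single tokens, then pick the best non-overlapping windows until the recomputation budget is used up. This is a minimal sketch under stated assumptions: the function `select_token_windows`, the greedy non-overlap rule, and the `window`, `stride`, and `budget` parameters are hypothetical and are not the paper's specified algorithm.

```python
def select_token_windows(scores, window, budget, stride=1):
    """Greedily pick non-overlapping token windows with the highest total score.

    scores: per-token importance scores (assumed to come from the auxiliary model).
    window: number of consecutive tokens per group.
    budget: maximum number of tokens that may be selected.
    stride: step between candidate window starts.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    n = len(scores)
    # Candidate window starts, ranked by the total score each window contains.
    starts = range(0, max(n - window, 0) + 1, stride)
    ranked = sorted(starts, key=lambda s: -sum(scores[s:s + window]))
    chosen = set()
    for s in ranked:
        span = set(range(s, min(s + window, n)))
        if span & chosen:
            continue  # overlaps a window that was already picked
        if len(chosen) + len(span) > budget:
            continue  # taking this window would exceed the token budget
        chosen |= span
    return sorted(chosen)


# Hypothetical scores for a 10-token context.
scores = [0.02, 0.30, 0.05, 0.01, 0.25, 0.03, 0.04, 0.10, 0.15, 0.05]
print(select_token_windows(scores, window=2, budget=4))  # prints [1, 2, 4, 5]
```

Selecting whole windows means each high-scoring token is recomputed together with a neighbour, which is the local-coherence property the bullet describes.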
Where Pith is reading between the lines
- The same attention-similarity signal might let future systems decide on the fly whether a retrieved chunk needs any recomputation at all.
- Because the auxiliary runs on CPU, the technique could be deployed on hardware where GPU memory is the main constraint.
- If attention similarity proves stable across model families, CacheClip-style selection could become a standard prefill optimization for any long-context retrieval pipeline.
- Testing the approach on even smaller auxiliaries or distilled versions would directly measure how far the similarity assumption can be pushed.
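The CPU offload idea can be pictured as two independent tasks that run concurrently: scoring tokens with the auxiliary model on CPU, and fetching cached KV states for the primary model. The sketch below uses Python's standard `concurrent.futures` module; `score_tokens_on_cpu` and `load_cached_kv` are hypothetical placeholders with fake return values, not real model or cache calls.

```python
from concurrent.futures import ThreadPoolExecutor


def score_tokens_on_cpu(chunks):
    # Placeholder for auxiliary-model inference on CPU (hypothetical).
    # Returns one fake score per token: here, simply the token's length.
    return [[float(len(tok)) for tok in chunk] for chunk in chunks]


def load_cached_kv(chunks):
    # Placeholder for fetching precomputed per-chunk KV caches (hypothetical).
    return [f"kv[{i}]" for i, _ in enumerate(chunks)]


def prefill_inputs(chunks):
    """Run token scoring and KV loading concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        scores_future = pool.submit(score_tokens_on_cpu, chunks)
        kv_future = pool.submit(load_cached_kv, chunks)
        return scores_future.result(), kv_future.result()


scores, kv = prefill_inputs([["a", "bb"], ["ccc"]])
print(scores, kv)
```

In a real system the two tasks would be native inference and memory-transfer calls; whether they overlap well depends on the runtime, which this sketch does not model.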
Load-bearing premise
The last-layer attention maps of the small auxiliary model remain similar enough to the primary model's maps across different primary models, tasks, and chunk boundaries to select the right tokens for recomputation.
What would settle it
On a cross-chunk reasoning benchmark, run both the auxiliary and primary models on the same input chunks and measure whether their last-layer attention rankings differ enough that the selected recomputation set produces quality more than 10 points below full attention.
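A simple way to quantify how much the two rankings differ is top-k overlap: the fraction of the primary model's k highest-scoring tokens that the auxiliary model also places in its top k. The sketch below assumes per-token scores have already been extracted from both models; the function names and sample values are illustrative only.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores (ties keep original order)."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])


def top_k_overlap(aux_scores, primary_scores, k):
    """Fraction of top-k token positions shared by the two score vectors."""
    if len(aux_scores) != len(primary_scores):
        raise ValueError("score vectors must cover the same tokens")
    if not 0 < k <= len(aux_scores):
        raise ValueError("k must be between 1 and the number of tokens")
    shared = top_k_indices(aux_scores, k) & top_k_indices(primary_scores, k)
    return len(shared) / k


# Hypothetical last-layer attention scores over the same five tokens.
aux = [0.05, 0.40, 0.10, 0.30, 0.15]
primary = [0.10, 0.35, 0.05, 0.20, 0.30]
print(top_k_overlap(aux, primary, k=2))  # prints 0.5
```

An overlap near 1.0 at the chosen recomputation budget would support the premise; a low overlap on cross-chunk inputs would predict the quality drop described above.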
Original abstract
Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33$\times$ in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CacheClip, a framework to accelerate RAG inference by mitigating TTFT bottlenecks through KV cache reuse. Its central approach relies on small auxiliary LLMs to identify tokens for selective recomputation by exploiting the similarity of their last-layer attention distributions to those of the primary model, combined with shared prefixes to remove redundant attention sinks, a sliding-window grouping strategy, and a CPU-GPU hybrid execution model. At a 20% recomputation ratio, it reports retaining up to 85.2% of full-attention performance on NIAH and 91.1% on LongBench, a prefill speedup of up to 3.33×, and gains over CacheBlend and APE of 16.1 and 12.8 points on NIAH and 4.5 and 4.2 points on LongBench.
Significance. If the auxiliary-to-primary attention similarity assumption proves robust, CacheClip would offer a practical, adjustable engineering solution to the efficiency-quality trade-off in RAG systems, with concrete reported speedups and retention figures that could influence deployment practices for long-context retrieval-augmented tasks. The hybrid CPU offload and adjustable recomp% parameter are pragmatic strengths that distinguish it from prior prefix-caching or full precomputation baselines.
major comments (2)
- [Abstract] Abstract (key insight paragraph): The claim that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs is presented as the enabling observation for token selection, yet no quantitative metrics (token overlap, KL divergence, or correlation values) or cross-boundary validation are supplied. This assumption is load-bearing for the selective recomputation that supports the 85.2% NIAH and 91.1% LongBench retention figures at 20% recomp%; divergence on cross-chunk reasoning would directly weaken those results.
- [Experiments] Experiments (reported results): The retention and speedup numbers are given without error bars, per-dataset breakdowns, or ablations testing attention-similarity transfer across model families, chunk sizes, or query types. This omission makes it impossible to evaluate whether the 16.1-point NIAH gain over CacheBlend generalizes or depends on specific conditions where the auxiliary-primary similarity holds.
minor comments (2)
- Clarify the exact auxiliary and primary model pairs, chunk sizes, and query distributions used for the NIAH and LongBench numbers so readers can reproduce the attention-similarity premise.
- The sliding-window grouping strategy and shared-prefix mechanism are described at a high level; a diagram or pseudocode would improve clarity on how local coherence is preserved during partial KV updates.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. We will revise the manuscript to address the concerns raised, particularly by adding quantitative support for our key assumption and enhancing the experimental reporting.
- Referee: [Abstract] Abstract (key insight paragraph): The claim that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs is presented as the enabling observation for token selection, yet no quantitative metrics (token overlap, KL divergence, or correlation values) or cross-boundary validation are supplied. This assumption is load-bearing for the selective recomputation that supports the 85.2% NIAH and 91.1% LongBench retention figures at 20% recomp%; divergence on cross-chunk reasoning would directly weaken those results.
Authors: We agree that explicit quantitative metrics would better substantiate the core assumption. While the manuscript presents the similarity observation as the foundation for auxiliary-guided selection, the initial version omitted direct metrics such as KL divergence or token overlap. In the revised manuscript we will add a dedicated analysis section reporting average KL divergence, Pearson correlation, and top-k token overlap between last-layer attention distributions of the auxiliary and primary models across representative RAG inputs. We will also include cross-chunk boundary validation to confirm robustness on inter-chunk reasoning tasks, directly supporting the reported retention figures at 20% recomputation. revision: yes
- Referee: [Experiments] Experiments (reported results): The retention and speedup numbers are given without error bars, per-dataset breakdowns, or ablations testing attention-similarity transfer across model families, chunk sizes, or query types. This omission makes it impossible to evaluate whether the 16.1-point NIAH gain over CacheBlend generalizes or depends on specific conditions where the auxiliary-primary similarity holds.
Authors: We acknowledge that greater statistical detail and ablation coverage would improve evaluation of generalizability. The current results reflect single-run point estimates under our experimental configuration. In revision we will add error bars derived from multiple random seeds for the primary metrics, provide per-dataset breakdowns within LongBench and NIAH, and include targeted ablations examining attention-similarity transfer across model families and chunk sizes. These additions will clarify the conditions under which the reported gains over CacheBlend and APE hold. revision: partial
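Two of the similarity metrics named in the rebuttal, KL divergence and Pearson correlation, can be computed directly from a pair of attention distributions over the same tokens. The sketch below is a plain-Python illustration under that assumption; the `eps` smoothing constant and the sample distributions are choices made here, not values from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    eps guards against log(0) when q assigns zero mass to a token.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Hypothetical attention distributions (each sums to 1) over five tokens.
aux = [0.05, 0.40, 0.10, 0.30, 0.15]
primary = [0.10, 0.35, 0.05, 0.20, 0.30]
print(kl_divergence(primary, aux), pearson(primary, aux))
```

Lower KL divergence and higher correlation would both indicate closer agreement between the auxiliary and primary attention maps.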
Circularity Check
No significant circularity; empirical engineering approach with independent experimental validation.
The paper presents CacheClip as a practical framework relying on an empirical observation that small auxiliary LLMs show similar last-layer attention distributions to primary LLMs. This is stated as a key insight without derivation from equations or reduction to fitted parameters by construction. Techniques like auxiliary-guided token selection, shared prefixes, sliding-window grouping, and CPU-GPU hybrid design are described as engineering choices, with performance claims supported by reported experiments on NIAH and LongBench rather than self-referential identities. No load-bearing steps invoke self-citations for uniqueness theorems or smuggle ansatzes; the central claim does not reduce to its inputs and remains falsifiable via external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- recomp%
axioms (1)
- domain assumption: Small auxiliary LLMs exhibit similar last-layer attention distributions to the primary LLM on the same inputs
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation between the paper passage and the cited Recognition theorem: unclear)
"Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs … enabling efficient identification of tokens critical for restoring inter-chunk attention"
- IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking (relation between the paper passage and the cited Recognition theorem: unclear)
"shared prefixes to eliminate redundant attention sinks … sliding-window grouping strategy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.