Predictive Prefetching for Retrieval-Augmented Generation

Shichao Pei; Wuyang Zhang

arxiv: 2605.17989 · v1 · pith:TYEMPOBHnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 11:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Retrieval-Augmented Generation · Predictive Prefetching · Asynchronous Retrieval · Latency Reduction · Semantic Precursors · Query Generator · Context Monitor

The pith

A predictive prefetching framework for RAG triggers retrievals by spotting semantic precursors several tokens before uncertainty peaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to cut the latency of retrieval-augmented generation by shifting from synchronous to predictive asynchronous retrieval. It argues that the model's own generation process contains early, detectable signals of future information needs. If those signals can be read reliably, the system can fetch the right material in advance rather than waiting until the model stalls. Three components work together to monitor context, decide when to act, and generate the actual retrieval query. Experiments across benchmarks show large drops in both overall latency and time to first token while answer quality stays comparable to standard RAG.

Core claim

The paper claims that semantic precursors in generation dynamics emerge several tokens before uncertainty becomes critical, and that these precursors can be exploited by a retrieval predictor, a context monitor, and a query generator to produce accurate prefetch decisions that align with the model's evolving information needs.

What carries the argument

The retrieval predictor, context monitor, and query generator together exploit semantic precursors in generation dynamics to decide both when retrieval should occur and what content to request.
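The division of labour among the three components can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function bodies, the `LOOKAHEAD_THRESHOLD` value, the hedge-word rule, and the simulated latencies are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import asyncio

LOOKAHEAD_THRESHOLD = 0.5  # hypothetical trigger threshold on the predictor score
HEDGES = {"may", "possibly", "likely"}


def retrieval_predictor(tokens):
    """Toy stand-in: score 1.0 if a hedge word appears in the last 3 tokens."""
    return 1.0 if any(t in HEDGES for t in tokens[-3:]) else 0.0


def context_monitor(tokens):
    """Toy stand-in: the context counts as ready once it holds 4 tokens."""
    return len(tokens) >= 4


def query_generator(tokens):
    """Toy stand-in: use the last 4 tokens as the retrieval query."""
    return " ".join(tokens[-4:])


async def retrieve(query, latency_s=0.05):
    """Simulated retriever; the sleep stands in for index lookup latency."""
    await asyncio.sleep(latency_s)
    return f"docs for: {query}"


async def generate(stream, step_s=0.02):
    """Decode tokens while prefetching in the background (never blocks)."""
    tokens, cache, pending = [], {}, None
    for tok in stream:
        await asyncio.sleep(step_s)  # simulated decode step
        tokens.append(tok)
        # Collect a finished prefetch without waiting on it.
        if pending is not None and pending[1].done():
            cache[pending[0]] = pending[1].result()
            pending = None
        # Issue at most one prefetch at a time, only when both gates agree.
        if (pending is None
                and retrieval_predictor(tokens) >= LOOKAHEAD_THRESHOLD
                and context_monitor(tokens)):
            query = query_generator(tokens)
            if query not in cache:
                pending = (query, asyncio.create_task(retrieve(query)))
    if pending is not None:  # drain any prefetch still in flight
        cache[pending[0]] = await pending[1]
    return tokens, cache


if __name__ == "__main__":
    text = "the capital of that region may be the city of X"
    tokens, cache = asyncio.run(generate(text.split()))
    for query, docs in cache.items():
        print(query, "->", docs)
```

The point of the sketch is the control flow: the predictor decides when, the monitor gates on readiness, the query generator decides what, and the decode loop keeps running while the retrieval task is in flight.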

If this is right

End-to-end latency drops by up to 43.5 percent on tested benchmarks.
Time-to-first-token improves by 62.4 percent while answer quality remains comparable.
The same three-component design works across multiple benchmarks without manual tuning of retrieval timing.
Prefetching decisions adapt to changing information demands during a single generation pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-signal approach could be tested on open-ended creative generation tasks where information needs shift rapidly.
If the predictors generalize, retrieval costs in large-scale serving systems could fall by skipping unneeded fetches.
A natural next measurement is how far ahead the precursors appear on average and whether that distance varies by domain.
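The lead-distance measurement suggested in the last item could be prototyped roughly as follows. This is a minimal sketch that assumes entropy is the only signal and that both thresholds are hand-picked; the function names, the thresholds, and the toy distributions are hypothetical and do not come from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def lead_distance(entropies, slope_threshold, critical_threshold):
    """Tokens between the first early-warning step and the first critical step.

    Early warning: first step whose entropy rise over the previous token
    exceeds slope_threshold. Critical: first step whose entropy exceeds
    critical_threshold. Returns None if either event never happens or if
    the warning does not come strictly first.
    """
    trigger = next((t for t in range(1, len(entropies))
                    if entropies[t] - entropies[t - 1] > slope_threshold), None)
    critical = next((t for t, h in enumerate(entropies)
                     if h > critical_threshold), None)
    if trigger is None or critical is None or trigger >= critical:
        return None
    return critical - trigger


# Toy trace: the next-token distribution flattens over six decode steps.
trace = [
    [0.97, 0.01, 0.01, 0.01],
    [0.97, 0.01, 0.01, 0.01],
    [0.85, 0.05, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.55, 0.15, 0.15, 0.15],
    [0.25, 0.25, 0.25, 0.25],
]
entropies = [entropy(p) for p in trace]
print(lead_distance(entropies, slope_threshold=0.3, critical_threshold=1.3))
```

Averaging this quantity over many traces, grouped by domain, would give the per-domain lead distances the item asks about.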

Load-bearing premise

Semantic precursors reliably appear several tokens before critical uncertainty and can be read accurately enough to drive prefetching in complex multi-domain cases without quality loss.

What would settle it

A controlled run on a multi-domain benchmark where the predictor either misses needed retrievals or issues wrong ones, producing either higher end-to-end latency or lower answer accuracy than synchronous RAG.

Figures

Figures reproduced from arXiv: 2605.17989 by Shichao Pei, Wuyang Zhang.

**Figure 1.** Comparison of RAG architectures. (a) Synchronous RAG: generation blocks during each retrieval (287 ms). (b) PipeRAG: retrieval runs in parallel but suffers from query staleness, since queries use outdated context (e.g., RET2 at token 12 is applied at token 16 when context has shifted). (c) Ours: predictive prefetching anticipates future needs. At token 2, we predict token 5 will require retrieval and issue …

**Figure 2.** The architecture decoupling generation and retrieval. Retrieval Predictor monitors transformer signals, Context Monitor assesses query readiness, and Query Generator initiates asynchronous retrieval. Documents and embeddings are stored in a shared Result Cache. Online learning (red) adapts all three components based on retrieval outcomes. … ∆ tokens: $\hat{p}_t = \mathrm{RetrievalPredictor}(H_t, A_t, V_t, o_t) \in [0, 1]$ (1) …

**Figure 3.** Prediction and query analysis. (a) ROC curves show the Retrieval Predictor (AUROC = 0.81) outperforms entropy thresholding (0.66) by 22.7%. (b) Waiting 3–4 tokens improves query relevance by up to 23% for factual queries. … confirming benefits across diverse latency regimes. Cross-Model Generalization. Evaluation on six models (Llama, GPT-OSS, Qwen families) confirms consistent TTFT improvements of 61.5–63.4…

**Figure 4.** Radar chart comparing model performance across five dimensions: EM accuracy, TTFT efficiency (inverted), E2E efficiency (inverted), AUROC, and Efficiency Score. GPT-OSS-20B achieves the best balance of quality and efficiency, while GPT-OSS-120B leads in accuracy. All models show consistent TTFT improvements through predictive prefetching.

**Figure 5.** Multi-signal visualization before retrieval trigger. (a) Entropy Ht and its derivative show rising uncertainty. (b) Attention dispersion increases as the model’s focus scatters. (c) Linguistic hedge frequency (“may”, “possibly”, “likely”) spikes before uncertainty. The green dashed line marks the prediction trigger at token 8, well before peak uncertainty.

**Figure 6.** TTFT/E2E latency reduction and prefetch hit rate as retrieval latency varies from 50 ms to 1000 ms on HotpotQA. Shaded band marks the 100–500 ms working range targeted by the framework.

**Figure 7.** Online adaptation on HotpotQA: AUROC (left axis) and reward standard deviation (right axis) versus number of online queries. 70% of AUROC gain occurs within the first 500 queries; reward variance decreases monotonically.
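The AUROC comparison in Figure 3 can be sanity-checked with a few lines of code. The sketch below computes AUROC in its rank-based (Mann-Whitney) form on made-up scores and then reproduces the relative gain implied by the two reported values, 0.81 and 0.66. The toy scores and labels are hypothetical; only the two AUROC values are taken from the caption.

```python
def auroc(scores, labels):
    """AUROC as the chance that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Hypothetical predictor scores; label 1 means retrieval was actually needed.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auroc(toy_scores, toy_labels))

# Relative gain implied by the two AUROC values quoted in Figure 3.
print(round(100 * relative_gain(0.81, 0.66), 1))
```

The second print confirms that the 22.7% figure is a relative improvement, (0.81 − 0.66) / 0.66, rather than an absolute difference of 0.15 AUROC points.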

Original abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper offers a concrete three-component framework for predictive prefetching in RAG that targets real latency issues, though the abstract leaves implementation and evaluation details thin.

The one thing to know is that this paper introduces a predictive prefetching framework for retrieval-augmented generation. It aims to reduce the latency that comes from waiting for retrieval during generation by using signals from the ongoing output to prefetch relevant information ahead of time.

What is new here is the explicit three-component architecture. Instead of relying on heuristics for when to retrieve asynchronously, they have a retrieval predictor that decides timing, a context monitor that tracks the generation state, and a query generator that creates the actual retrieval query. This exploits what they call semantic precursors that appear in the generation dynamics before the model hits high uncertainty.

The paper does well in focusing on a practical issue. RAG systems often face delays in real applications, and the reported results show up to 43.5 percent reduction in end-to-end latency and over 60 percent better time to first token, all while keeping answer quality on par with standard synchronous approaches. The experiments span multiple benchmarks, which adds some breadth.

The soft spots are mainly around the missing specifics. The description stays at the framework level without detailing the neural architectures for the predictor and monitor, the training procedure, or any ablations that isolate the contribution of each component. There is also no mention of statistical significance or detailed error analysis for cases where the prediction might miss. This makes it difficult to fully evaluate how well the approach generalizes beyond the tested settings, especially in complex multi-domain scenarios. Overall, the central claim holds up in the reported outcomes, but more transparent methods would make the evidence more convincing.

This paper is for engineers and researchers who work on making RAG systems faster for interactive use cases. Anyone dealing with deployment constraints around latency and retrieval will see direct relevance.

It deserves a serious referee because it tackles a timely problem with a structured proposal and empirical backing. Reviewers could help strengthen the evaluation sections. I recommend sending this to peer review rather than desk rejecting it. The idea has enough substance to warrant detailed feedback on the implementation and results.

Referee Report

2 major / 1 minor

Summary. The paper proposes an advanced asynchronous RAG framework for predictive prefetching that uses three components—a retrieval predictor, a context monitor, and a query generator—to exploit semantic precursors in generation dynamics several tokens before uncertainty becomes critical. It claims this enables accurate triggering of retrieval and query generation, yielding up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token across multiple benchmarks while preserving answer quality comparable to synchronous RAG baselines.

Significance. If the reported gains are reproducible, the work addresses a practical bottleneck in RAG deployment by shifting from heuristic or synchronous retrieval to predictive prefetching based on generation dynamics. This could improve responsiveness in latency-sensitive applications without quality trade-offs, particularly in multi-domain settings where information needs evolve during decoding.

major comments (2)

[Abstract and experimental evaluation] The abstract and experimental results report specific quantitative gains (43.5% latency reduction, 62.4% TTFT improvement) and maintained answer quality, but supply no implementation details on the retrieval predictor, context monitor, or query generator architectures, training procedures, or how semantic precursors are detected and exploited. This information is load-bearing for verifying the central empirical claims.
[Experiments section] No statistical tests, error analysis, or ablation studies are described to support the robustness of the prefetching approach across complex multi-domain settings or to isolate the contribution of each of the three components. Without these, the claim that semantic precursors reliably enable accurate prefetching several tokens ahead cannot be fully assessed.

minor comments (1)

[Abstract] The abstract would be clearer if it named the specific benchmarks and baselines used for the reported latency and quality comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and valuable suggestions. We have carefully considered the comments and revised the manuscript to address the concerns regarding implementation details and experimental robustness. Our responses to the major comments are as follows.

Referee: [Abstract and experimental evaluation] The abstract and experimental results report specific quantitative gains (43.5% latency reduction, 62.4% TTFT improvement) and maintained answer quality, but supply no implementation details on the retrieval predictor, context monitor, or query generator architectures, training procedures, or how semantic precursors are detected and exploited. This information is load-bearing for verifying the central empirical claims.

Authors: We agree that providing comprehensive implementation details is crucial for the reproducibility and verifiability of our results. In the revised version of the manuscript, we have added a dedicated subsection in the Methods section that describes the architectures of the retrieval predictor, context monitor, and query generator in detail, including model sizes, layer configurations, and input/output formats. We also elaborate on the training procedures, datasets, and loss functions used for each component. Furthermore, we explain the methodology for detecting and exploiting semantic precursors, including the specific metrics and thresholds employed. These additions directly support the central empirical claims and should facilitate independent verification.

Revision: yes

Referee: [Experiments section] No statistical tests, error analysis, or ablation studies are described to support the robustness of the prefetching approach across complex multi-domain settings or to isolate the contribution of each of the three components. Without these, the claim that semantic precursors reliably enable accurate prefetching several tokens ahead cannot be fully assessed.

Authors: We recognize the importance of rigorous statistical analysis and ablation studies to substantiate our claims. Accordingly, we have incorporated statistical tests, including paired t-tests to assess the significance of the observed latency reductions and TTFT improvements across benchmarks. We now report error bars and standard deviations from multiple experimental runs. Additionally, we have included comprehensive ablation studies that isolate the impact of each component (retrieval predictor, context monitor, and query generator) by evaluating variants with individual components disabled. A new error analysis subsection examines cases in multi-domain settings where prefetching accuracy varies, providing insights into the reliability of semantic precursors. These enhancements allow for a fuller assessment of the approach's robustness.

Revision: yes
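For readers unfamiliar with the paired t-test mentioned in this response, the statistic can be computed as sketched below. This is an illustration only: the per-query latencies are invented, the function name is hypothetical, and nothing here reflects the authors' actual analysis. A p-value would additionally need the t-distribution CDF, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """t statistic and degrees of freedom for paired samples.

    Tests whether the mean per-query difference (baseline - treatment)
    differs from zero.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - t for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Invented per-query latencies in milliseconds, same five queries in both runs.
sync_ms = [520, 480, 610, 550, 500]
prefetch_ms = [310, 300, 330, 320, 290]
t_stat, dof = paired_t_statistic(sync_ms, prefetch_ms)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

Pairing matters because both systems are run on the same queries, so per-query differences remove the variance that comes from some queries simply being slower than others.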

Circularity Check

0 steps flagged

No significant circularity; claims rest on proposed architecture and external benchmarks

The paper introduces a predictive prefetching framework for RAG via three components (retrieval predictor, context monitor, query generator) that exploit semantic precursors emerging before uncertainty peaks. No equations, fitted parameters, or self-referential definitions appear in the provided text. Central claims of latency reduction (up to 43.5%) and maintained quality are validated through experiments on multiple benchmarks rather than by construction from inputs. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach is described at the level of system components and empirical outcomes without mathematical formulation.

pith-pipeline@v0.9.0 · 5666 in / 1061 out tokens · 38328 ms · 2026-05-20T11:33:39.028695+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (8-tick period) · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: semantic precursors in generation dynamics that emerge approximately 8–16 tokens before uncertainty becomes critical

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates / z_monotone_absolute · echoes

Paper passage: retrieval predictor that forecasts impending information needs by monitoring generation signals, including token distributions, attention patterns, and discourse markers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Improving Language Models by Retrieving from Trillions of Tokens , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

work page 2022

[11] [11]

Akari Asai and Zeqiu Wu and Yizhong Wang and Avirup Sil and Hannaneh Hajishirzi , booktitle=. Self-. 2024 , url=

work page 2024

[12] [12]

doi: 10.18653/v1/2023.emnlp-main.495

Jiang, Zhengbao and Xu, Frank and Gao, Luyu and Sun, Zhiqing and Liu, Qian and Dwivedi-Yu, Jane and Yang, Yiming and Callan, Jamie and Neubig, Graham. Active Retrieval Augmented Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.495

work page doi:10.18653/v1/2023.emnlp-main.495 2023

[13] [13]

Transactions on Machine Learning Research , issn=

Unsupervised Dense Information Retrieval with Contrastive Learning , author=. Transactions on Machine Learning Research , issn=. 2022 , url=

work page 2022

[14] [14]

2025 , eprint=

From Local to Global: A Graph RAG Approach to Query-Focused Summarization , author=. 2025 , eprint=

work page 2025

[15] [15]

2025 , eprint=

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval , author=. 2025 , eprint=

work page 2025

[16] [16]

2025 , isbn =

Jiang, Wenqi and Zhang, Shuai and Han, Boran and Wang, Jie and Wang, Bernie and Kraska, Tim , title =. 2025 , isbn =. doi:10.1145/3690624.3709194 , booktitle =

work page doi:10.1145/3690624.3709194 2025

[17] [17]

and Chen, Chao-Ting and Cheng, Jui-Hung and Huang, Hen-Hsen , title =

Chan, Brian J. and Chen, Chao-Ting and Cheng, Jui-Hung and Huang, Hen-Hsen , title =. 2025 , isbn =. doi:10.1145/3701716.3715490 , booktitle =

work page doi:10.1145/3701716.3715490 2025

[18] [18]

Speculative

Zilong Wang and Zifeng Wang and Long Le and Steven Zheng and Swaroop Mishra and Vincent Perot and Yuwei Zhang and Anush Mattapalli and Ankur Taly and Jingbo Shang and Chen-Yu Lee and Tomas Pfister , booktitle=. Speculative. 2025 , url=

work page 2025

[19] [19]

2024 , isbn =

Sarmah, Bhaskarjit and Mehta, Dhagash and Hall, Benika and Rao, Rohan and Patel, Sunil and Pasquali, Stefano , title =. 2024 , isbn =. doi:10.1145/3677052.3698671 , booktitle =

work page doi:10.1145/3677052.3698671 2024

[20] [20]

Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , title =. 2023 , isbn =. doi:10.1145/3600006.3613165 , booktitle =

work page doi:10.1145/3600006.3613165 2023

[21] [21]

Billion-Scale Similarity Search with GPUs , year=

Johnson, Jeff and Douze, Matthijs and Jégou, Hervé , journal=. Billion-Scale Similarity Search with GPUs , year=

work page

[22] [22]

Kernel Language Entropy: Fine-grained Uncertainty Quantification for

Alexander V Nikitin and Jannik Kossen and Yarin Gal and Pekka Marttinen , booktitle=. Kernel Language Entropy: Fine-grained Uncertainty Quantification for. 2024 , url=

work page 2024

[23] [23]

Not All Contexts Are Equal: Teaching LLM s Credibility-aware Generation

Pan, Ruotong and Cao, Boxi and Lin, Hongyu and Han, Xianpei and Zheng, Jia and Wang, Sirui and Cai, Xunliang and Sun, Le. Not All Contexts Are Equal: Teaching LLM s Credibility-aware Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1109

work page doi:10.18653/v1/2024.emnlp-main.1109 2024

[24] [24]

DRAGIN : Dynamic Retrieval Augmented Generation based on the Real-time Information Needs of Large Language Models

Su, Weihang and Tang, Yichen and Ai, Qingyao and Wu, Zhijing and Liu, Yiqun. DRAGIN : Dynamic Retrieval Augmented Generation based on the Real-time Information Needs of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.702

work page doi:10.18653/v1/2024.acl-long.702 2024

[25] [25]

Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models

Duan, Jinhao and Cheng, Hao and Wang, Shiqi and Zavalny, Alex and Wang, Chenan and Xu, Renjing and Kailkhura, Bhavya and Xu, Kaidi. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

work page doi:10.18653/v1/2024.acl-long.276 2024

[26] [26]

2025 , eprint=

Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers , author=. 2025 , eprint=

work page 2025

[27] [27]

2023 , eprint=

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , author=. 2023 , eprint=

work page 2023

[28] [28]

2024 , eprint=

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs , author=. 2024 , eprint=

work page 2024

[29] [29]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , volume =

Liu, Tianyang and Xu, Canwen and McAuley, Julian , booktitle =. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems , volume =

work page

[30] [30]

Repocoder: Repository-level code completion through iterative retrieval and generation

Zhang, Fengji and Chen, Bei and Zhang, Yue and Keung, Jacky and Liu, Jin and Zan, Daoguang and Mao, Yi and Lou, Jian-Guang and Chen, Weizhu. R epo C oder: Repository-Level Code Completion Through Iterative Retrieval and Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.151

work page doi:10.18653/v1/2023.emnlp-main.151 2023

[31] [31]

2024 , eprint=

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion , author=. 2024 , eprint=

work page 2024

[32] [32]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Wu, Di and Ahmad, Wasi Uddin and Zhang, Dejiao and Ramanathan, Murali Krishna and Ma, Xiaofei , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

work page 2024

[33] [33]

Radev , editor =

Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Awadallah, Ahmed Hassan and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir. QMS um: A New Benchmark for Query-based Multi-domain Meeting Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Com...

work page doi:10.18653/v1/2021.naacl-main.472 2021

[34] [34]

2020 , eprint=

Longformer: The Long-Document Transformer , author=. 2020 , eprint=

work page 2020

[35] [35]

Adapting Pretrained Text-to-Text Models for Long Text Sequences

Xiong, Wenhan and Gupta, Anchit and Toshniwal, Shubham and Mehdad, Yashar and Yih, Scott. Adapting Pretrained Text-to-Text Models for Long Text Sequences. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.370

work page doi:10.18653/v1/2023.findings-emnlp.370 2023

[36] [36]

doi: 10.18653/v1/2024.naacl-long.389

Jeong, Soyeong and Baek, Jinheon and Cho, Sukmin and Hwang, Sung Ju and Park, Jong. Adaptive- RAG : Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...

work page doi:10.18653/v1/2024.naacl-long.389 2024

[37] [37]

A Primer in BERT ology: What We Know About How BERT Works

Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna. A Primer in BERT ology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics. 2020. doi:10.1162/tacl_a_00349

work page doi:10.1162/tacl_a_00349 2020

[38] [38]

Proceedings of the 57th Conference of the Association for Computational Linguistics,

Tenney, Ian and Das, Dipanjan and Pavlick, Ellie. BERT Rediscovers the Classical NLP Pipeline. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1452

work page doi:10.18653/v1/p19-1452 2019

[39] [39]

Williams

Williams, Ronald J. , title =. 1992 , issue_date =. doi:10.1007/BF00992696 , journal =

work page doi:10.1007/bf00992696 1992

[40] [40]

SQ u AD : 100,000+ questions for machine comprehension of text

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQ u AD : 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1264

work page doi:10.18653/v1/d16-1264 2016

[41] [41]

, title =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , title =. J. Mach. Learn. Res. , month = jan, articleno =. 2020 , issue_date =

work page 2020

[42] [42]

Proceedings of the 40th International Conference on Machine Learning , articleno =

Leviathan, Yaniv and Kalman, Matan and Matias, Yossi , title =. Proceedings of the 40th International Conference on Machine Learning , articleno =. 2023 , publisher =

work page 2023

[43] [43]

An introduction to ROC analysis

Tom Fawcett , keywords =. An introduction to ROC analysis , journal =. 2006 , note =. doi:https://doi.org/10.1016/j.patrec.2005.10.010 , url =

work page doi:10.1016/j.patrec.2005.10.010 2006

[44] [44]

, title =

Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q. , title =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =. 2017 , publisher =

work page 2017

[45] [45]

International Conference on Learning Representations , year=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. International Conference on Learning Representations , year=

work page

[46] [46]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=. 2023 , address=. doi:10.18653/v1/2023.acl-long.557 , url=

work page doi:10.18653/v1/2023.acl-long.557 2023

[47] [47]

2022 , eprint=

Language Models (Mostly) Know What They Know , author=. 2022 , eprint=

work page 2022