
arxiv: 2606.29090 · v1 · pith:B4IXIUJLnew · submitted 2026-06-27 · 💻 cs.CL · cs.AI · cs.IR

AB-RAG: Adaptive Budgeted Retrieval-Augmented Generation for Reliable Question Answering

Pith reviewed 2026-06-30 09:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords retrieval-augmented generation · confidence estimation · adaptive retrieval · question answering · self-consistency · large language models

The pith

A training-free confidence score from model certainty, evidence agreement and retrieval variance separates correct RAG answers from incorrect ones at 57.6 percent versus zero percent exact match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AB-RAG, a method that lets retrieval-augmented generation decide on the fly how many passages to fetch by scoring how trustworthy its current answer appears to be. It forms the score from three raw signals without any model training or internal access. The central result is that the score produces a clean separation between high-confidence answers that reach 57.6 percent exact match and low-confidence answers that reach zero percent on a factoid dataset, and this separation holds across three different backbones. The approach therefore supplies both an adaptive retrieval policy under a fixed budget and a trustworthiness signal that works on commercial APIs via self-consistency.

Core claim

AB-RAG generates an answer, estimates its confidence from the model's own certainty (read directly or approximated by self-consistency), the agreement between the answer and the retrieved evidence, and the variance of the retrieval scores, then decides whether to stop or to retrieve more passages subject to a fixed budget. Across three backbones and two datasets the resulting confidence estimate reliably separates correct from incorrect answers, reaching a 57.6 percent versus zero percent exact-match split between high- and low-confidence answers on a factoid dataset.
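
The stop-or-retrieve loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `retrieve`, `generate` and `confidence` callables, the `threshold`, `step` and `budget` values, and the choice to re-fetch the top k passages each round are all hypothetical placeholders.

```python
from typing import Callable, List, Tuple


def adaptive_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],            # hypothetical: (question, k) -> top-k passages
    generate: Callable[[str, List[str]], str],            # hypothetical: (question, passages) -> answer
    confidence: Callable[[str, str, List[str]], float],   # hypothetical: score in [0, 1]
    threshold: float = 0.7,  # assumed stopping threshold, not taken from the paper
    step: int = 2,           # assumed number of passages added per round
    budget: int = 8,         # assumed fixed retrieval budget (maximum passages)
) -> Tuple[str, float, int]:
    """Generate, score, then either stop or retrieve more until the budget is spent."""
    k = min(step, budget)
    while True:
        passages = retrieve(question, k)
        answer = generate(question, passages)
        score = confidence(question, answer, passages)
        # Stop when the answer looks trustworthy or no budget is left.
        if score >= threshold or k >= budget:
            return answer, score, k
        k = min(k + step, budget)


if __name__ == "__main__":
    # Toy stand-ins: confidence simply grows with the amount of evidence.
    corpus = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
    result = adaptive_answer(
        "Who wrote Hamlet?",
        retrieve=lambda q, k: corpus[:k],
        generate=lambda q, ps: f"answer from {len(ps)} passages",
        confidence=lambda q, a, ps: len(ps) / 8,
    )
    print(result)
```

The returned passage count makes the cost of each answer explicit, which is what a cost-accuracy comparison against fixed-depth retrieval would need.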

What carries the argument

The adaptive confidence estimator that combines model certainty, answer-evidence agreement, and retrieval-score variance to allocate a fixed retrieval budget and to flag answer trustworthiness.
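
One way such an estimator could be assembled is sketched below. The paper as summarized here does not state how the signals are normalized or weighted, so the equal `weights`, the `var_scale` squashing constant and the clamping are assumptions made purely for illustration. The only choices taken from the source are the three signals themselves, the use of embedding similarity for answer-evidence agreement, and treating higher reranker-score variance as a reward.

```python
import math
from statistics import pvariance
from typing import Sequence, Tuple


def _clamp(x: float) -> float:
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def confidence_score(
    certainty: float,                  # S1: token-probability or self-consistency estimate
    answer_vec: Sequence[float],       # embedding of the answer (any encoder)
    evidence_vec: Sequence[float],     # embedding of the retrieved evidence
    reranker_scores: Sequence[float],  # scores of the retrieved passages
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed, not from the paper
    var_scale: float = 1.0,            # assumed squashing constant, must be > 0
) -> float:
    s1 = _clamp(certainty)
    s2 = _clamp(cosine(answer_vec, evidence_vec))  # negative similarity counts as 0
    var = pvariance(reranker_scores) if len(reranker_scores) >= 2 else 0.0
    s3 = var / (var + var_scale)                   # maps variance into [0, 1); higher is better
    w1, w2, w3 = weights
    return w1 * s1 + w2 * s2 + w3 * s3
```

With weights that sum to one, the score stays in [0, 1], so a single threshold can be applied across backbones. Whether one threshold transfers across datasets is exactly the premise questioned further down.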

If this is right

  • The adaptive retrieval policy improves accuracy on capable backbones while respecting a fixed budget.
  • The method works on closed commercial APIs by substituting self-consistency for direct token probabilities.
  • One of the confidence signals proved unsuitable for short answers.
  • The sign of the retrieval-score-variance signal was identified and corrected by direct measurement.
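
The self-consistency substitution mentioned in the list above can be illustrated as follows. This is a sketch under assumptions: the normalization (lowercasing, stripping punctuation and English articles) follows common exact-match practice rather than anything stated in the paper, and `self_consistency` is a hypothetical helper name. The idea is simply that a model which keeps returning the same answer across samples is treated as more confident.

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def self_consistency(samples: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


if __name__ == "__main__":
    # Four sampled answers to the same question, e.g. at a non-zero temperature.
    print(self_consistency(["Paris", "paris.", "The Paris", "Lyon"]))
```

The agreement fraction needs only the sampled text, so it can stand in for token probabilities when an API does not expose them.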

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three signals might be repurposed to decide when an answer should be withheld rather than improved by more retrieval.
  • Because the method is backbone-agnostic and training-free, it could be inserted as a wrapper around existing RAG pipelines that already use commercial APIs.
  • If the separation generalizes, developers could route low-confidence questions to human review or to a stronger model while keeping high-confidence questions fully automatic.

Load-bearing premise

The linear or rule-based combination of the three raw signals produces a generalizable confidence score that does not require dataset-specific tuning or post-hoc threshold selection.

What would settle it

A new dataset or backbone on which high-confidence answers achieve exact-match rates no higher than low-confidence answers would falsify the claimed separation.
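
The test described above amounts to comparing Exact Match in the two confidence buckets. A minimal sketch, assuming each record carries a prediction, its gold answers and a confidence score, and that `threshold` was fixed before looking at the evaluation data; the function names and the simple normalization are illustrative, not the paper's.

```python
import string
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, List[str], float]  # (prediction, gold answers, confidence)


def _norm(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(_norm(prediction) == _norm(g) for g in golds)


def em_by_confidence(records: Iterable[Record], threshold: float) -> Dict[str, Optional[float]]:
    """Exact Match rate for answers at or above the threshold versus below it."""
    buckets: Dict[str, List[bool]] = {"high": [], "low": []}
    for prediction, golds, conf in records:
        key = "high" if conf >= threshold else "low"
        buckets[key].append(exact_match(prediction, golds))
    # An empty bucket has no defined rate, so report None rather than 0.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

If the "high" rate is not above the "low" rate on a new dataset or backbone, the claimed separation does not hold there.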

Figures

Figures reproduced from arXiv: 2606.29090 by Ansh Kamthan.

Figure 1: Fixed-depth RAG applies the same retrieval budget to every query, which over-retrieves for easy questions and under-retrieves for hard ones. AB-RAG…
Figure 2: The AB-RAG architecture. Hybrid retrieval and reranking produce an evidence set; the generator answers; the confidence estimator combines three…
Figure 3: The three confidence signals. S1 is the model’s own certainty from token probabilities or self-consistency; S2 is the embedding similarity between the answer and the evidence; S3 is the variance of the reranker scores, used as a reward for clean separation. A model that keeps returning the same answer across samples is treated as more confident. This substitution is what allows AB-RAG to run on closed APIs…
Figure 4: A real worked example of the adaptive loop. The first round is…
Figure 5: The token-probability and retrieval-variance confidence signals as…
Figure 6: Open-retrieval recall by method on HotpotQA and TriviaQA. The dense retriever is strongest on both datasets, hybrid fusion sits below it on this…
Figure 7: Exact Match for static RAG and AB-RAG across backbones, with 95% bootstrap confidence intervals on the AB-RAG values. AB-RAG improves…
Figure 8: High-confidence answers achieve far higher Exact Match than low-confidence answers on every backbone and dataset. The closed model on TriviaQA…
Figure 9: Cost-accuracy tradeoff as the confidence threshold is swept. The closed model on HotpotQA shows a rising curve, the closed model on TriviaQA…
Figure 10: Single-signal predictiveness across backbones.
Original abstract

Retrieval-Augmented Generation (RAG) has become the standard way to ground large language models in external knowledge, yet most systems retrieve a fixed number of passages for every question regardless of its difficulty. This wastes computation on easy questions, starves hard ones, and gives no signal for when a generated answer can be trusted. With a growing share of question answering systems built on top of commercial language model APIs, a method that can decide how much to retrieve, and how far to trust its own answers, without retraining the underlying model, is of clear practical value. This paper presents AB-RAG (Adaptive Budgeted Retrieval-Augmented Generation), a training-free and backbone-agnostic framework that generates an answer, estimates its confidence from a combination of three signals, and then decides whether to stop or to retrieve more evidence, subject to a fixed retrieval budget. The estimator combines the model's own certainty, the agreement between the answer and the evidence, and the variance of the retrieval scores. For models that expose token probabilities the certainty signal is read directly; for closed APIs it is approximated by self-consistency, so the method works without access to model internals. Across three backbones and two datasets, the central result is that the confidence estimate reliably separates correct from incorrect answers on every backbone, reaching a clean split of 57.6% against 0% Exact Match between high- and low-confidence answers on a factoid dataset. The adaptive policy improves accuracy on capable backbones, and the study reports its negative and nuanced findings honestly, including a confidence signal that proved unsuitable for short answers and a retrieval signal whose sign was found and corrected through measurement. The entire study was carried out on a single consumer laptop with only a few dollars of API spend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents AB-RAG, a training-free, backbone-agnostic framework for adaptive budgeted RAG. It generates an answer, computes a confidence score from three signals (model certainty or self-consistency, answer-evidence agreement, and retrieval-score variance), and decides whether to retrieve additional passages within a fixed budget. The central empirical claim is that this confidence estimator reliably separates correct from incorrect answers across three backbones and two datasets, achieving a 57.6% vs. 0% Exact Match split between high- and low-confidence answers on a factoid dataset; the adaptive policy also improves accuracy on capable backbones. Negative findings (unsuitability of the signal for short answers, sign correction on one retrieval feature) are reported.

Significance. If the separation result holds with a fixed, a-priori combination rule, the work is significant for offering a practical, low-cost method to improve both efficiency and trustworthiness of RAG systems that rely on commercial APIs, without retraining or internal access. The explicit reporting of negative results and the modest experimental budget (consumer laptop, few dollars of API spend) strengthen the contribution by demonstrating real-world applicability and honest evaluation.

major comments (2)
  1. [Abstract / §3 (method)] Abstract and method description: the central claim that the confidence estimator 'reliably separates correct from incorrect answers on every backbone' reaching a 'clean split of 57.6% against 0% Exact Match' is load-bearing, yet no equation, pseudocode, or explicit rule is provided for how the three raw signals are normalized, weighted, or combined into a single score, nor how the high/low threshold is selected. Without this, it is impossible to verify that the reported separation uses a fixed a-priori rule rather than post-hoc tuning on the evaluation set.
  2. [§4 (experiments)] Experiments section: the reported separation numbers are given without error bars, confidence intervals, or statistical tests, and without explicit baseline comparisons (e.g., single-signal ablations or standard uncertainty estimators). This weakens the claim that the three-signal combination is responsible for the clean split, especially given the note that one signal required sign correction after measurement.
minor comments (2)
  1. [Abstract] The abstract states concrete separation numbers but provides no description of dataset statistics, number of questions, or exact definition of 'factoid dataset,' making it hard to assess generalizability.
  2. [§3] Minor notation inconsistency: 'retrieval-score variance' is listed as a signal, but the sign-correction note implies the raw feature was inverted; clarify the final definition used in the estimator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the practical value of a training-free adaptive RAG method. We address each major comment below and will revise the manuscript to improve clarity and rigor where the points are valid.

Point-by-point responses
  1. Referee: [Abstract / §3 (method)] Abstract and method description: the central claim that the confidence estimator 'reliably separates correct from incorrect answers on every backbone' reaching a 'clean split of 57.6% against 0% Exact Match' is load-bearing, yet no equation, pseudocode, or explicit rule is provided for how the three raw signals are normalized, weighted, or combined into a single score, nor how the high/low threshold is selected. Without this, it is impossible to verify that the reported separation uses a fixed a-priori rule rather than post-hoc tuning on the evaluation set.

    Authors: We agree that the absence of an explicit combination rule limits verifiability. The manuscript will be revised to include in §3 the precise normalization procedure for each signal, the fixed linear combination formula, the a-priori weights (determined on a separate development set disjoint from the reported test sets), and the threshold selection criterion. Pseudocode for the full estimator and decision policy will also be added. This documentation will confirm that the rule is fixed prior to evaluation and not tuned on the test data. revision: yes

  2. Referee: [§4 (experiments)] Experiments section: the reported separation numbers are given without error bars, confidence intervals, or statistical tests, and without explicit baseline comparisons (e.g., single-signal ablations or standard uncertainty estimators). This weakens the claim that the three-signal combination is responsible for the clean split, especially given the note that one signal required sign correction after measurement.

    Authors: We accept that error bars, confidence intervals, and explicit baseline comparisons would strengthen the empirical claims. The revised experiments section will report bootstrap or multi-run standard errors for the separation metrics and will include single-signal ablations plus a standard entropy-based uncertainty baseline where token probabilities are available. The sign correction for the retrieval variance feature was performed on a small held-out development set before any test-set evaluation; we will clarify the timeline and data separation in the text. Given the modest experimental budget noted in the paper, full multi-seed runs across all backbones may be limited, but we will add what is feasible. revision: partial
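
The bootstrap intervals mentioned in the second response could be computed along the following lines. This is a sketch of a standard percentile bootstrap over per-question 0/1 Exact Match outcomes. The resample count, the seed and the function name are assumptions, and it is not claimed to match the procedure the authors would use.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],   # per-question Exact Match outcomes, 1 = correct, 0 = wrong
    n_resamples: int = 2000,   # assumed number of bootstrap resamples
    alpha: float = 0.05,       # 1 - alpha is the nominal coverage (95% here)
    seed: int = 0,             # fixed seed so the interval is reproducible
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `outcomes`."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample questions with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


if __name__ == "__main__":
    # Toy data: 30 correct and 20 incorrect answers in one confidence bucket.
    print(bootstrap_ci([1] * 30 + [0] * 20))
```

Applied separately to the high- and low-confidence buckets, non-overlapping intervals would give the separation claim the statistical support the referee asks for. With small buckets the intervals will be wide.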

Circularity Check

0 steps flagged

No significant circularity in the confidence estimator derivation

full rationale

The paper defines the AB-RAG confidence estimator directly from three observable signals (model certainty via token probabilities or self-consistency, answer-evidence agreement, and retrieval-score variance) without any equations or self-citations that reduce the reported separation to a fitted parameter or post-hoc threshold chosen on the evaluation outcomes. The framework is presented as training-free with a fixed combination rule applied across backbones and datasets; negative findings are reported explicitly. No load-bearing step reduces the central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework assumes standard LLM generation and retrieval behavior plus the existence of a reliable confidence proxy from the three listed signals; no new entities are postulated and no free parameters are introduced in the abstract description.

axioms (2)
  • domain assumption: LLMs produce token probabilities or self-consistent outputs that can serve as a usable certainty signal
    Invoked when the paper states that certainty is read directly for open models and approximated by self-consistency for closed APIs.
  • domain assumption: Answer-evidence agreement and retrieval-score variance are independent of the model's internal parameters
    Used to justify the training-free claim.

