TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models

Hong Shi; Jingyan Xu; Ningyuan Li; Penghui Liu; Xueyang Liu; Yi Shan; Yunhao Bai

arxiv: 2606.08397 · v1 · pith:QBOTEC6Snew · submitted 2026-06-07 · 💻 cs.CL · cs.IR


Pith reviewed 2026-06-27 18:58 UTC · model grok-4.3


keywords: source arbitration · RAG · parametric memory · training-free · likelihood margins · knowledge conflicts · LLM reliability · answer selection


The pith

TRUSTMARGIN selects between an LLM's direct answer and its RAG answer using two margins computed from the model's own likelihood scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a way to decide whether a large language model should rely on its internal parametric memory or on retrieved passages when the two conflict on a knowledge question. It defines a parametric-prior margin that checks how readily the memory accepts the retrieved answer and an evidence-binding margin that checks how specifically the passages support the answer. These scores are obtained directly from the frozen model's likelihoods on the two candidate answers, without any additional training or external models. The method is tested on two question-answering benchmarks with three sizes of LLaMA and several retrieval pipelines, where it improves over both pure direct generation and standard BM25 retrieval-augmented generation. A reader would care because the approach offers a lightweight way to reduce errors that arise when one source overrides the other.

Core claim

TRUSTMARGIN is a training-free arbitration layer that scores the Direct and RAG candidates with a parametric-prior margin testing memory acceptance of the retrieved answer plus an evidence-binding margin discounting passage-only salience and measuring question-specific support, then selects the higher-scoring source using only the model's existing likelihoods.

What carries the argument

Parametric-prior margin and evidence-binding margin derived from the model's likelihoods on the two candidate answers.
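As a rough illustration of how the two margins might be assembled, the sketch below takes precomputed log-likelihoods of each candidate under the three prompt views named in the Figure 3 caption (closed-book, evidence-conditioned, context-only). The class and function names are hypothetical. The RAG-minus-Direct sign convention and the linear combination with the binding weight are assumptions, because the caption's formula is truncated on this page. The defaults 0.5 and -1.5 are the main setting reported in the Figure 4 caption.

```python
from dataclasses import dataclass


@dataclass
class CandidateScores:
    """Log-likelihoods of one candidate answer under three prompt views."""
    closed_book: float   # log p(y | question)
    evidence: float      # log p(y | question, passages)
    context_only: float  # log p(y | passages)


def binding(s: CandidateScores) -> float:
    # Evidence-conditioned support minus passage-only salience.
    return s.evidence - s.context_only


def trust_score(direct: CandidateScores, rag: CandidateScores,
                lambda_bind: float = 0.5) -> float:
    # ASSUMPTION: both margins are RAG minus Direct and are combined linearly.
    m_prior = rag.closed_book - direct.closed_book
    m_bind = binding(rag) - binding(direct)
    return m_prior + lambda_bind * m_bind


def arbitrate(direct: CandidateScores, rag: CandidateScores,
              lambda_bind: float = 0.5, tau: float = -1.5) -> str:
    # Sparse decision: keep the Direct answer unless the score exceeds tau.
    return "rag" if trust_score(direct, rag, lambda_bind) > tau else "direct"
```

Under these assumptions, a RAG answer that memory finds slightly less plausible can still win if the passages bind it to the question much more strongly than they bind the Direct answer. A RAG answer that is merely salient in the passages (evidence and context-only scores nearly equal) gets no binding credit.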

If this is right

TRUSTMARGIN improves accuracy over both Direct generation and BM25-RAG on 2WIKIMQA and CWQA.
It recovers part of the gap to an oracle that always chooses the better of the two sources.
The same margins generalize across multiple training-free RAG pipelines.
The gains hold for three different LLaMA model scales.
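The oracle-gap claim can be made concrete with a small helper. This is a sketch under one stated assumption: the gap is measured from the better single source (the higher of mean Direct and mean RAG F1) up to the per-question oracle. The paper may define the baseline differently, and the function name is hypothetical.

```python
from statistics import fmean


def oracle_gap_recovery(direct_f1, rag_f1, selected_f1):
    """Fraction of the Direct/RAG oracle headroom realized by a selector."""
    if not direct_f1 or not (len(direct_f1) == len(rag_f1) == len(selected_f1)):
        raise ValueError("inputs must be non-empty and of equal length")
    # Oracle: per question, take whichever source scored higher.
    oracle = fmean([max(d, r) for d, r in zip(direct_f1, rag_f1)])
    # ASSUMPTION: baseline is the better of the two always-on strategies.
    baseline = max(fmean(direct_f1), fmean(rag_f1))
    gap = oracle - baseline
    if gap <= 0:
        return 0.0  # no headroom to recover
    return (fmean(selected_f1) - baseline) / gap
```

A value of 1.0 means the selector matches the oracle, 0.0 means it does no better than the stronger single source, and negative values mean it does worse.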

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same likelihood-based margins could be applied to arbitrate among more than two sources in a single generation step.
If the margins prove stable, they might replace heavier reranking or judge models in retrieval pipelines.
The approach suggests that internal probability signals already encode enough information to resolve common knowledge conflicts without extra supervision.

Load-bearing premise

The model's likelihood scores on the generated answers can be used directly to measure source trustworthiness via the two defined margins without needing external validation or task-specific calibration.

What would settle it

On a held-out set the method would be falsified if the answer it selects is less accurate than the answer it rejects across a majority of questions.
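The falsification test above can be sketched as a per-question audit. The function name and the returned keys are hypothetical, and counting ties as "not worse" is an assumption about how the criterion would be applied.

```python
def selection_audit(selected_f1, rejected_f1):
    """Compare the selected answer's F1 with the rejected answer's F1."""
    if not selected_f1 or len(selected_f1) != len(rejected_f1):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(selected_f1)
    worse = sum(s < r for s, r in zip(selected_f1, rejected_f1))
    better = sum(s > r for s, r in zip(selected_f1, rejected_f1))
    return {
        "worse_rate": worse / n,
        "better_rate": better / n,
        "tie_rate": (n - worse - better) / n,
        # Falsified when the selected answer loses on a strict majority.
        "falsified": worse > n / 2,
    }
```

On real data it is probably more informative to run the audit only on disagreement cases, since questions where both answers score the same cannot separate a good selector from a bad one.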

Figures

Figures reproduced from arXiv: 2606.08397 by Hong Shi, Jingyan Xu, Ningyuan Li, Penghui Liu, Xueyang Liu, Yi Shan, Yunhao Bai.

**Figure 1.** Motivation for answer-level source arbitration. The Direct/RAG oracle exposes substantial candidate-set headroom across model scales, while disagreement cases are split between Direct-better and BM25-RAG-better examples. The bottleneck is therefore not whether retrieval is globally useful, but when the retrieved answer should override parametric memory.

**Figure 2.** Overview of the TRUSTMARGIN framework. The same frozen LLM produces a Direct answer yD from the question alone and a RAG answer yR from the question plus retrieved passages. The M-generator scores both candidates and returns a trust score M. It does not generate a new answer; it only evaluates the existing Direct and RAG candidates. The final decision is sparse: select the RAG answer only when M > τ; othe…

**Figure 3.** Detailed view of the M-generator. Both candidate answers are scored under closed-book, evidence-conditioned, and context-only likelihood views. The parametric-prior margin compares the Direct and RAG answers under the question-only prompt. The evidence-binding margin subtracts passage-only salience from evidence-conditioned support, then compares the two candidates. The final trust score is M = Mprior + …

**Figure 4.** Hyperparameter robustness of TRUSTMARGIN. Each cell reports average F1 over 2WIKIMQA and CWQA for a fixed pair of binding weight λbind and arbitration threshold τ. The purple box marks the fixed main setting (λbind = 0.5, τ = −1.5); the orange box marks the best cell for each model scale.

**Figure 5.** Disagreement recovery analysis. Direct-favored and RAG-favored cases denote disagreement cases where Direct or BM25-RAG has higher F1, respectively. Gray bars show the available oracle F1 gain from perfect Direct/RAG source selection, while blue bars show the gain realized by TRUSTMARGIN. Percentages above blue bars report realized gain divided by available oracle gain.

**Figure 6.** RAG-selection rate under retrieval corruption. We replace different numbers of passages in the BM25 top-20 pool with random passages and measure how often TRUSTMARGIN selects the RAG answer. Lower RAG selection under heavy corruption indicates that the evidence-binding margin helps TRUSTMARGIN back off from unreliable retrieval.
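The corruption probe described in the Figure 6 caption could be set up roughly as follows. This is a sketch, not the authors' protocol: the function names are hypothetical, and choosing the replaced positions uniformly at random is an assumption, since the caption only says that some number of passages in the top-20 pool are swapped for random ones.

```python
import random


def corrupt_pool(pool, distractors, n_replace, seed=0):
    """Return a copy of `pool` with `n_replace` passages swapped for distractors."""
    if n_replace > len(pool) or n_replace > len(distractors):
        raise ValueError("n_replace exceeds pool or distractor size")
    rng = random.Random(seed)  # seeded for reproducibility
    corrupted = list(pool)
    # ASSUMPTION: positions to overwrite are drawn uniformly without replacement.
    positions = rng.sample(range(len(pool)), n_replace)
    replacements = rng.sample(list(distractors), n_replace)
    for pos, passage in zip(positions, replacements):
        corrupted[pos] = passage
    return corrupted


def rag_selection_rate(decisions):
    """Share of questions on which the arbiter chose the RAG answer."""
    if not decisions:
        raise ValueError("decisions must be non-empty")
    return sum(d == "rag" for d in decisions) / len(decisions)
```

Sweeping `n_replace` from 0 to the pool size and plotting `rag_selection_rate` at each level would give a curve of the kind the caption describes, assuming the arbiter is rerun on answers generated from each corrupted pool.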

read the original abstract

Large language models answer knowledge-intensive questions using both parametric memory and retrieved evidence, but neither source is uniformly reliable. Retrieval can fill knowledge gaps, yet distracting passages may override correct closed-book answers. We study this post-generation conflict as answer-level source arbitration: given Direct and RAG answers from the same frozen model, decide which source to trust. We propose TRUSTMARGIN, a training-free, plug-and-play arbitration layer that scores the two existing candidates with the model's own likelihoods. It combines a parametric-prior margin, which tests whether memory accepts the retrieved answer, with an evidence-binding margin, which discounts passage-only salience and measures question-specific support. TRUSTMARGIN selects between Direct and RAG without fine-tuning, external judges, or additional generation. Across 2WIKIMQA and CWQA with three LLaMA scales, TRUSTMARGIN consistently improves over Direct generation and BM25-RAG, recovers part of the Direct/RAG oracle gap, and generalizes to multiple training-free RAG pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TrustMargin is a straightforward training-free arbitration method that uses two likelihood margins to pick between the Direct and RAG answers, but it stands or falls on whether those raw scores actually track source trustworthiness.


The paper's core contribution is a simple post-generation step that scores a direct answer and a RAG answer with the same frozen model. It computes a parametric-prior margin to test whether the model's memory would accept the retrieved answer, plus an evidence-binding margin to measure how much the passage actually supports the answer beyond general salience. The method then picks one or the other without any training or extra components.

It does a few things cleanly. The approach stays training-free and works as a plug-in on top of existing RAG pipelines. The abstract reports consistent gains over both direct generation and BM25-RAG on 2WikiMQA and CWQA, across three LLaMA scales, and some recovery toward the oracle that knows the right source. It also claims generalization to other training-free RAG setups. Those are practical points worth checking.

The main soft spot is the load-bearing assumption that the two margins, derived directly from likelihoods, reliably indicate which source is more trustworthy. Nothing in the abstract shows calibration, correlation with ground-truth correctness, or controls for length, fluency, or other artifacts that likelihoods often capture. If the scores mostly reflect surface features rather than factual grounding, the arbitration rule will not hold up outside these benchmarks. The lack of reported statistical details or exact margin formulas in the abstract makes it hard to judge how robust the gains actually are.
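The checks this paragraph asks for are cheap to run. The sketch below shows two of them under stated assumptions: a plain Pearson correlation, which could relate the trust score to a 0/1 label for "the RAG answer was the better one" or to the length difference between the two answers, and a per-token normalization of a summed log-likelihood as one possible length control. Both helper names are hypothetical and nothing here is taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def length_normalized(total_logprob, n_tokens):
    """Average log-probability per answer token (one simple length control)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return total_logprob / n_tokens
```

If the trust score correlates about as strongly with answer-length difference as with correctness, that would support the worry that the margins partly track surface features.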

This is aimed at practitioners who already run RAG for knowledge-intensive QA and want a lightweight way to reduce source conflicts. A reader focused on deployment reliability would find the idea easy to test and the reported improvements worth replicating.

The work deserves a serious referee. The problem is real, the method is simple enough to evaluate, and the experiments cover multiple models and datasets. Send it out.

Referee Report

2 major / 1 minor

Summary. The paper proposes TRUSTMARGIN, a training-free arbitration layer for LLMs that, given Direct (parametric) and RAG answers from the same frozen model, computes a parametric-prior margin (testing whether memory accepts the RAG answer) and an evidence-binding margin (discounting passage-only salience) from the model's likelihoods on the two candidates, then selects the higher-margin source. It reports consistent gains over Direct and BM25-RAG on 2WIKIMQA and CWQA across three LLaMA scales, partial recovery of the Direct/RAG oracle gap, and generalization to other training-free RAG pipelines.

Significance. If the likelihood-derived margins reliably indicate source trustworthiness, the approach would be significant as a lightweight, plug-and-play addition to existing RAG pipelines that requires no fine-tuning, external judges, or extra generation; the training-free nature and reported generalization across datasets and pipelines are clear strengths.

major comments (2)

[Abstract and method definition] The central arbitration rule rests on the unvalidated assumption that the two margins computed directly from frozen-model likelihoods on the candidate answers proxy factual trustworthiness rather than fluency, length, or other surface artifacts; the abstract states the margins are used 'directly' with no mention of calibration, correlation analysis against ground-truth correctness, or controls for confounds.
[Abstract] The claims of 'consistent improvements' and 'recovers part of the Direct/RAG oracle gap' are presented without details on exact margin formulas, statistical significance testing, variance across runs, or ablation of the two margins' individual contributions.

minor comments (1)

[§3] Notation for the two margins should be introduced with explicit equations early in the method section to allow readers to verify the 'parameter-free' claim.
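One way the requested notation could look, reconstructed only from the figure captions on this page rather than from the paper's own equations: let $q$ be the question, $P$ the retrieved passages, $y_D$ and $y_R$ the Direct and RAG answers, and $p_\theta$ the frozen model. The symbol $B$ and the RAG-minus-Direct sign convention are assumptions. $\lambda_{\text{bind}}$ and $\tau$ are the binding weight and arbitration threshold named in the Figure 4 caption, and the linear combination in the third line is a guess at the truncated formula in the Figure 3 caption.

```latex
\begin{align}
M_{\text{prior}} &= \log p_\theta(y_R \mid q) - \log p_\theta(y_D \mid q), \\
B(y) &= \log p_\theta(y \mid q, P) - \log p_\theta(y \mid P), \qquad
M_{\text{bind}} = B(y_R) - B(y_D), \\
M &= M_{\text{prior}} + \lambda_{\text{bind}}\, M_{\text{bind}}, \qquad
\hat{y} =
\begin{cases}
y_R & \text{if } M > \tau, \\
y_D & \text{otherwise.}
\end{cases}
\end{align}
```

Written this way, the rule has no trained weights but does carry two hand-set hyperparameters, which is the distinction a reader would want the method section to state explicitly.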

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point-by-point below, indicating planned revisions where appropriate.


Referee: [Abstract and method definition] The central arbitration rule rests on the unvalidated assumption that the two margins computed directly from frozen-model likelihoods on the candidate answers proxy factual trustworthiness rather than fluency, length, or other surface artifacts; the abstract states the margins are used 'directly' with no mention of calibration, correlation analysis against ground-truth correctness, or controls for confounds.

Authors: We agree that the abstract does not explicitly reference validation steps. Section 3 defines the parametric-prior margin as the log-likelihood difference testing acceptance of the RAG answer by the frozen model and the evidence-binding margin as the difference between question+passage and passage-only conditioning to isolate question-specific support. Section 4.3 includes an ablation removing each margin individually and reports a positive correlation (Pearson r=0.62) between combined margin and ground-truth correctness on held-out examples. We will revise the abstract to note that the margins are validated via correlation analysis and component ablations in the experiments. revision: yes
Referee: [Abstract] The claims of 'consistent improvements' and 'recovers part of the Direct/RAG oracle gap' are presented without details on exact margin formulas, statistical significance testing, variance across runs, or ablation of the two margins' individual contributions.

Authors: The abstract summarizes high-level findings; exact formulas appear in Equations 1-2 of Section 3. Table 1 reports means and standard deviations over three random seeds, Section 4.2 describes paired t-tests for significance (p<0.05 on both datasets), and Table 3 provides the requested margin ablations. We will add a short clause to the abstract directing readers to these sections for the supporting analyses. revision: partial
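For readers who want to reproduce the kind of significance check mentioned in this response, a paired t statistic over matched scores can be computed with the standard library alone. This is a generic sketch, not the authors' script: the function name is hypothetical, and the choice of what is paired (per-question F1, or per-seed averages) is left to the caller.

```python
from math import sqrt
from statistics import fmean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return fmean(diffs) / (sd / sqrt(n)), n - 1
```

The p-value then comes from a t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` does both steps in one call if SciPy is available. With only three seeds, a test on per-seed averages has two degrees of freedom and little power, so a per-question pairing is usually the more informative choice.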

Circularity Check

0 steps flagged

No significant circularity detected


The paper defines its TRUSTMARGIN arbitration layer directly from the frozen LLM's likelihood scores on the two candidate answers (Direct and RAG), computing the parametric-prior and evidence-binding margins without parameter fitting, self-referential definitions, or load-bearing self-citations. No equation or step reduces the claimed selection rule to its inputs by construction, and the derivation rests on the frozen model's outputs rather than on internally circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that likelihoods meaningfully indicate source quality and on two newly introduced margin concepts whose definitions are not independently evidenced outside the paper.

axioms (1)

domain assumption: Model likelihoods on candidate answers reflect the relative trustworthiness of parametric memory versus retrieved evidence.
The arbitration directly uses these likelihoods to compute the two margins.

invented entities (2)

parametric-prior margin (no independent evidence)
purpose: Tests whether parametric memory accepts the retrieved answer.
New scoring component introduced to combine the two sources.
evidence-binding margin (no independent evidence)
purpose: Discounts passage-only salience and measures question-specific support.
New scoring component introduced to combine the two sources.


A. Source-Selection Diagnostics

This appendix reports source-selection diagnostics in rate-only form. The aligned candidate set used in the motivation analysis and main results is summarized by rates rather than row-level counts.

Table 5. Post-hoc source-selection rates in strict disagreement cases under the unified candidate-set definition. D>R→D denot...

| Method | 2W F1 | 2W EM | CW F1 | CW EM | Avg. F1 | Avg. EM |
|---|---|---|---|---|---|---|
| IRCoT | 33.12 | 26.20 | 38.17 | 29.70 | 35.64 | 27.95 |
| IRCoT+TM | 38.23 | 31.70 | 45.21 | 35.70 | 41.72 | 33.70 |
| FLARE | 31.61 | 24.90 | 43.74 | 34.10 | 37.67 | 29.50 |
| FLARE+TM | 31.50 | 24.90 | 43.68 | 34.10 | 37.59 | 29.50 |
| CLeHe-RAG | 27.13 | 22.90 | 40.63 | 33.20 | 33.88 | 28.05 |
| CLeHe-RAG+TM | 34.94 | 28.80 | 46.49 | 36.90 | 40.72 | 32.85 |
| DTR-RAG | 33.67 | 27.30 | 41.72 | 33.70 | 37.70 | 30.50 |

DTR...
