ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

Chenyu Wang; Yang Shu; Yingmin Liu

arxiv: 2605.17301 · v1 · pith:73FVVNOJnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 14:32 UTC · model grok-4.3

keywords: Retrieval-Augmented Generation · Knowledge Conflicts · Conflict Detection · Source Credibility · RAG Pipeline · LLM Refinement · Answer Generation

The pith

ConflictRAG detects knowledge conflicts among retrieved documents and resolves them before answer generation, raising RAG answer correctness by 5.3 to 6.1 percent over the strongest conflict-aware baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation systems assume the documents they retrieve are mutually consistent, yet real queries often surface contradictory facts from different sources. ConflictRAG adds a detection stage that first uses a cheap embedding-based classifier and then calls the language model only on uncertain cases. It next ranks sources by credibility through an entropy-weighted decision procedure and produces the answer from the most reliable subset. On three benchmarks this pipeline raises answer correctness by 5.3 to 6.1 percent over the strongest conflict-aware baseline while cutting API costs by 62 percent.
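
The routing idea in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `two_stage_detect`, the `low` and `high` thresholds (0.2 and 0.8), and the toy scorers are hypothetical, and the abstract does not say how the paper defines an uncertain case.

```python
from typing import Callable, List, Tuple


def two_stage_detect(
    pairs: List[Tuple[str, str]],
    cheap_prob: Callable[[str, str], float],
    llm_judge: Callable[[str, str], bool],
    low: float = 0.2,   # hypothetical threshold: below this, trust "no conflict"
    high: float = 0.8,  # hypothetical threshold: above this, trust "conflict"
) -> Tuple[List[bool], int]:
    """Label each document pair as conflicting or not.

    Returns the labels and the number of expensive LLM calls that were made.
    """
    labels: List[bool] = []
    llm_calls = 0
    for doc_a, doc_b in pairs:
        # Stage 1: cheap classifier returns a conflict probability in [0, 1].
        p = cheap_prob(doc_a, doc_b)
        if p >= high:
            labels.append(True)
        elif p <= low:
            labels.append(False)
        else:
            # Stage 2: only uncertain pairs are escalated to the LLM.
            labels.append(llm_judge(doc_a, doc_b))
            llm_calls += 1
    return labels, llm_calls


if __name__ == "__main__":
    # Toy stand-ins for the embedding MLP and the LLM judge (both hypothetical).
    toy_scores = {("A", "B"): 0.95, ("A", "C"): 0.05, ("B", "C"): 0.5}
    labels, calls = two_stage_detect(
        list(toy_scores),
        cheap_prob=lambda a, b: toy_scores[(a, b)],
        llm_judge=lambda a, b: True,
    )
    print(labels, calls)
```

In this toy run only the middle-confidence pair reaches the second stage, which is the mechanism by which a cascade of this kind would reduce API spend; the actual saving depends on where the thresholds sit and how the first-stage scores are distributed.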

Core claim

ConflictRAG is a framework that detects, classifies, and resolves knowledge conflicts in retrieved documents prior to answer generation; its two-stage detector combines an embedding MLP with selective LLM refinement, its credibility assessor uses Entropy-TOPSIS, and the overall system yields an 88.7 percent conflict-detection F1 together with consistent correctness gains that transfer across backbone models.

What carries the argument

Two-stage conflict detection module that pairs a lightweight embedding-based MLP classifier with selective LLM refinement, plus the Entropy-TOPSIS framework for data-driven source credibility assessment.
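
Entropy weighting and TOPSIS are both standard multi-criteria decision techniques (TOPSIS is due to Hwang and Yoon), and the sketch below shows the textbook way of combining them. It is not taken from the paper: the three criteria columns (authority, recency, agreement) and all scores are hypothetical, every criterion is assumed to be a strictly positive "higher is better" value, and the helper names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows = sources, columns = credibility criteria


def entropy_weights(X: Matrix) -> List[float]:
    """Entropy weight method: criteria that vary more across sources get more weight."""
    m, n = len(X), len(X[0])
    divergences = []
    for j in range(n):
        total = sum(X[i][j] for i in range(m))
        entropy = 0.0
        for i in range(m):
            p = X[i][j] / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(m)  # normalise to [0, 1]
        divergences.append(max(0.0, 1.0 - entropy))
    s = sum(divergences)
    if s == 0:  # every criterion is uniform: fall back to equal weights
        return [1.0 / n] * n
    return [d / s for d in divergences]


def topsis_scores(X: Matrix, w: List[float]) -> List[float]:
    """Closeness of each source to the ideal solution (1 = best, 0 = worst)."""
    m, n = len(X), len(X[0])
    norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(m))) for j in range(n)]
    V = [[w[j] * X[i][j] / norms[j] for j in range(n)] for i in range(m)]
    best = [max(V[i][j] for i in range(m)) for j in range(n)]
    worst = [min(V[i][j] for i in range(m)) for j in range(n)]
    scores = []
    for row in V:
        d_best = math.sqrt(sum((v - b) ** 2 for v, b in zip(row, best)))
        d_worst = math.sqrt(sum((v - c) ** 2 for v, c in zip(row, worst)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


def rank_sources(X: Matrix) -> List[int]:
    """Source indices ordered from most to least credible."""
    scores = topsis_scores(X, entropy_weights(X))
    return sorted(range(len(X)), key=lambda i: -scores[i])


if __name__ == "__main__":
    # Hypothetical scores: columns are authority, recency, agreement.
    sources = [[0.9, 0.8, 0.9], [0.5, 0.4, 0.7], [0.2, 0.3, 0.1]]
    print(entropy_weights(sources))
    print(rank_sources(sources))
```

The appeal of this combination is that the weights come from the spread of the data rather than from hand-set rules, which is consistent with the "data-driven" framing above; whether the paper uses exactly these normalisation choices is not stated in the abstract.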

If this is right

Reduces API costs by 62 percent while preserving 90.8 percent detection accuracy.
Raises source-selection accuracy by 7.1 percent over manual heuristics.
Delivers 5.3 to 6.1 percent absolute gains in final-answer correctness over the strongest prior conflict-aware baseline.
Introduces the Conflict-Aware RAG Score (CARS) as a diagnostic metric for conflict-handling performance.
Transfers effectively when the underlying language model is swapped for another backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detection-plus-credibility pattern could be inserted into multi-document summarization or fact-checking pipelines that also face conflicting inputs.
Real-time web retrieval, where contradictions arrive continuously, would be a natural next setting to measure how well the cost-saving two-stage detector scales.
Enterprise RAG deployments with internal policy documents might test whether the Entropy-TOPSIS ranking needs domain-specific calibration.
Adding an explicit uncertainty signal from the detector into the final prompt could further reduce hallucinations beyond the gains already measured.

Load-bearing premise

The three evaluation benchmarks contain representative distributions of knowledge conflicts that match real-world RAG usage.

What would settle it

Apply the same pipeline to a fresh test set built from production logs that contain longer chains of mutually inconsistent facts and check whether the reported F1 and correctness gains both fall below the levels shown on the original benchmarks.

Figures

Figures reproduced from arXiv: 2605.17301 by Chenyu Wang, Yang Shu, Yingmin Liu.

**Figure 2.** Two-stage conflict detection. Stage 1 (MLP) handles 73% …

**Figure 3.** (a) Correctness (%); error bars show ±1 std over 3 runs. (b) Radar on NQ-Conflict across six CARS dimensions; note that detection, resolution, and transparency are partly method-defined and structurally favor systems with explicit conflict modules (see Sect. III-E).

**Table II.** Ablation study on NQ-Conflict (n=500). Each row removes one module. Detection and resolution are the most critical. Columns: Variant, Corr.%, To…

Original abstract

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents, an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3–6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConflictRAG shows a practical two-stage detector plus Entropy-TOPSIS scorer that cuts RAG costs by 62% and adds small correctness gains, but the benchmarks may not match real retrieval conflicts.

The main takeaway is that ConflictRAG adds a cheap two-stage conflict detector and a credibility scorer to standard RAG, reporting 88.7% F1 on detection, 5.3–6.1% better answers than the strongest baseline, and 62% lower API cost. The pipeline also transfers across different backbone LLMs. Those are the concrete results worth noting first.

What stands out as new is the specific engineering combination: an embedding MLP for initial cheap filtering, LLM calls only on uncertain cases, and Entropy-TOPSIS for data-driven source selection instead of manual rules. The CARS metric they define for diagnostic evaluation is also a fresh addition for this sub-area.

The work does well on the cost number and the cross-LLM transfer, both of which are easy to check in practice and matter for anyone running RAG at scale.

The soft spots are in the evaluation. The three benchmarks may not contain the same mix of subtle, multi-source, or ambiguous conflicts that appear in production retrievals, so the F1 and correctness numbers could shrink outside these tests. The abstract also gives no error bars, dataset statistics, or ablation breakdowns, which leaves open how much each component actually drives the gains.

This paper is for engineers and researchers who already work on reliable RAG systems and need ways to handle conflicting documents without high extra cost. A reader focused on applied improvements will pick up usable pipeline ideas. I would send it to peer review. The claims are specific enough to verify and the components are straightforward to reproduce, so referees can ask for the missing controls and data details.

Referee Report

2 major / 2 minor

Summary. The paper presents ConflictRAG, a framework for detecting, classifying, and resolving knowledge conflicts in RAG systems prior to generation. It contributes a two-stage detection module (lightweight embedding MLP plus selective LLM refinement) that cuts API costs by 62% at 90.8% accuracy, an Entropy-TOPSIS method for source credibility that improves selection by 7.1%, and a new Conflict-Aware RAG Score (CARS) metric. Experiments on three benchmarks against six baselines report 88.7% conflict-detection F1 and 5.3–6.1% correctness gains, with effective transfer across backbone LLMs.

Significance. If the empirical results hold under more rigorous verification, the work addresses a clear practical gap in RAG reliability by explicitly managing inconsistent retrieved documents. The reported cost reduction and cross-LLM transferability are concrete engineering advantages, while CARS offers a diagnostic tool that could support future conflict-aware RAG research.

major comments (2)

[Experiments / Results] The central empirical claims (88.7% F1 and 5.3–6.1% correctness gains) are presented without error bars, ablation tables, or basic dataset statistics on conflict subtlety and source diversity in the three benchmarks. This directly affects verifiability of the headline numbers and generalization.
[§4 (Evaluation)] The assumption that the three evaluation benchmarks contain representative real-world knowledge conflicts (subtlety, source diversity, ambiguity) is load-bearing for the generalization of both the detector F1 and the resolver gains, yet no quantitative characterization or comparison to production RAG distributions is provided.

minor comments (2)

[Abstract] The abstract states both 90.8% detection accuracy and 88.7% F1; clarify the precise relationship and which metric is primary for the two-stage module.
[§3.3] The exact formula or computation steps for the invented Conflict-Aware RAG Score (CARS) should be stated explicitly to support reproducibility.
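
To make the first minor comment concrete: accuracy and F1 are computed from the same confusion matrix but weight it differently, so a detector can report a higher accuracy than F1 without any inconsistency. The counts below are invented for illustration and are not taken from the paper; the abstract's two figures may also refer to different components or datasets.

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Accuracy and F1 for the positive ("conflict") class from raw counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


if __name__ == "__main__":
    # Hypothetical counts: conflicts are the minority class (50 of 200 pairs).
    acc, f1 = accuracy_and_f1(tp=40, fp=10, fn=10, tn=140)
    print(f"accuracy={acc:.3f}  f1={f1:.3f}")
```

Because true negatives count toward accuracy but not toward F1, the gap between the two widens as conflicts become rarer, which is why the report asks which metric is primary.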

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on improving the verifiability of our empirical claims and the characterization of the benchmarks. We address each major comment below and indicate the revisions we will incorporate.

Referee: [Experiments / Results] The central empirical claims (88.7% F1 and 5.3–6.1% correctness gains) are presented without error bars, ablation tables, or basic dataset statistics on conflict subtlety and source diversity in the three benchmarks. This directly affects verifiability of the headline numbers and generalization.

Authors: We agree that error bars, ablation tables, and dataset statistics are necessary for stronger verifiability. In the revised manuscript we will report standard error bars over multiple random seeds for all headline metrics, add ablation tables isolating the two-stage detector and Entropy-TOPSIS components, and include basic statistics on each benchmark (conflict-type distribution, average number of conflicting sources per query, lexical/semantic overlap for subtlety, and source-domain entropy for diversity). These additions will appear in Section 4 and the appendix.
revision: yes

Referee: [§4 (Evaluation)] The assumption that the three evaluation benchmarks contain representative real-world knowledge conflicts (subtlety, source diversity, ambiguity) is load-bearing for the generalization of both the detector F1 and the resolver gains, yet no quantitative characterization or comparison to production RAG distributions is provided.

Authors: We acknowledge the value of quantitative characterization. The revision will add explicit metrics for conflict subtlety (average embedding cosine distance and lexical overlap), source diversity (domain distribution and credibility variance), and ambiguity (entropy of retrieved passages) across the three benchmarks. A direct comparison to production RAG distributions cannot be performed, as it requires proprietary logs from deployed systems that are not publicly available; we will state this limitation explicitly while noting that the chosen benchmarks are the most conflict-intensive public resources currently used in the literature.
revision: partial
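
The benchmark statistics named in the responses above are all simple to compute. The following is a rough sketch of three of them, assuming whitespace tokenisation for lexical overlap (Jaccard), plain Python lists for embeddings, and one domain label per retrieved source; the function names and example inputs are hypothetical, and the rebuttal does not specify the exact definitions the authors would use.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Lexical overlap: shared tokens divided by all distinct tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def shannon_entropy(labels: List[str]) -> float:
    """Entropy in bits of a label distribution, e.g. source domains."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(jaccard_overlap("the sky is blue", "the sky is green"))
    print(shannon_entropy(["news", "news", "wiki", "wiki"]))
```

A pair with high lexical overlap but a conflicting answer would count as a subtle conflict under this reading, while higher domain entropy would indicate more diverse sources.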

standing simulated objections (not resolved)

Direct quantitative comparison of benchmark conflict characteristics to proprietary production RAG distributions, owing to the unavailability of such data.

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmark evaluation

The paper presents ConflictRAG as an empirical framework consisting of a two-stage detector, Entropy-TOPSIS resolver, and CARS metric, all evaluated on three external benchmarks against six baselines. No equations, predictions, or first-principles derivations are claimed; performance numbers (88.7% F1, 5.3-6.1% gains) are reported from direct experiments rather than reduced to fitted parameters or self-citations by construction. Any prior citations are non-load-bearing and do not substitute for the reported results. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the three chosen benchmarks are representative of real RAG conflicts and that the Entropy-TOPSIS weighting produces stable credibility rankings; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Retrieved documents in RAG can be meaningfully classified into conflict categories using embedding similarity plus selective LLM review.
Invoked in the description of the two-stage conflict detection module.

invented entities (1)

Conflict-Aware RAG Score (CARS) · no independent evidence
purpose: Diagnostic evaluation of conflict-handling capabilities
New metric introduced for the framework; no independent evidence outside the paper is provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: Entropy-TOPSIS framework for data-driven source credibility assessment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474.
[2] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
[3] J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” in AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 17754–17762.
[4] R. Xu, Z. Qi, C. Wang, H. Wang, Y. Zhang, and W. Xu, “Knowledge conflicts for LLMs: A survey,” arXiv preprint arXiv:2403.08319, 2024.
[5] J. Xie, K. Zhang, J. Chen, R. Zhu, and Y. Xiao, “Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts,” in International Conference on Learning Representations, 2024.
[6] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in International Conference on Learning Representations, 2024.
[7] S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling, “Corrective retrieval augmented generation,” arXiv preprint arXiv:2401.15884, 2024.
[8] Z. Jin, P. Cao, Y. Chen, K. Liu, X. Jiang, J. Xu, Q. Li, and J. Zhao, “Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models,” arXiv preprint arXiv:2402.14409, 2024.
[9] A. Cattan, A. Jacovi, O. Ram, J. Herzig, R. Aharoni, S. Goldshtein, E. Ofek, I. Szpektor, and A. Caciularu, “DRAGged into conflicts: Detecting and addressing conflicting sources in search-augmented LLMs,” arXiv preprint arXiv:2506.08500, 2025.
[10] S. Longpre, K. Perisetla, A. Chen, N. Ramesh, C. DuBois, and S. Singh, “Entity-based knowledge conflicts in question answering,” in Empirical Methods in Natural Language Processing, 2021, pp. 7052–7063.
[11] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi, “When not to trust language models: Investigating effectiveness of parametric and non-parametric memories,” in Annual Meeting of the Association for Computational Linguistics, 2023, pp. 9802–9822.
[12] Y. Wang, S. Feng, H. Wang, W. Shi, V. Balachandran, T. He, and Y. Tsvetkov, “Resolving knowledge conflicts in large language models,” arXiv preprint arXiv:2310.00935, 2023.
[13] S. Liu, Y. Shang, and X. Zhang, “TruthfulRAG: Resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs,” arXiv preprint arXiv:2511.10375, 2025.
[14] Q. Zhang, Z. Xiang, Y. Xiao, L. Wang, J. Li, X. Wang, and J. Su, “FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval-augmented generation,” arXiv preprint arXiv:2506.08938, 2025.
[15] Z. Ge, Y. Wu, D. W. K. Chin, R. K.-W. Lee, and R. Cao, “Resolving conflicting evidence in automated fact-checking: A study on retrieval-augmented LLMs,” arXiv preprint arXiv:2505.17762, 2025.
[16] H. Ye, S. Chen, Z. Zhong, C. Xiao, H. Zhang, Y. Wu, and F. Shen, “Seeing through the conflict: Transparent knowledge conflict handling in retrieval-augmented generation,” arXiv preprint arXiv:2601.06842, 2026.
[17] H. Wang, A. Prasad, E. Stengel-Eskin, and M. Bansal, “Retrieval-augmented generation with conflicting evidence,” arXiv preprint arXiv:2504.13079, 2025.
[18] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[19] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” in European Chapter of the Association for Computational Linguistics, 2024.
[20] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Empirical Methods in Natural Language Processing, 2017, pp. 670–680.
[21] C.-L. Hwang and K. Yoon, Multiple Attribute Decision Making: Methods and Applications. Berlin: Springer-Verlag, 1981.
[22] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, and K. Lee, “Natural questions: A benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
[23] S. Min, J. Michael, H. Hajishirzi, and L. Zettlemoyer, “AmbigQA: Answering ambiguous open-domain questions,” in Empirical Methods in Natural Language Processing, 2020, pp. 5783–5797.
[24] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Advances in Neural Information Processing Systems, vol. 36, 2023.
[25] OpenAI, “GPT-4o: System card,” Technical Report, 2024.
[26] S. Robertson and H. Zaragoza, “The probabilistic relevance framework: BM25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
[27] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave, “Unsupervised dense information retrieval with contrastive learning,” Transactions on Machine Learning Research, 2022.
