arxiv: 2604.09174 · v1 · submitted 2026-04-10 · 💻 cs.CL

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl

Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3

classification 💻 cs.CL

keywords: retrieval-augmented generation, hallucination, facet-level analysis, evidence grounding, natural language inference, question answering, LLM evaluation


The pith

Hallucinations in RAG systems arise mainly from how retrieved evidence is integrated during generation rather than from retrieval failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a facet-level diagnostics framework that decomposes input questions into atomic reasoning facets and builds a Facet x Chunk matrix to score evidence sufficiency and grounding. For each facet the matrix combines retrieval relevance with natural language inference faithfulness scores, then compares three controlled modes: Strict RAG that forces exclusive use of retrieved evidence, Soft RAG that permits mixing with parametric knowledge, and LLM-only generation. This setup exposes retrieval-generation misalignment on medical QA and HotpotQA across GPT, Gemini, and LLaMA models. A reader would care because standard answer-level or passage-level checks hide systematic override patterns, implying that better retrievers alone will not eliminate many hallucinations.

Core claim

The authors claim that retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated, is a primary driver of hallucinations. Their facet-level matrix reveals recurring failure modes including evidence absence, evidence misalignment, and prior-driven overrides. These patterns persist across open- and closed-source models and remain invisible under conventional answer-level evaluation.
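The three failure modes can be made concrete with a rule-based labeler over per-facet signals. The sketch below is illustrative only: the field names, the 0.5 thresholds, the decision order, and the mapping of contradiction to "override" are all assumptions, not the authors' published rules.

```python
from dataclasses import dataclass


@dataclass
class FacetSignals:
    max_relevance: float      # best retrieval relevance of any chunk for this facet
    max_entailment: float     # best NLI entailment of the facet answer by any chunk
    max_contradiction: float  # best NLI contradiction of the facet answer by any chunk


def label_facet(s: FacetSignals, rel_min: float = 0.5, nli_min: float = 0.5) -> str:
    """Assign a hypothetical taxonomy label to one facet (thresholds are assumed)."""
    if s.max_relevance < rel_min:
        return "evidence_absence"       # nothing relevant was retrieved
    if s.max_contradiction >= nli_min:
        return "evidence_override"      # relevant evidence exists, the answer contradicts it
    if s.max_entailment < nli_min:
        return "evidence_misalignment"  # relevant evidence exists, the answer is not grounded in it
    return "grounded"
```

Checking relevance first keeps retrieval failures separate from generation-time failures, which is the distinction the core claim depends on.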

What carries the argument

The Facet x Chunk matrix that pairs each atomic reasoning facet with retrieved chunks, scoring both relevance and NLI-based faithfulness to trace whether evidence is used or overridden.
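A minimal sketch of how such a matrix could be assembled is shown here. The `token_overlap` scorer is a toy stand-in for both the retriever score and the NLI model, and `build_matrix` and its argument layout are hypothetical names chosen for illustration, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in scorer: Jaccard overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_matrix(
    facets: List[Tuple[str, str]],         # (facet question, generated facet answer)
    chunks: List[str],
    relevance: Scorer = token_overlap,     # assumption: stands in for a retriever score
    faithfulness: Scorer = token_overlap,  # assumption: stands in for an NLI model
) -> List[List[Tuple[float, float]]]:
    """matrix[i][j] = (relevance of chunk j to facet i, faithfulness of answer i to chunk j)."""
    return [
        [(relevance(question, chunk), faithfulness(chunk, answer)) for chunk in chunks]
        for question, answer in facets
    ]


facets = [("who directed blade runner", "ridley scott directed blade runner")]
chunks = ["blade runner was directed by ridley scott", "paris is in france"]
matrix = build_matrix(facets, chunks)
```

Keeping the two scores as a pair in each cell, rather than collapsing them immediately, makes it possible to tell "relevant but not used" apart from "not retrieved at all".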

Load-bearing premise

That automatically decomposing questions into atomic facets and scoring them with NLI faithfulness reliably captures whether evidence is integrated or overridden during generation.

What would settle it

An experiment showing that strict evidence-only generation produces the same hallucination rate as mixed or no-retrieval modes on the same facets would falsify the claim that integration is the main driver.
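The comparison described above reduces to a per-mode hallucination rate over the same facets. The following sketch assumes each facet answer has already been judged hallucinated or not; the mode names and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hallucination_rate_by_mode(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (mode, is_hallucinated) pairs, one per facet answer."""
    totals: Dict[str, int] = defaultdict(int)
    hallucinated: Dict[str, int] = defaultdict(int)
    for mode, is_hallucinated in records:
        totals[mode] += 1
        hallucinated[mode] += int(is_hallucinated)
    return {mode: hallucinated[mode] / totals[mode] for mode in totals}


# Toy records for illustration only; these are not results from the paper.
toy_records = [
    ("strict", False), ("strict", True),
    ("soft", True), ("soft", True),
    ("llm_only", True), ("llm_only", False),
]
rates = hallucination_rate_by_mode(toy_records)
```

If the rates for the three modes were statistically indistinguishable on real data, that would count against the claim that integration is the main driver.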

Figures

Figures reproduced from arXiv: 2604.09174 by Markus Schedl, Monorama Swain, Passant Elchafei, Shahed Masoudian.

**Figure 2.** HotpotQA: Evidence Taxonomy Distribution

**Figure 3.** Medical Dataset: Facet semantic type × evidence taxonomy distribution. Boolean and Temporal facets show highest failure rates. Comparative facets are most unstable with highest misalignment and lowest robust rates. Misalignment nearly vanishes (0.7% versus 7.5% medical), confirming Wikipedia’s broad coverage provides better retrieval recall. The consistent 7:1 override-to-failure ratio across datasets (42…

**Figure 4.** HotpotQA: Facet Semantic Type × Evidence Taxonomy Distribution. Boolean and Temporal facets show lowest failure rates. Override rates are consistently high across all types. Comparative facets remain most unstable in both datasets. quality: models unpredictably either incorporate or contradict retrieved evidence. Detailed distributions in Appendix D

**Figure 5.** Illustrative facet-level diagnostic example

**Figure 6.** Medical Dataset: Facet-level faithfulness dis

**Figure 7.** Medical Dataset: Per-question ∆F1 distributions (Soft − Strict) by model
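The per-question ∆F1 (Soft − Strict) in Figure 7 can be illustrated with the token-level F1 commonly used for QA. Whether the paper uses exactly this F1 variant is an assumption, and `token_f1` and `delta_f1` are hypothetical helper names.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer (whitespace tokens)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def delta_f1(soft_answer: str, strict_answer: str, reference: str) -> float:
    """Positive values mean Soft RAG scored higher than Strict RAG on this question."""
    return token_f1(soft_answer, reference) - token_f1(strict_answer, reference)
```

A distribution of `delta_f1` values centered near zero would suggest that allowing parametric knowledge changes little, while a skew in either direction would indicate which mode integrates evidence better.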

Original abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
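The three inference modes differ only in what the model is told about the evidence. The sketch below shows one way to build such prompts; the instruction wording is hypothetical and is not the prompt text from the paper's appendix.

```python
from typing import List


def build_prompt(mode: str, facet: str, chunks: List[str]) -> str:
    """Build a facet-level prompt for one of three modes (wording is hypothetical)."""
    evidence = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    if mode == "strict":
        rule = "Answer using only the evidence below. If it is insufficient, say so."
    elif mode == "soft":
        rule = "Use the evidence below together with your own knowledge."
    elif mode == "llm_only":
        return f"Question: {facet}\nAnswer:"  # no retrieved evidence is shown
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{rule}\nEvidence:\n{evidence}\nQuestion: {facet}\nAnswer:"
```

Holding the facet and the chunks fixed while switching `mode` is what lets differences in the answers be attributed to evidence use rather than to retrieval.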

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague


This paper sketches a facet-level matrix for diagnosing RAG evidence use across strict, soft, and no-retrieval modes, but the central claim that integration problems outweigh retrieval ones lacks the numbers or checks needed to hold up. They split questions into atomic facets, score each against retrieved chunks with relevance plus NLI faithfulness, and compare the three generation regimes to surface misalignment and override patterns that answer-level metrics miss. The setup runs on medical QA and HotpotQA with GPT, Gemini, and LLaMA, and the motivation for finer-grained tracing in applied systems is straightforward.

The three-mode contrast is a practical way to separate retrieval failures from generation-time ones, and the matrix format makes the diagnostics readable. That part of the work is clean and internally consistent.

The main weakness is that the description stays at the framework level. No quantitative results, error rates, or human validation of the facet splits and NLI labels appear in the account, so it is hard to know whether the reported patterns reflect real model behavior or artifacts from automatic decomposition on multi-hop or domain text. The stress-test concern about NLI mislabeling partial grounding or implicit inference therefore lands, because without those checks the attribution to integration over retrieval cannot be trusted.

This is aimed at people who build or audit RAG pipelines and want diagnostic tools beyond standard accuracy scores. A reader working on reliability for QA tasks would pick up usable ideas for their own debugging, even if they have to add the validation themselves.

It deserves a serious referee because the diagnostic structure is new enough to get useful comments on how to ground the claims. I would send it to review with a request for human studies on the matrix and actual performance numbers to support the conclusions.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a facet-level diagnostics framework for analyzing hallucinations in Retrieval-Augmented Generation (RAG). It decomposes input questions into atomic reasoning facets and constructs a Facet x Chunk matrix that combines retrieval relevance scores with NLI-based faithfulness assessments. Three controlled generation modes—Strict RAG (evidence-only), Soft RAG (evidence plus parametric knowledge), and LLM-only—are compared on medical QA and HotpotQA using GPT, Gemini, and LLaMA. The central claim is that hallucinations arise primarily from evidence integration failures during generation rather than from retrieval inaccuracies, with the facet-level matrix exposing systematic patterns of misalignment and prior-driven overrides that answer-level metrics obscure.

Significance. If the automatic decomposition and NLI scoring are shown to align with human judgments of evidence usage, the framework would offer a useful interpretable diagnostic tool for RAG systems, shifting attention from retrieval accuracy to generation-time integration. The controlled three-mode comparison on public benchmarks is a clear strength, enabling isolation of integration effects without fitted parameters. This could guide targeted improvements in RAG prompting or fine-tuning. The work is currently limited by the absence of quantitative backing for its claims.

major comments (3)

[Section 4] Section 4 (Facet-Level Framework): The automatic protocol for decomposing questions into atomic facets is described at a high level but supplies no examples, decision criteria, or human validation of the resulting facets. Without this, the Facet x Chunk matrix cannot reliably distinguish evidence absence from misalignment or prior-driven overrides, which is load-bearing for the three-mode contrast and the claim that integration dominates retrieval.
[Section 5] Section 5 (Experiments): No quantitative results, error rates, or statistical comparisons are reported for the frequency of each failure mode across Strict RAG, Soft RAG, and LLM-only. The conclusion that 'hallucinations ... are driven less by retrieval accuracy and more by ... integration' therefore rests on unvalidated NLI labels, risking misattribution if the NLI model errs on medical terminology or multi-hop chains in HotpotQA.
[Section 5.1] Section 5.1 (NLI Faithfulness Scoring): The manuscript does not analyze or ablate the NLI model's accuracy on partial entailment, negation, or implicit inference, nor does it compare against human token-level usage in the generated outputs. This directly affects whether the reported 'evidence override' patterns reflect true generation behavior or scoring artifacts.
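The statistical comparison requested in the Section 5 comment could take the form of a Pearson chi-squared test on a mode-by-failure-mode count table. The sketch below computes only the statistic and degrees of freedom in plain Python; the table layout is an assumption, and every row and column total is assumed to be nonzero.

```python
from typing import List, Tuple


def chi_squared(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Toy counts for illustration only (rows: two modes, columns: override / no override).
stat, dof = chi_squared([[10, 20], [20, 10]])
```

The statistic is then compared against the chi-squared critical value for `dof` degrees of freedom, which is about 3.84 at the 0.05 level when `dof` is 1.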

minor comments (2)

[Section 3] The combination rule for retrieval relevance and NLI scores into the final matrix entry is stated narratively but would be clearer with an explicit equation or pseudocode.
[Section 2] Related-work discussion omits several recent papers on fine-grained hallucination detection and sub-sentence faithfulness in RAG.
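As an illustration of the explicit rule the Section 3 comment asks for, one candidate form is given below. The notation and the product-then-max aggregation are assumptions made here for illustration; the manuscript's actual rule is described only narratively.

```latex
% Assumed notation, not taken from the paper:
%   r_{ij}: retrieval relevance of chunk j to facet i, in [0, 1]
%   f_{ij}: NLI faithfulness (entailment probability) of the facet-i answer given chunk j
%   s_i:    a per-facet grounding score
M_{ij} = \bigl(r_{ij},\; f_{ij}\bigr), \qquad
s_i = \max_{j} \; r_{ij} \, f_{ij}
```

Under this form a facet scores high only when a single chunk is both relevant and entails the answer, so a relevant chunk that the answer ignores does not count as grounding.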

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important areas for improvement in our presentation of the facet-level framework. We will revise the manuscript to address these points by adding the requested details, examples, and quantitative analyses.


Referee: Section 4 (Facet-Level Framework): The automatic protocol for decomposing questions into atomic facets is described at a high level but supplies no examples, decision criteria, or human validation of the resulting facets. Without this, the Facet x Chunk matrix cannot reliably distinguish evidence absence from misalignment or prior-driven overrides, which is load-bearing for the three-mode contrast and the claim that integration dominates retrieval.

Authors: We agree that the current description lacks sufficient detail for full reproducibility and validation. In the revised manuscript, we will provide concrete examples of question-to-facet decomposition, specify the decision criteria (such as ensuring each facet is an independent, verifiable unit of reasoning), and include a human validation study on a subset of facets from the medical QA and HotpotQA datasets to measure agreement with human judgments. This will strengthen the foundation for our analysis of the Facet x Chunk matrix. revision: yes
Referee: Section 5 (Experiments): No quantitative results, error rates, or statistical comparisons are reported for the frequency of each failure mode across Strict RAG, Soft RAG, and LLM-only. The conclusion that 'hallucinations ... are driven less by retrieval accuracy and more by ... integration' therefore rests on unvalidated NLI labels, risking misattribution if the NLI model errs on medical terminology or multi-hop chains in HotpotQA.

Authors: We acknowledge that the manuscript would benefit from explicit quantitative reporting. We will add in the revision comprehensive tables detailing the frequency of each failure mode (evidence absence, misalignment, and prior-driven overrides) for all models and both datasets under the three generation modes. Statistical comparisons, such as paired t-tests or chi-squared tests, will be included to support the differences observed. We will also discuss the potential for NLI errors on domain-specific terms and multi-hop reasoning, while emphasizing how the mode comparisons help isolate integration effects from retrieval issues. revision: yes
Referee: Section 5.1 (NLI Faithfulness Scoring): The manuscript does not analyze or ablate the NLI model's accuracy on partial entailment, negation, or implicit inference, nor does it compare against human token-level usage in the generated outputs. This directly affects whether the reported 'evidence override' patterns reflect true generation behavior or scoring artifacts.

Authors: We agree that validating the NLI component is crucial. The revised manuscript will incorporate an analysis of the NLI model's accuracy, including ablations or evaluations on examples with partial entailment, negation, and implicit inferences, benchmarked against human annotations. Additionally, we will perform a comparison of NLI-based faithfulness scores with human assessments of token-level evidence usage in a sample of generated outputs. These additions will help rule out scoring artifacts and bolster confidence in the identified patterns. revision: yes
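Agreement between NLI labels and human annotations, as promised in the responses above, is often summarized with Cohen's kappa. The sketch below is a plain-Python version; the choice of kappa and the label names are assumptions, since the rebuttal does not name a specific agreement measure.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only.
human = ["entail", "entail", "contradict", "neutral"]
nli = ["entail", "contradict", "contradict", "neutral"]
kappa = cohens_kappa(human, nli)
```

Reporting kappa separately for partial entailment, negation, and implicit inference examples would show where the NLI model is least reliable.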

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper introduces an explicit facet-decomposition procedure and NLI-based Facet x Chunk matrix, applies these to public benchmarks (HotpotQA, medical QA), and contrasts three predefined inference modes (Strict RAG, Soft RAG, LLM-only) whose definitions do not reference the target conclusions. No parameters are fitted to the reported hallucination or misalignment statistics, no load-bearing self-citations justify the core distinctions, and the observed patterns are presented as empirical outcomes rather than identities or renamings of the inputs. The framework therefore remains independent of its own results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard NLP assumptions about question decomposition and NLI faithfulness measurement; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)

domain assumption: Natural language inference models provide reliable faithfulness scores between generated text and retrieved evidence chunks. Used to populate the grounding dimension of the Facet x Chunk matrix.
domain assumption: Decomposing input questions into atomic reasoning facets preserves the structure needed to diagnose evidence usage. Foundational step for constructing the diagnostic matrix.

pith-pipeline@v0.9.0 · 5559 in / 1345 out tokens · 31228 ms · 2026-05-10T17:56:29.868503+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets... constructs a structured Facet×Chunk evidence matrix that combines retrieval relevance with natural language inference–based faithfulness scores."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Evidence Override emerges as the dominant failure mode at 28.4%... Evidence Failure (7.0%)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1]

A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol., 15(3). Hung-Ting Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2022. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. arXiv preprint arXiv:2210.13701. Florin Cuconasu, Giovanni Trappolini, Feder...

[2]

Hanane Djeddal, Pierre Erbacher, Raouf Toukal, Laure Soulier, Karen Pinel-Sauvagnat, Sophia Katrenko, and Lynda Tamine

The power of noise: Redefining retrieval for rag systems. arXiv preprint arXiv:2401.14887. Hanane Djeddal, Pierre Erbacher, Raouf Toukal, Laure Soulier, Karen Pinel-Sauvagnat, Sophia Katrenko, and Lynda Tamine. 2024. An evaluation framework for attributed information retrieval using large language models. In Proceedings of the 33rd ACM International Confe...

[3]

Ragbench: Explainable benchmark for retrieval-augmented generation systems. arXiv preprint arXiv:2407.11005,

Ragbench: Explainable benchmark for retrieval-augmented generation systems. CoRR, abs/2407.11005. Ruiliu Fu, Han Wang, Xuejun Zhang, Jun Zhou, and Yonghong Yan. 2021. Decomposing complex questions makes multi-hop QA easier and more interpretable. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 169–180, Punta Cana, Domin...

[4]

In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 129–136, Bilbao, Spain

Context or retrieval? evaluating RAG methods for art and museum QA system. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 129–136, Bilbao, Spain. Association for Computational Linguistics. Keonwoo Roh, Yeong-Joon Ju, and Seong-Whan Lee

[5]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

XLQA: A benchmark for locale-aware multilingual open-domain question answering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28797–28809, Suzhou, China. Association for Computational Linguistics. Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoy...

[6]

Evidence Overridden

Stitch it in time: Gnn-based prediction of out-of-distribution questions in stackoverflow. arXiv preprint arXiv:2306.16655. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Context versus prior knowledge in language models. arXiv preprint arXiv:2306.04757. Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennigho...
Bridge Questions Prompt: Facet Decomposition Prompt

Task: Convert this bridge question into reasoning steps (facets).

Example 1:
Question: What nationality is the director of the film Masked and Anonymous?
Supporting Facts: [["Masked and Anonymous", 0], ["Larry Charles", 0]]
Facets:
1. Who directed the film Masked and Anonymous?
2. What is Larry Charles’s nationality?

Example 2:
Question: What year was the director of Blade Runner born?
Supporting Facts: [["Blade Runner", 1], ["Ridley Scott", 0]]
Facets:
1. Who directed Blade Runner?
2. When was Ridley Scott born?

Now convert this:
Question: [INPUT_QUESTION]
Supporting Facts: [INPUT_FACTS]
Facets:

Comparison Questions Prompt: Facet Decomposition Prompt

Task: Convert this comparison question into reasoning steps (facets).

Example 1:
Question: Who was born first, Arthur Conan Doyle or Artur Schnitzler?
Supporting Facts: [["Arthur Conan Doyle", 0], ["Artur Schnitzler", 0]]
Facets:
1. When was Arthur Conan Doyle born?
2. When was Artur Schnitzler born?

Example 2:
Question: Which has more species, genus A or genus B?
Supporting Facts: [["Genus A", 0], ["Genus B", 0]]
Facets:
1. How many species are in genus A?
2. How many species are in genus B?

Now convert this:
Question: [INPUT_QUESTION]
Supporting Facts: [INPUT_FACTS]
Facets:

B.3 Facet-Level Answer Generation Prompts

We generate answers for each reasoning facet under three controlled inference modes:

B.3.1 Strict RAG Prompt

For facet-level generation with strict evidence grounding: Strict RAG Facet Generation Sys...
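The decomposition prompts above end with `[INPUT_QUESTION]` and `[INPUT_FACTS]` placeholders. A minimal sketch of filling them is given here; `fill_template` is a hypothetical helper, the template is abbreviated (the few-shot examples are omitted), and serializing the supporting facts as JSON is an assumption based on how the examples are written.

```python
import json
from typing import List, Union

# Abbreviated template: the few-shot examples from the prompt are omitted here.
BRIDGE_TEMPLATE = (
    "Task: Convert this bridge question into reasoning steps (facets).\n"
    "Now convert this:\n"
    "Question: [INPUT_QUESTION]\n"
    "Supporting Facts: [INPUT_FACTS]\n"
    "Facets:"
)


def fill_template(
    question: str,
    supporting_facts: List[List[Union[str, int]]],
    template: str = BRIDGE_TEMPLATE,
) -> str:
    """Substitute the two placeholders; facts are serialized as a JSON list of [title, index] pairs."""
    facts_text = json.dumps(supporting_facts)
    return template.replace("[INPUT_QUESTION]", question).replace("[INPUT_FACTS]", facts_text)


prompt = fill_template(
    "What year was the director of Blade Runner born?",
    [["Blade Runner", 1], ["Ridley Scott", 0]],
)
```

The same helper works for the comparison prompt by passing a different `template` string.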