Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG
Pith reviewed 2026-05-21 08:52 UTC · model grok-4.3
The pith
RAG hallucinations stem mainly from how evidence is integrated during generation rather than from retrieval failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
What carries the argument
A facet-level diagnostics framework. It builds a Facet x Chunk matrix that combines retrieval relevance with NLI-based faithfulness scores, then compares Strict RAG, Soft RAG, and LLM-only modes to trace how evidence is used.
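A minimal sketch of how such a matrix might be assembled, assuming each cell stores a retrieval relevance score and an NLI-style faithfulness score for one (facet, chunk) pair. The names `Cell`, `build_matrix`, `best_support`, the product used to combine the two scores, and the toy `token_overlap` scorer are illustrative assumptions, not the paper's implementation; in practice the two scorers would wrap a retriever and an NLI model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A scorer maps (facet text or facet answer, chunk text) to a score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass(frozen=True)
class Cell:
    relevance: float     # retrieval relevance of the chunk for the facet
    faithfulness: float  # NLI-style entailment score for the same pair

    @property
    def support(self) -> float:
        # Hypothetical combination rule: a chunk supports a facet only
        # if it is both relevant and entailing, hence the product.
        return self.relevance * self.faithfulness


def build_matrix(facets: Sequence[str], chunks: Sequence[str],
                 relevance_fn: Scorer, faithfulness_fn: Scorer) -> List[List[Cell]]:
    """Rows are facets, columns are chunks."""
    return [[Cell(relevance_fn(f, c), faithfulness_fn(f, c)) for c in chunks]
            for f in facets]


def best_support(matrix: List[List[Cell]]) -> List[Tuple[int, float]]:
    """For each facet, return (index of best chunk, its support score)."""
    result = []
    for row in matrix:
        j = max(range(len(row)), key=lambda k: row[k].support)
        result.append((j, row[j].support))
    return result


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in scorer: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


facets = ["who directed blade runner", "when was ridley scott born"]
chunks = ["ridley scott directed blade runner", "ridley scott was born in 1937"]
matrix = build_matrix(facets, chunks, token_overlap, token_overlap)
best = best_support(matrix)
```

Keeping relevance and faithfulness as separate fields, rather than storing only the combined score, makes it possible to tell a relevant but non-entailing chunk apart from an irrelevant one.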
If this is right
- Relevant evidence is often retrieved but not correctly integrated, producing hallucinations despite good retrieval.
- Answer-level accuracy metrics overlook systematic facet-level evidence misalignment and overrides.
- Controlled comparisons of strict-evidence, mixed, and no-retrieval modes isolate where generation diverges from available evidence.
- Recurring failure modes including evidence absence, misalignment, and prior-driven overrides appear across open- and closed-source LLMs.
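The failure modes named in the last point can be pictured as a small decision rule over per-facet scores. The sketch below is one hypothetical way to assign them: the field names, the two thresholds, and the ordering of the checks are assumptions made for illustration, and the paper may define the categories differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetObservation:
    max_relevance: float        # best retrieval relevance over all chunks
    strict_faithfulness: float  # faithfulness of the Strict RAG answer
    soft_faithfulness: float    # faithfulness of the Soft RAG answer


def classify_facet(obs: FacetObservation,
                   rel_threshold: float = 0.5,
                   faith_threshold: float = 0.5) -> str:
    """Assign one label per facet (hypothetical rule, not the paper's)."""
    if obs.max_relevance < rel_threshold:
        # Nothing relevant was retrieved for this facet.
        return "evidence_absence"
    if obs.strict_faithfulness < faith_threshold:
        # Relevant evidence exists, but even the evidence-only answer
        # is not entailed by it.
        return "evidence_misalignment"
    if obs.soft_faithfulness < faith_threshold:
        # Grounded when forced to use evidence, ungrounded once
        # parametric knowledge is allowed.
        return "prior_driven_override"
    return "grounded"


labels = [
    classify_facet(FacetObservation(0.2, 0.9, 0.9)),
    classify_facet(FacetObservation(0.8, 0.3, 0.9)),
    classify_facet(FacetObservation(0.8, 0.9, 0.1)),
    classify_facet(FacetObservation(0.8, 0.9, 0.9)),
]
```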
Where Pith is reading between the lines
- RAG improvements may need to target evidence-integration steps inside the generator rather than retrieval alone.
- The matrix approach could be adapted to trace uncertainty in other generation tasks beyond question answering.
- Practitioners could apply facet diagnostics to audit and refine models for lower hallucination rates in medical or factual domains.
Load-bearing premise
Decomposing questions into atomic reasoning facets and measuring grounding via NLI-based faithfulness scores on a Facet x Chunk matrix accurately captures how evidence is actually used or ignored during generation.
What would settle it
A test showing that models still hallucinate at high rates even when the Facet x Chunk matrix records high faithfulness and sufficiency for every facet would indicate that integration is not the main driver.
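Such a test could be computed roughly as follows. This is a sketch under stated assumptions: the `AnswerRecord` fields, the 0.8 threshold, and the availability of a per-answer hallucination label are all hypothetical and not taken from the paper.

```python
from typing import List, NamedTuple, Optional, Sequence


class AnswerRecord(NamedTuple):
    facet_faithfulness: List[float]  # one score per facet
    facet_sufficiency: List[float]   # one score per facet
    hallucinated: bool               # hypothetical gold label for the answer


def hallucination_rate_when_grounded(records: Sequence[AnswerRecord],
                                     threshold: float = 0.8) -> Optional[float]:
    """Hallucination rate among answers whose every facet scores high.

    Returns None when no answer passes the filter.
    """
    grounded = [
        r for r in records
        if r.facet_faithfulness  # skip answers with no facets
        and all(s >= threshold for s in r.facet_faithfulness)
        and all(s >= threshold for s in r.facet_sufficiency)
    ]
    if not grounded:
        return None
    return sum(r.hallucinated for r in grounded) / len(grounded)


records = [
    AnswerRecord([0.9, 0.95], [0.9, 0.85], True),
    AnswerRecord([0.9, 0.9], [0.9, 0.9], False),
    AnswerRecord([0.9, 0.4], [0.9, 0.9], True),  # filtered out: one weak facet
]
rate = hallucination_rate_when_grounded(records)
```

A high rate on the filtered subset would mean that answers go wrong even when the matrix reports every facet as well supported.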
Original abstract
Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a facet-level diagnostics framework for QA in RAG systems. It decomposes input questions into atomic reasoning facets and uses a structured Facet x Chunk matrix combining retrieval relevance with NLI-based faithfulness scores to assess evidence sufficiency and grounding. By analyzing three controlled inference modes—Strict RAG, Soft RAG, and LLM-only generation—across medical QA and HotpotQA datasets with models including GPT, Gemini, and LLaMA, the paper claims that hallucinations are primarily driven by failures in integrating retrieved evidence during generation rather than by retrieval accuracy alone, with facet-level analysis revealing systematic patterns of evidence override and misalignment hidden in answer-level evaluations.
Significance. If the framework's measurements hold, this work provides a more granular and interpretable way to diagnose RAG failures, shifting focus from retrieval to generation integration. The use of controlled comparisons across open- and closed-source models and two distinct datasets (medical QA and HotpotQA) is a strength, as is the identification of recurring failure modes such as evidence absence, misalignment, and prior-driven overrides. This could lead to better RAG designs if the facet decomposition and NLI proxies are validated.
major comments (3)
- [Facet x Chunk matrix construction] The diagnosis of retrieval-generation misalignment depends on the Facet x Chunk matrix correctly capturing evidence usage via NLI faithfulness scores. However, NLI entailment is a post-hoc proxy that may not reflect actual integration during generation; it risks conflating parametric knowledge leakage or coincidental matches with true grounding, especially under Soft RAG or long chunks. Without supporting evidence such as attention tracing or targeted ablations, the interpretation of 'prior-driven overrides' and 'evidence misalignment' remains suggestive rather than demonstrated. This assumption is central to the paper's main claim.
- [Definition of inference modes] The paper contrasts Strict RAG (enforcing exclusive reliance on retrieved evidence) with Soft RAG (allowing parametric knowledge integration). The manuscript should specify the prompting or implementation details used to enforce 'exclusive reliance' in Strict RAG, as ambiguity here could confound the misalignment measurements.
- [Facet decomposition] The validity of the entire analysis hinges on the rules for decomposing questions into atomic reasoning facets. The manuscript does not appear to provide explicit criteria or examples for this decomposition, nor does it report inter-annotator agreement or sensitivity analysis, which are necessary to ensure the facets are reproducible and not arbitrary.
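To make the second comment concrete, the sketch below shows what a mode-specific prompt builder could look like. The prompt wording, the `build_prompt` function, and the mode identifiers are hypothetical; they are not the templates used in the manuscript and serve only to show where ambiguity in "exclusive reliance" can enter.

```python
from typing import Sequence

MODES = ("strict_rag", "soft_rag", "llm_only")


def build_prompt(mode: str, facet: str, chunks: Sequence[str] = ()) -> str:
    """Return a prompt for one facet under one inference mode (hypothetical wording)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "llm_only":
        # No retrieval: the model sees only the facet question.
        return f"Answer the question.\nQuestion: {facet}\nAnswer:"
    if not chunks:
        raise ValueError(f"{mode} requires at least one retrieved chunk")
    evidence = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    if mode == "strict_rag":
        rule = ("Answer using ONLY the evidence below. "
                "If the evidence is insufficient, reply 'insufficient evidence'.")
    else:  # soft_rag
        rule = "Answer using the evidence below; you may also use your own knowledge."
    return f"{rule}\nEvidence:\n{evidence}\nQuestion: {facet}\nAnswer:"


question = "Who directed Blade Runner?"
retrieved = ["Ridley Scott directed Blade Runner."]
strict_prompt = build_prompt("strict_rag", question, retrieved)
soft_prompt = build_prompt("soft_rag", question, retrieved)
llm_only_prompt = build_prompt("llm_only", question, retrieved)
```

Note that an instruction like the one above only requests exclusive reliance on evidence. Whether the model complies is exactly what the misalignment measurements would need to establish.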
minor comments (3)
- [Notation] The notation for the Facet x Chunk matrix should be formalized with equations to improve clarity and reproducibility.
- [Related work] Consider citing additional works on hallucination detection in RAG, such as those using attention mechanisms or factuality metrics.
- [Figures] Ensure that visualizations of the Facet x Chunk matrix are legible and include legends explaining the color scales for relevance and faithfulness scores.
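One way the notation comment could be addressed is sketched below. The symbols are hypothetical: facets $f_1, \dots, f_m$, chunks $c_1, \dots, c_n$, the generated answer $a_i$ for facet $f_i$, a relevance score $r_{ij}$, an NLI entailment probability $e_{ij}$, a per-facet support score $s_i$, and a threshold $\tau$. The product rule for combining the two scores is likewise an assumption, not the paper's definition.

```latex
% Hypothetical notation; r, e, s, and \tau are illustrative, not the paper's.
\[
  M \in \left([0,1] \times [0,1]\right)^{m \times n}, \qquad
  M_{ij} = \left(r_{ij},\, e_{ij}\right),
\]
\[
  r_{ij} = \mathrm{rel}(f_i, c_j), \qquad
  e_{ij} = P_{\mathrm{NLI}}\!\left(\text{entailment} \mid c_j,\, a_i\right),
\]
\[
  s_i = \max_{1 \le j \le n} r_{ij}\, e_{ij}, \qquad
  f_i \text{ is treated as grounded iff } s_i \ge \tau .
\]
```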
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and describe the revisions we intend to make to improve clarity and strengthen the evidence for our claims.
Point-by-point responses
- Referee: [Facet x Chunk matrix construction] The diagnosis of retrieval-generation misalignment depends on the Facet x Chunk matrix correctly capturing evidence usage via NLI faithfulness scores. However, NLI entailment is a post-hoc proxy that may not reflect actual integration during generation; it risks conflating parametric knowledge leakage or coincidental matches with true grounding, especially under Soft RAG or long chunks. Without supporting evidence such as attention tracing or targeted ablations, the interpretation of 'prior-driven overrides' and 'evidence misalignment' remains suggestive rather than demonstrated. This assumption is central to the paper's main claim.
  Authors: We recognize that NLI faithfulness scores provide an indirect measure of evidence grounding and do not directly observe the model's generation process. This proxy approach is widely used in hallucination detection literature, and our controlled inference modes (Strict vs. Soft RAG) are designed to highlight differences attributable to integration failures. Nevertheless, we agree that additional validation is valuable. In the revised version, we will add a dedicated limitations subsection acknowledging the proxy limitations and potential for coincidental matches. We will also perform and report a targeted ablation on a subset of the data where we compare NLI scores against human annotations of grounding to quantify their alignment. Attention tracing is not applicable to proprietary models like GPT and Gemini, but for open-source LLaMA we can include preliminary attention analysis in the appendix if space permits.
  Revision: partial
- Referee: [Definition of inference modes] The paper contrasts Strict RAG (enforcing exclusive reliance on retrieved evidence) with Soft RAG (allowing parametric knowledge integration). The manuscript should specify the prompting or implementation details used to enforce 'exclusive reliance' in Strict RAG, as ambiguity here could confound the misalignment measurements.
  Authors: We thank the referee for highlighting the need for greater specificity. In the revised manuscript, we will include the exact prompting templates and implementation details used to define the Strict RAG mode, ensuring that the enforcement of exclusive reliance on retrieved evidence is transparent and reproducible.
  Revision: yes
- Referee: [Facet decomposition] The validity of the entire analysis hinges on the rules for decomposing questions into atomic reasoning facets. The manuscript does not appear to provide explicit criteria or examples for this decomposition, nor does it report inter-annotator agreement or sensitivity analysis, which are necessary to ensure the facets are reproducible and not arbitrary.
  Authors: We agree that the rules for decomposing questions into atomic reasoning facets require clearer documentation to ensure reproducibility. In the revised manuscript, we will provide explicit criteria and additional examples for the decomposition process. We will also include inter-annotator agreement metrics from our annotation procedure and a sensitivity analysis to demonstrate robustness to different facet definitions.
  Revision: yes
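The agreement checks promised in these responses (NLI labels against human grounding annotations, and agreement between annotators) are commonly summarized with Cohen's kappa. The sketch below is a plain-Python version for two label sequences; the choice of kappa and the example labels are assumptions, since the rebuttal does not say which agreement metric will be reported.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical example: 1 = "facet grounded", 0 = "facet not grounded".
nli_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohen_kappa(nli_labels, human_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example "grounded") dominates the data.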
Circularity Check
No significant circularity; standard components applied to external benchmarks
full rationale
The paper introduces a facet-level framework that decomposes questions into atomic facets and constructs a Facet x Chunk matrix using retrieval relevance plus NLI faithfulness scores, then compares Strict RAG, Soft RAG, and LLM-only modes on HotpotQA and medical QA. No equations, fitted parameters, or self-citations are shown that reduce the reported misalignment patterns or central claim (integration failures dominate over retrieval) to definitions or inputs internal to the paper. The diagnostics remain independent of the claims and rely on external data and off-the-shelf NLI/retrieval tools.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Questions can be decomposed into atomic reasoning facets that preserve essential information for evidence checking.
- domain assumption: Natural language inference scores between facets and chunks provide a reliable measure of faithfulness and grounding.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We introduce a facet-level diagnostics framework... Facet×Chunk matrix that combines retrieval relevance with natural language inference–based faithfulness scores... three controlled inference modes: Strict RAG, Soft RAG, and LLM-only
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Evidence Override emerges as the dominant failure mode at 28.4%... Evidence Helpful (38.1%)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Facet decomposition prompts (paper appendix, as extracted)

Bridge Questions Prompt: Facet Decomposition Prompt
Task: Convert this bridge question into reasoning steps (facets).
Example 1:
Question: What nationality is the director of the film Masked and Anonymous?
Supporting Facts: [["Masked and Anonymous", 0], ["Larry Charles", 0]]
Facets:
1. Who directed the film Masked and Anonymous?
2. What is Larry Charles's nationality?
Example 2:
Question: What year was the director of Blade Runner born?
Supporting Facts: [["Blade Runner", 1], ["Ridley Scott", 0]]
Facets:
1. Who directed Blade Runner?
2. When was Ridley Scott born?
Now convert this:
Question: [INPUT_QUESTION]
Supporting Facts: [INPUT_FACTS]
Facets:

Comparison Questions Prompt: Facet Decomposition Prompt
Task: Convert this comparison question into reasoning steps (facets).
Example 1:
Question: Who was born first, Arthur Conan Doyle or Artur Schnitzler?
Supporting Facts: [["Arthur Conan Doyle", 0], ["Artur Schnitzler", 0]]
Facets:
1. When was Arthur Conan Doyle born?
2. When was Artur Schnitzler born?
Example 2:
Question: Which has more species, genus A or genus B?
Supporting Facts: [["Genus A", 0], ["Genus B", 0]]
Facets:
1. How many species are in genus A?
2. How many species are in genus B?
Now convert this:
Question: [INPUT_QUESTION]
Supporting Facts: [INPUT_FACTS]
Facets:

B.3 Facet-Level Answer Generation Prompts
We generate answers for each reasoning facet under three controlled inference modes:
B.3.1 Strict RAG Prompt
For facet-level generation with strict evidence grounding:
Strict RAG Facet Generation Sys...
Extracted fragment: "Give a short, direct answer in one or two sentences"