Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

Markus Schedl; Monorama Swain; Passant Elchafei; Shahed Masoudian

arxiv: 2604.09174 · v2 · pith:7QL2LDXCnew · submitted 2026-04-10 · 💻 cs.CL

Pith reviewed 2026-05-21 08:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords: retrieval-augmented generation, hallucination, facet-level analysis, evidence grounding, RAG evaluation, natural language inference, question answering

The pith

RAG hallucinations stem mainly from how evidence is integrated during generation rather than retrieval failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a facet-level diagnostics framework for QA in RAG systems that decomposes each question into atomic reasoning facets. It builds a Facet x Chunk matrix that pairs retrieval relevance scores with NLI-based faithfulness measures to check evidence sufficiency and grounding. Comparing three inference modes (Strict RAG, which relies only on retrieved evidence; Soft RAG, which mixes evidence with parametric knowledge; and LLM-only generation without retrieval) exposes cases of retrieval-generation misalignment. Analysis on medical QA and HotpotQA with GPT, Gemini, and LLaMA models reveals recurring patterns, such as evidence override and prior-driven overrides, that answer-level checks miss. The work concludes that hallucinations are driven less by retrieval accuracy than by how retrieved evidence is integrated during generation.

Core claim

Hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

What carries the argument

The facet-level diagnostics framework that uses a Facet x Chunk matrix combining retrieval relevance with NLI-based faithfulness scores across Strict RAG, Soft RAG, and LLM-only modes to trace evidence usage.
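
As a rough illustration of the data structure, a Facet x Chunk matrix can be held as one row per facet and one column per chunk, with each cell carrying a relevance score and a faithfulness score. The sketch below is a toy under stated assumptions, not the paper's implementation: `toy_relevance` (token overlap) stands in for a real retriever score, `toy_faithfulness` stands in for an NLI model, and the `Facet`, `Cell`, and `build_matrix` names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass(frozen=True)
class Facet:
    question: str  # atomic sub-question
    claim: str     # facet answer phrased as a statement (hypothetical field)

@dataclass(frozen=True)
class Cell:
    relevance: float     # how related the chunk is to the facet question
    faithfulness: float  # how strongly the chunk supports the facet claim

Scorer = Callable[[str, str], float]

def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_relevance(question: str, chunk: str) -> float:
    """Jaccard token overlap; a placeholder for a real retrieval score."""
    q, c = _tokens(question), _tokens(chunk)
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def toy_faithfulness(claim: str, chunk: str) -> float:
    """Fraction of claim tokens found in the chunk; a placeholder for NLI entailment."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(chunk)) / len(claim_tokens)

def build_matrix(
    facets: Sequence[Facet],
    chunks: Sequence[str],
    relevance_fn: Scorer = toy_relevance,
    faithfulness_fn: Scorer = toy_faithfulness,
) -> List[List[Cell]]:
    """Rows are facets, columns are chunks."""
    return [
        [Cell(relevance_fn(f.question, c), faithfulness_fn(f.claim, c)) for c in chunks]
        for f in facets
    ]

facets = [Facet("Who directed Blade Runner?", "Ridley Scott directed Blade Runner")]
chunks = [
    "Blade Runner is a 1982 film directed by Ridley Scott",
    "Ridley Scott was born in 1937",
]
matrix = build_matrix(facets, chunks)
```

Passing the scorers as parameters keeps the matrix construction separate from the choice of retriever and NLI model, so either stand-in can be swapped for a real component without touching `build_matrix`.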

If this is right

Relevant evidence is often retrieved but not correctly integrated, producing hallucinations despite good retrieval.
Answer-level accuracy metrics overlook systematic facet-level evidence misalignment and overrides.
Controlled comparisons of strict-evidence, mixed, and no-retrieval modes isolate where generation diverges from available evidence.
Recurring failure modes including evidence absence, misalignment, and prior-driven overrides appear across open- and closed-source LLMs.
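
The mode comparison in the points above can be pictured as a small decision procedure over per-facet outcomes. The sketch below is illustrative only: the label strings echo terms used on this page, but the decision rules, the `FacetOutcome` fields, and the `diagnose` function are assumptions made for this example and are not the paper's definitions.

```python
from collections import Counter
from typing import NamedTuple

class FacetOutcome(NamedTuple):
    evidence_supports: bool  # some retrieved chunk supports the reference facet answer
    strict_correct: bool     # Strict RAG answer matches the reference
    soft_correct: bool       # Soft RAG answer matches the reference
    llm_only_correct: bool   # LLM-only answer matches the reference

def diagnose(o: FacetOutcome) -> str:
    """Map one facet's outcomes to a label using hypothetical rules."""
    if not o.evidence_supports:
        return "evidence_absence"        # nothing usable was retrieved
    if not o.strict_correct:
        return "evidence_misalignment"   # evidence was present but the evidence-only mode missed it
    if not o.soft_correct:
        return "prior_driven_override"   # allowing parametric knowledge flipped a grounded answer
    if not o.llm_only_correct:
        return "evidence_helpful"        # retrieval fixed an answer the model alone got wrong
    return "robust"                      # correct in all three modes

outcomes = [
    FacetOutcome(False, False, False, False),
    FacetOutcome(True, False, True, True),
    FacetOutcome(True, True, False, False),
    FacetOutcome(True, True, True, False),
    FacetOutcome(True, True, True, True),
]
tally = Counter(diagnose(o) for o in outcomes)
```

Tallying labels per facet rather than per answer is what lets a question count as correct overall while still contributing a non-robust facet to the distribution.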

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG improvements may need to target evidence-integration steps inside the generator rather than retrieval alone.
The matrix approach could be adapted to trace uncertainty in other generation tasks beyond question answering.
Practitioners could apply facet diagnostics to audit and refine models for lower hallucination rates in medical or factual domains.

Load-bearing premise

Decomposing questions into atomic reasoning facets and measuring grounding via NLI-based faithfulness scores on a Facet x Chunk matrix accurately captures how evidence is actually used or ignored during generation.

What would settle it

A test showing that models still hallucinate at high rates even when the Facet x Chunk matrix records high faithfulness and sufficiency for every facet would indicate that integration is not the main driver.
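
A minimal sketch of how such a test could be run, assuming each question already has one grounding score per facet and a binary hallucination judgment. The record layout, the single combined score (standing in for both faithfulness and sufficiency), and the 0.8 threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass(frozen=True)
class QuestionRecord:
    facet_scores: Sequence[float]  # best grounding score per facet (hypothetical)
    hallucinated: bool             # answer-level hallucination judgment

def is_fully_grounded(record: QuestionRecord, threshold: float) -> bool:
    """True when the question has facets and every facet clears the threshold."""
    return len(record.facet_scores) > 0 and all(s >= threshold for s in record.facet_scores)

def hallucination_rate(records: Sequence[QuestionRecord]) -> Optional[float]:
    """Share of hallucinated answers, or None for an empty group."""
    if not records:
        return None
    return sum(r.hallucinated for r in records) / len(records)

def split_rates(
    records: Sequence[QuestionRecord], threshold: float = 0.8
) -> Tuple[Optional[float], Optional[float]]:
    """Hallucination rate for fully grounded questions versus all others."""
    grounded: List[QuestionRecord] = [r for r in records if is_fully_grounded(r, threshold)]
    other: List[QuestionRecord] = [r for r in records if not is_fully_grounded(r, threshold)]
    return hallucination_rate(grounded), hallucination_rate(other)

records = [
    QuestionRecord((0.9, 0.95), False),
    QuestionRecord((0.9, 0.85), True),
    QuestionRecord((0.9, 0.2), True),
    QuestionRecord((0.1,), False),
    QuestionRecord((0.3, 0.4), True),
]
grounded_rate, other_rate = split_rates(records)
```

If the rate for the fully grounded group stayed close to the rate for the rest on real data, that would be the kind of result described above as counting against the integration claim.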

Figures

Figures reproduced from arXiv: 2604.09174 by Markus Schedl, Monorama Swain, Passant Elchafei, Shahed Masoudian.

**Figure 2.** HotpotQA: Evidence Taxonomy Distribution

**Figure 3.** Medical Dataset: Facet semantic type × evidence taxonomy distribution. Boolean and Temporal facets show highest failure rates. Comparative facets are most unstable with highest misalignment and lowest robust rates. Misalignment nearly vanishes (0.7% versus 7.5% medical), confirming Wikipedia’s broad coverage provides better retrieval recall. The consistent 7:1 override-to-failure ratio across datasets (42…

**Figure 4.** HotpotQA: Facet Semantic Type × Evidence Taxonomy Distribution. Boolean and Temporal facets show lowest failure rates. Override rates are consistently high across all types. Comparative facets remain most unstable in both datasets. quality: models unpredictably either incorporate or contradict retrieved evidence. Detailed distributions in Appendix D

**Figure 5.** Illustrative facet-level diagnostic example

**Figure 6.** Medical Dataset: Facet-level faithfulness dis

**Figure 7.** Medical Dataset: Per-question ∆F1 distributions (Soft − Strict) by model
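
For readers unfamiliar with the quantity in the Figure 7 caption, a per-question ∆F1 (Soft − Strict) can be computed as the difference between two answer F1 scores against the same reference. The sketch below assumes a token-overlap F1 of the kind commonly used in extractive QA evaluation; this page does not say which F1 variant the paper uses, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List

def _tok(text: str) -> List[str]:
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (assumed F1 variant)."""
    pred, ref = _tok(prediction), _tok(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def delta_f1(soft_answer: str, strict_answer: str, reference: str) -> float:
    """Positive when the Soft RAG answer scores higher than the Strict RAG answer."""
    return token_f1(soft_answer, reference) - token_f1(strict_answer, reference)
```

A positive value means allowing parametric knowledge helped on that question, and a negative value means it hurt.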

Original abstract

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The facet matrix and mode comparisons give a workable diagnostic for RAG failures, but NLI scores remain a loose proxy for actual evidence use.

The useful part is the controlled setup that splits questions into facets, builds a relevance-plus-NLI matrix per chunk, and runs the same inputs under strict RAG, soft RAG, and LLM-only conditions. That lets them separate retrieval misses from cases where relevant chunks are retrieved but ignored or overridden during generation. On HotpotQA and medical QA they show this pattern across GPT, Gemini, and LLaMA, and the differences between modes make the integration claim more concrete than standard answer-level scores do.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces a facet-level diagnostics framework for QA in RAG systems. It decomposes input questions into atomic reasoning facets and uses a structured Facet x Chunk matrix combining retrieval relevance with NLI-based faithfulness scores to assess evidence sufficiency and grounding. By analyzing three controlled inference modes (Strict RAG, Soft RAG, and LLM-only generation) across medical QA and HotpotQA datasets with models including GPT, Gemini, and LLaMA, the paper claims that hallucinations are primarily driven by failures in integrating retrieved evidence during generation rather than by retrieval accuracy alone, with facet-level analysis revealing systematic patterns of evidence override and misalignment hidden in answer-level evaluations.

Significance. If the framework's measurements hold, this work provides a more granular and interpretable way to diagnose RAG failures, shifting focus from retrieval to generation integration. The use of controlled comparisons across open- and closed-source models and two distinct datasets (medical QA and HotpotQA) is a strength, as is the identification of recurring failure modes such as evidence absence, misalignment, and prior-driven overrides. This could lead to better RAG designs if the facet decomposition and NLI proxies are validated.

major comments (3)

[Facet x Chunk matrix construction] The diagnosis of retrieval-generation misalignment depends on the Facet x Chunk matrix correctly capturing evidence usage via NLI faithfulness scores. However, NLI entailment is a post-hoc proxy that may not reflect actual integration during generation; it risks conflating parametric knowledge leakage or coincidental matches with true grounding, especially under Soft RAG or long chunks. Without supporting evidence such as attention tracing or targeted ablations, the interpretation of 'prior-driven overrides' and 'evidence misalignment' remains suggestive rather than demonstrated. This assumption is central to the paper's main claim.
[Definition of inference modes] The paper contrasts Strict RAG (enforcing exclusive reliance on retrieved evidence) with Soft RAG (allowing parametric knowledge integration). The manuscript should specify the prompting or implementation details used to enforce 'exclusive reliance' in Strict RAG, as ambiguity here could confound the misalignment measurements.
[Facet decomposition] The validity of the entire analysis hinges on the rules for decomposing questions into atomic reasoning facets. The manuscript does not appear to provide explicit criteria or examples for this decomposition, nor does it report inter-annotator agreement or sensitivity analysis, which are necessary to ensure the facets are reproducible and not arbitrary.

minor comments (3)

[Notation] The notation for the Facet x Chunk matrix should be formalized with equations to improve clarity and reproducibility.
[Related work] Consider citing additional works on hallucination detection in RAG, such as those using attention mechanisms or factuality metrics.
[Figures] Ensure that visualizations of the Facet x Chunk matrix are legible and include legends explaining the color scales for relevance and faithfulness scores.
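
One possible shape for the formalization requested in the notation comment is sketched here. Every symbol (facets f_i, chunks c_j, scores r_ij and s_ij, thresholds τ_r and τ_s) is an assumption introduced for illustration and is not taken from the manuscript.

```latex
% Hypothetical notation; not taken from the manuscript.
% m facets f_1, ..., f_m and n retrieved chunks c_1, ..., c_n.
\[
  M \in \bigl([0,1] \times [0,1]\bigr)^{m \times n}, \qquad
  M_{ij} = \bigl(r_{ij},\, s_{ij}\bigr),
\]
\[
  r_{ij} = \operatorname{rel}(f_i, c_j), \qquad
  s_{ij} = P_{\mathrm{NLI}}\bigl(\text{entailment} \mid c_j,\, f_i\bigr),
\]
\[
  \operatorname{sufficient}(f_i) \iff
  \exists\, j :\; r_{ij} \ge \tau_r \;\wedge\; s_{ij} \ge \tau_s .
\]
```

Under this reading a facet counts as sufficiently evidenced when at least one chunk is both relevant and entailing. Other aggregation rules, such as a maximum over chunks followed by a single threshold, would be equally compatible with the description on this page.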

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and describe the revisions we intend to make to improve clarity and strengthen the evidence for our claims.

Referee: [Facet x Chunk matrix construction] The diagnosis of retrieval-generation misalignment depends on the Facet x Chunk matrix correctly capturing evidence usage via NLI faithfulness scores. However, NLI entailment is a post-hoc proxy that may not reflect actual integration during generation; it risks conflating parametric knowledge leakage or coincidental matches with true grounding, especially under Soft RAG or long chunks. Without supporting evidence such as attention tracing or targeted ablations, the interpretation of 'prior-driven overrides' and 'evidence misalignment' remains suggestive rather than demonstrated. This assumption is central to the paper's main claim.

Authors: We recognize that NLI faithfulness scores provide an indirect measure of evidence grounding and do not directly observe the model's generation process. This proxy approach is widely used in hallucination detection literature, and our controlled inference modes (Strict vs. Soft RAG) are designed to highlight differences attributable to integration failures. Nevertheless, we agree that additional validation is valuable. In the revised version, we will add a dedicated limitations subsection acknowledging the proxy limitations and potential for coincidental matches. We will also perform and report a targeted ablation on a subset of the data where we compare NLI scores against human annotations of grounding to quantify their alignment. Attention tracing is not applicable to proprietary models like GPT and Gemini, but for open-source LLaMA we can include preliminary attention analysis in the appendix if space permits.

Revision: partial

Referee: [Definition of inference modes] The paper contrasts Strict RAG (enforcing exclusive reliance on retrieved evidence) with Soft RAG (allowing parametric knowledge integration). The manuscript should specify the prompting or implementation details used to enforce 'exclusive reliance' in Strict RAG, as ambiguity here could confound the misalignment measurements.

Authors: We thank the referee for highlighting the need for greater specificity. In the revised manuscript, we will include the exact prompting templates and implementation details used to define the Strict RAG mode, ensuring that the enforcement of exclusive reliance on retrieved evidence is transparent and reproducible.

Revision: yes

Referee: [Facet decomposition] The validity of the entire analysis hinges on the rules for decomposing questions into atomic reasoning facets. The manuscript does not appear to provide explicit criteria or examples for this decomposition, nor does it report inter-annotator agreement or sensitivity analysis, which are necessary to ensure the facets are reproducible and not arbitrary.

Authors: We agree that the rules for decomposing questions into atomic reasoning facets require clearer documentation to ensure reproducibility. In the revised manuscript, we will provide explicit criteria and additional examples for the decomposition process. We will also include inter-annotator agreement metrics from our annotation procedure and a sensitivity analysis to demonstrate robustness to different facet definitions.

Revision: yes
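
The responses above promise agreement figures, both between annotators and between NLI labels and human grounding annotations, without naming a statistic. One common choice for two label sequences is Cohen's kappa, sketched below as an assumption about what such a report might contain; the function name and the label strings in the test data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

The same function covers both uses mentioned in the rebuttal: pass two annotators' labels for inter-annotator agreement, or pass thresholded NLI labels and human labels to check the proxy.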

Circularity Check

0 steps flagged

No significant circularity; standard components applied to external benchmarks

The paper introduces a facet-level framework that decomposes questions into atomic facets and constructs a Facet x Chunk matrix using retrieval relevance plus NLI faithfulness scores, then compares Strict RAG, Soft RAG, and LLM-only modes on HotpotQA and medical QA. No equations, fitted parameters, or self-citations are shown that reduce the reported misalignment patterns or central claim (integration failures dominate over retrieval) to definitions or inputs internal to the paper. The diagnostics remain independent of the claims and rely on external data and off-the-shelf NLI/retrieval tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that questions admit clean decomposition into independent atomic facets and that NLI faithfulness scores serve as a faithful proxy for evidence grounding; no free parameters or new invented entities are introduced in the abstract.

axioms (2)

domain assumption: Questions can be decomposed into atomic reasoning facets that preserve essential information for evidence checking.
Foundational to the entire Facet x Chunk analysis described in the abstract.

domain assumption: Natural language inference scores between facets and chunks provide a reliable measure of faithfulness and grounding.
Used to populate the diagnostic matrix and identify misalignment.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce a facet-level diagnostics framework... Facet×Chunk matrix that combines retrieval relevance with natural language inference–based faithfulness scores... three controlled inference modes: Strict RAG, Soft RAG, and LLM-only

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Evidence Override emerges as the dominant failure mode at 28.4%... Evidence Helpful (38.1%)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1]

A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol., 15(3). Hung-Ting Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2022. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. arXiv preprint arXiv:2210.13701. Florin Cuconasu, Giovanni Trappolini, Feder...

[2]

The power of noise: Redefining retrieval for rag systems

The power of noise: Redefining retrieval for rag systems. arXiv preprint arXiv:2401.14887. Hanane Djeddal, Pierre Erbacher, Raouf Toukal, Laure Soulier, Karen Pinel-Sauvagnat, Sophia Katrenko, and Lynda Tamine. 2024. An evaluation framework for attributed information retrieval using large language models. In Proceedings of the 33rd ACM International Confe...

[3]

Ruiliu Fu, Han Wang, Xuejun Zhang, Jun Zhou, and Yonghong Yan

Ragbench: Explainable benchmark for retrieval-augmented generation systems. CoRR, abs/2407.11005. Ruiliu Fu, Han Wang, Xuejun Zhang, Jun Zhou, and Yonghong Yan. 2021. Decomposing complex questions makes multi-hop QA easier and more interpretable. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 169–180, Punta Cana, Domin...

[4]

In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 129–136, Bilbao, Spain

Context or retrieval? evaluating RAG methods for art and museum QA system. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 129–136, Bilbao, Spain. Association for Computational Linguistics. Keonwoo Roh, Yeong-Joon Ju, and Seong-Whan Lee

[5]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

XLQA: A benchmark for locale-aware multilingual open-domain question answering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28797–28809, Suzhou, China. Association for Computational Linguistics. Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoy...

[6]

Evidence Overridden

Stitch it in time: Gnn-based prediction of out-of-distribution questions in stackoverflow. arXiv preprint arXiv:2306.16655. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Context versus prior knowledge in language models. arXiv preprint arXiv:2306.04757. Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennigho...

[7]

Masked and Anonymous

Bridge Questions Prompt: Facet Decomposition Prompt Task: Convert this bridge question into reasoning steps (facets). Example 1: Question: What nationality is the director of the film Masked and Anonymous? Supporting Facts: [["Masked and Anonymous", 0], ["Larry Charles", 0]] Facets:

[8]

Who directed the film Masked and Anonymous?

[9]

Blade Runner

What is Larry Charles’s nationality? Example 2: Question: What year was the director of Blade Runner born? Supporting Facts: [["Blade Runner", 1], ["Ridley Scott", 0]] Facets:

[10]

Who directed Blade Runner?

[11]

When was Ridley Scott born? Now convert this: Question: [INPUT_QUESTION] Supporting Facts: [INPUT_FACTS] Facets:

[12]

Arthur Conan Doyle

Comparison Questions Prompt: Facet Decomposition Prompt Task: Convert this comparison question into reasoning steps (facets). Example 1: Question: Who was born first, Arthur Conan Doyle or Artur Schnitzler? Supporting Facts: [["Arthur Conan Doyle", 0], ["Artur Schnitzler", 0]] Facets:

[13]

When was Arthur Conan Doyle born?

[14]

When was Artur Schnitzler born? Example 2: Question: Which has more species, genus A or genus B? Supporting Facts: [["Genus A", 0], ["Genus B", 0]] Facets:

[15]

How many species are in genus A?

[16]

Give a short, direct answer in one or two sentences

How many species are in genus B? Now convert this: Question: [INPUT_QUESTION] Supporting Facts: [INPUT_FACTS] Facets: B.3 Facet-Level Answer Generation Prompts We generate answers for each reasoning facet under three controlled inference modes: B.3.1 Strict RAG Prompt For facet-level generation with strict evidence grounding: Strict RAG Facet Generation Sys...
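
Entries [7] to [16] are fragments of the paper's facet decomposition prompts rather than bibliography items, and each prompt ends with the same fill-in tail containing `[INPUT_QUESTION]` and `[INPUT_FACTS]`. The sketch below shows one way that tail could be filled in. It is an assumption-based illustration: the line breaks in the template, the `fill_prompt` helper, and the JSON serialization of the supporting facts are choices made here, and the meaning of the integer in each supporting fact is not stated on this page.

```python
import json
from typing import Sequence, Tuple

# Tail of the decomposition prompt as quoted above; line breaks are an assumption.
TEMPLATE_TAIL = (
    "Now convert this:\n"
    "Question: [INPUT_QUESTION]\n"
    "Supporting Facts: [INPUT_FACTS]\n"
    "Facets:"
)

def fill_prompt(
    template: str,
    question: str,
    supporting_facts: Sequence[Tuple[str, int]],
) -> str:
    """Substitute the two placeholders; facts are rendered as a JSON list of [title, int] pairs."""
    facts = json.dumps(
        [[title, index] for title, index in supporting_facts],
        ensure_ascii=False,  # keep non-ASCII titles readable
    )
    return template.replace("[INPUT_QUESTION]", question).replace("[INPUT_FACTS]", facts)

prompt = fill_prompt(
    TEMPLATE_TAIL,
    "What year was the director of Blade Runner born?",
    [("Blade Runner", 1), ("Ridley Scott", 0)],
)
```

In practice the few-shot examples quoted in the entries would precede this tail; they are left out here because parts of them are truncated in the extracted text.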
