pith. sign in

arxiv: 2505.21072 · v5 · submitted 2025-05-27 · 💻 cs.CL

Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Pith reviewed 2026-05-19 13:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucination detectionretrieval-augmented generationRAGfactualityfaithfulnessuncertainty quantificationLLM outputs
0
0 comments X

The pith

FRANQ detects factual errors in RAG outputs more accurately by conditioning uncertainty quantification on faithfulness to the retrieved context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FRANQ to address the problem that standard uncertainty methods for detecting hallucinations in retrieval-augmented generation mix up two separate issues. One is whether a statement is factually true, and the other is whether it stays faithful to the specific evidence that was retrieved. FRANQ estimates faithfulness first and then applies different uncertainty measures for factuality depending on the result. The authors created a new long-form question answering dataset with manual validation for both labels and tested the approach on several datasets, tasks, and language models. A sympathetic reader would care because clearer separation of these two signals could reduce mistaken rejections of correct but unsupported answers while still catching real factual mistakes.

Core claim

FRANQ applies distinct uncertainty quantification techniques to estimate factuality conditioned on whether a statement is faithful to the retrieved context, and extensive experiments across multiple datasets, tasks, and LLMs demonstrate that this yields more accurate detection of factual errors in RAG-generated responses than existing approaches.

What carries the argument

FRANQ, which first estimates faithfulness to the retrieved context and then applies tailored uncertainty quantification for factuality depending on that estimate.

If this is right

  • Clearer distinction between unsupported true statements and actual factual errors in RAG responses.
  • More reliable hallucination detection for long-form question answering tasks.
  • Uncertainty estimates that can be adjusted separately for faithful and unfaithful statements.
  • Improved performance that holds across different large language models and retrieval setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The conditioning step could be applied to other generation settings that mix internal knowledge with external sources.
  • If faithfulness estimation itself carries high error, the overall factuality signal may still degrade in practice.
  • This separation might help downstream systems decide whether to retrieve more context or to trust the model's internal knowledge.

Load-bearing premise

The newly constructed dataset supplies reliable ground-truth labels for both factuality and faithfulness that can be used to evaluate and improve uncertainty quantification.

What would settle it

An experiment that shows FRANQ loses its accuracy advantage when the faithfulness labels are replaced by random guesses or by labels from an independent annotator who disagrees with the original dataset would falsify the central claim.

Figures

Figures reproduced from arXiv: 2505.21072 by Aleksandr Rubashevskii, Artem Shelmanov, Dzianis Piatrashyn, Ekaterina Fadeeva, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Roman Vashurin, Shehzaad Dhuliawala, Timothy Baldwin.

Figure 1
Figure 1. Figure 1: FRANQ illustration. Left: A user poses a question, and the RAG retrieves relevant documents and formulates an answer, potentially using information from the retrieved documents. Middle: The RAG output is decomposed into atomic claims. Right: The FRANQ method assesses factuality by evaluating three components: (1) faithfulness, (2) factuality under faithful condition, and (3) factuality under unfaithful con… view at source ↗
Figure 2
Figure 2. Figure 2: PRR of condition-calibrated FRANQ for different choices of UQfaith and UQunfaith. RAG-specific baselines. We also evaluate the two FRANQ components in isolation, AlignScore and Parametric Knowledge, to assess how much their combination in FRANQ improves over using each component individually (see Section 2.2). XGBoost methods. We include XGBoost models trained on factuality labels using two feature sets: (… view at source ↗
Figure 3
Figure 3. Figure 3: PRR comparison of FRANQ and XGBoost methods across different training set sizes. Computational efficiency [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template used with GPT-4o for decomposing an answer into a set of atomic claims. Task: Analyze the given text and the claim (which was extracted from the text). For each sentence in the text: 1. Copy the sentence exactly as it appears in the text. 2. Identify the words from the sentence that are related to the claim, in the same order they appear in the sentence. If no words are related, output "No … view at source ↗
Figure 4
Figure 4. Figure 4: Prompt used in short-form QA datasets. Titles and retrievals correspond to the Wikipedia page title and the passage retrieved from it. Using the context provided below, answer the question with a balanced approach. Ensure your response contains an equal number of claims or details drawn directly from the context and from your own knowledge: Context: passage 1:{retrieval1} passage 2:{retrieval2} passage 3:{… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt used with GPT-4o-search to automati [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Balance of classes of factuality annotations for the Llama 3B Instruct model. Each matrix is based on 100 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Balance of classes of factuality annotations for the Falcon 3B Base model. Each matrix is based on 76 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Balance of classes of faithfulness annotations for Llama 3B Instruct and Falcon 3B Base models. The [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Example illustrating intermediate Align [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Distribution of AlignScore-based faithfulness estimates on the long-form QA benchmark for Llama 3B [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Top vertices of first XGBoost tree trained [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Example outputs from FRANQ. Left: Each example includes the input question, retrieved passages, the LLM-generated answer, a selected claim from the answer, and corresponding factuality and faithfulness annotations. Claims and their spans in the answer are highlighted in yellow. If a claim is faithful, its corresponding span in the retrieved passages is also highlighted. Right: The FRANQ component scores a… view at source ↗
read the original abstract

Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FRANQ, a faithfulness-aware uncertainty quantification approach for detecting factual errors (hallucinations) in RAG outputs. It applies separate UQ techniques to factuality estimates while conditioning on whether statements are faithful to the retrieved context, constructs a new long-form QA dataset annotated for both factuality and faithfulness via automated labeling plus manual validation of challenging cases, and reports extensive experiments across datasets, tasks, and LLMs claiming superior factual-error detection over existing methods.

Significance. If the empirical advantages hold under reliable ground-truth labels, FRANQ could advance hallucination mitigation in RAG by explicitly separating factuality from faithfulness, thereby reducing erroneous flagging of correct but unsupported statements and yielding more trustworthy uncertainty estimates for retrieved-augmented generations. The new annotated dataset would constitute a useful community resource provided its labeling process is fully documented.

major comments (2)
  1. [Dataset Construction] Dataset Construction section: The new long-form QA dataset is described as the product of automated labeling followed by manual validation of challenging cases, yet the manuscript supplies no annotation guidelines, inter-annotator agreement statistics, number of validators, or details on how challenging cases were selected and resolved. This is load-bearing for the central claim, because every downstream comparison (performance tables, figures, and statistical tests) inherits the quality of these ground-truth labels for both factuality and faithfulness; without these details it is impossible to determine whether measured gains for FRANQ reflect genuine UQ improvement or artifacts of label noise.
  2. [Evaluation] Evaluation section: The claim of more accurate factual-error detection rests on comparisons to existing UQ methods, but the manuscript does not report concrete metrics (e.g., AUC, F1, or calibration error), baseline implementations, or statistical significance tests in sufficient detail to allow independent verification of the superiority result.
minor comments (1)
  1. [Abstract] Abstract: The abstract asserts superior performance from extensive experiments but would be strengthened by including at least one quantitative highlight (e.g., relative improvement on a primary metric).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of reproducibility for both the new dataset and the experimental claims. We address each point below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: The new long-form QA dataset is described as the product of automated labeling followed by manual validation of challenging cases, yet the manuscript supplies no annotation guidelines, inter-annotator agreement statistics, number of validators, or details on how challenging cases were selected and resolved. This is load-bearing for the central claim, because every downstream comparison (performance tables, figures, and statistical tests) inherits the quality of these ground-truth labels for both factuality and faithfulness; without these details it is impossible to determine whether measured gains for FRANQ reflect genuine UQ improvement or artifacts of label noise.

    Authors: We agree that comprehensive documentation of the labeling process is essential to substantiate the ground-truth quality. In the revised manuscript we have expanded the Dataset Construction section (now Section 3.2) with the complete annotation guidelines for both factuality and faithfulness, the number of validators (three), inter-annotator agreement statistics, and a precise description of how challenging cases were identified (low automated confidence or initial annotator disagreement) and resolved (majority vote following discussion). These additions are also summarized in a new appendix table. revision: yes

  2. Referee: [Evaluation] Evaluation section: The claim of more accurate factual-error detection rests on comparisons to existing UQ methods, but the manuscript does not report concrete metrics (e.g., AUC, F1, or calibration error), baseline implementations, or statistical significance tests in sufficient detail to allow independent verification of the superiority result.

    Authors: We acknowledge that additional detail is required for independent verification. The revised Evaluation section now explicitly lists the primary metrics (AUC, F1, and expected calibration error), provides implementation details and references for all baselines, and reports statistical significance results (paired t-tests with p-values) for the key comparisons. Full per-model and per-dataset tables have been moved to an appendix to improve clarity without altering the main narrative. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent dataset construction and evaluation

full rationale

The paper introduces FRANQ as an empirical uncertainty quantification approach for RAG outputs that conditions factuality estimates on faithfulness to retrieved context. It constructs a new long-form QA dataset via automated labeling plus manual validation of challenging cases, then evaluates detection accuracy across datasets, tasks, and LLMs. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the described method or claims. The central result (improved factual-error detection) rests on experimental comparisons rather than reducing to inputs by construction. Dataset label quality is a validity concern but does not create circularity per the enumerated patterns, as no load-bearing step quotes or reduces to a self-citation chain, ansatz, or fitted input.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no free parameters, axioms, or invented entities are identifiable or required by the described method.

pith-pipeline@v0.9.0 · 5766 in / 1099 out tokens · 59787 ms · 2026-05-19T13:11:47.296544+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

    cs.IR 2026-04 unverdicted novelty 6.0

    SmartVector augments embeddings with time, confidence, and relation signals plus a consolidation process, raising top-1 accuracy on versioned queries from 31% to 62% on a synthetic benchmark while cutting stale answer...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), ACL ’2024, pages 10862– 10878, Bangkok, Thailand

    RAGTruth: A hallucination corpus for de- veloping trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), ACL ’2024, pages 10862– 10878, Bangkok, Thailand. Association for Compu- tational Linguistics. Freda Shi, Xinyun Chen, Kanishka Misra, Nathan ...

  2. [2]

    Gemma 3 Technical Report

    ReDeEP: Detecting hallucination in retrieval- augmented generation via mechanistic interpretabil- ity. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR ’25, Singapore. Falcon-LLM Team. 2024. The Falcon 3 family of open models. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhe...

  3. [3]

    Photosynthesis in plants converts sunlight, carbon dioxide, and water into glucose and oxygen

    **Atomicity**: Break down each statement into the smallest possible unit of factual information. Avoid grouping multiple facts in one claim. For example: - Instead of: "Photosynthesis in plants converts sunlight, carbon dioxide, and water into glucose and oxygen." - Output: ["Photosynthesis in plants converts sunlight into glucose.", "Photosynthesis in pl...

  4. [4]

    This process is important for life

    **Context-Independent**: Each claim must be understandable and verifiable on its own without requiring additional context or references to other claims. Avoid vague claims like "This process is important for life."

  5. [5]

    **Precise and Unambiguous**: Ensure the claims are specific and avoid combining related ideas that can stand independently

  6. [6]

    python". ### Example: If the input text is:

    **No Formatting**: The response must be a Python list of strings without any extra formatting, code blocks, or labels like "python". ### Example: If the input text is: "Mary is a five-year-old girl. She likes playing piano and doesn’t like cookies." The output should be: ["Mary is a five-year-old girl.", "Mary likes playing piano.", "Mary doesn’t like coo...

  7. [7]

    Copy the sentence exactly as it appears in the text

  8. [8]

    No related words

    Identify the words from the sentence that are related to the claim, in the same order they appear in the sentence. If no words are related, output "No related words." Example: Text: "Sure! Here are brief explanations of each type of network topology mentioned in the passages: [...]" Claim: "Distributed Bus topology connects all network nodes to a shared t...

  9. [9]

    Though our Chestnuts are quite sweet right out of the bag, due to our SECRET post harvest conditioning process, for best flavor you should dry them slightly

    Properly stored fresh Chestnuts will feel hard as rocks if you squeeze them. Though our Chestnuts are quite sweet right out of the bag, due to our SECRET post harvest conditioning process, for best flavor you should dry them slightly. A unique aspect of Chestnuts is that they become sweeter as they dry

  10. [10]

    After the chestnuts fall from the tree, watch for the spiny burs to split

    Harvesting Chestnut Trees. After the chestnuts fall from the tree, watch for the spiny burs to split. Don’t harvest chestnuts if the burs are still green and closed because the nuts inside will be unripe. Harvest the nuts every couple of days

  11. [11]

    September

    When to harvest chestnuts? Chestnuts don’t ripen at the same time and chestnut harvest time can span as much as five weeks, although the nuts generally ripen in a 10- to 30-day span of time in late August and . September. Allow the nuts to fall from the tree naturally. LLM Answer: To determine the best time and method for harvesting chestnuts, it's essent...

  12. [12]

    P(c is faithful to r) = AlignScore(c, r) = 0.98

  13. [13]

    MaxProb(c) = 2.7·10-6

  14. [14]

    This results in an appropriately highFRANQscore

    ParametricKnowledge(c) = = = 0.52 · 0.66 · … · 0.32 = 3.5·10-15 52% 66% 33% 44% 22% 2% 83% 3% 69% 87% 13% 100% 0.1% 32% 0.2% 0.6% 32% Token probabilities from parametric knowledge FRANQno calibration(c) = = 0.98 · 2.7 ·10-6 + 0.02 · 3.5 · 10-15 = 2.6·10-6 FRANQcondition-calibrated(c) = = 0.98 · f(2.7 ·10-6 ) + 0.02 · g(3.5·10-15) = 0.59 0.6 0.13 (a)Faithf...

  15. [15]

    amounts of RAM (mostly SRAM) are also integrated in the CPU and other ICs on the motherboard, as well as in hard-drives, CD-ROMs, and …

  16. [16]

    Because of this refresh process, DRAM uses more power, but it can achieve greater storage densities and …

    must be refreshed periodically. Because of this refresh process, DRAM uses more power, but it can achieve greater storage densities and …

  17. [17]

    will always find it easiest to grasp the physical configuration. Much of the confusion surrounding memory geometry occurs when … LLM Answer: RAM, or Random Access Memory, plays a crucial role in a computer system… RAM is a volatile memory technology, meaning that its contents are lost when the power is turned off … Claim: RAM is a volatile memory technolo...

  18. [18]

    P(c is faithful to r) = AlignScore(c, r) = 0.05

  19. [19]

    In the uncalibrated version, the final score is underestimated due to the uncalibrated Parametric Knowledge score

    ParametricKnowledge(c) = = = 1.0 · 0.81 · … · 0.44 = 0.05 Token probabilities from parametric knowledge FRANQno calibration(c) = = 0.05 · 0.17 + 0.95 · 0.05 = 0.06 FRANQcondition-calibrated(c) = = 0.05 · f(0.17) + 0.95 · g(0.05) = 0.84 0.78 0.85 100% 81% 14% 98% 99% 44% (b)Unfaithful–True.FRANQaccurately detects the claim’s low faithfulness and assigns it...

  20. [20]

    Untreated, type 2 diabetes can be life-threatening

    With type 2 diabetes, your body either resists the effects of insulin — a hormone that regulates the movement of sugar into your cells — or doesn’t produce enough insulin to maintain a normal glucose level. Untreated, type 2 diabetes can be life-threatening

  21. [21]

    Glucose is a form of carbohydrate that comes from foods such as breads, cereals, pasta, rice, potatoes, fruits and some vegetables

    More specifically, diabetes is a condition that affects the body's ability to use glucose (a type of sugar) as fuel. Glucose is a form of carbohydrate that comes from foods such as breads, cereals, pasta, rice, potatoes, fruits and some vegetables

  22. [22]

    Now, the cause of type 2 diabetes is quite different from type 1. The cause of type 2 diabete is primarily a complicated medical condition called 'insulin resistance.' In fact, in the early stages of type 2 diabetes, there's plenty of insulin around, it just doesn't work well. LLM Answer: Determining which type of diabetes is worse is a complex task … Typ...

  23. [23]

    P(c is faithful to r) = AlignScore(c, r) = 0.04

  24. [24]

    MaxProb(c) = 7.0 · 10-19

  25. [25]

    ParametricKnowledge(c) = = = 0.005 · 1.0 · … · 0.96 = 3.8 · 10-15 0.5% Token probabilities from parametric knowledge FRANQno calibration(c) = = 0.04 · 7.0·10-19 + 0.96 · 3.8·10-15 = 3.6·10-15 FRANQcondition-calibrated(c) = = 0.04 · f(7.0·10-19) + 0.96 · g(3.8·10-15) = 0.14 0.24 0.14 100% 30% 37% 6% 2% 91% 100% 78% 0.1% 1% 99% 38% 41% 100% 100% 60% 57% 100...