Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Michael R. DeMarco

arxiv: 2605.31506 · v2 · pith:NTYILTUTnew · submitted 2026-05-29 · 💻 cs.IR · cs.CL

Pith reviewed 2026-06-28 20:37 UTC · model grok-4.3

classification 💻 cs.IR cs.CL

keywords factual density · RAG retrieval · medical AI accuracy · HealthFC benchmark · factuality analysis · evidence saturation · retrieval reranking

The pith

In a preliminary HealthFC evaluation, factual density reranking reached 100 percent systematic review saturation in the top-5 results, surfacing Cochrane evidence that cosine similarity ranked outside the top ten.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Factual Density (FD*) as a retrieval signal that scores documents by the proportion of verified atomic claims they contain relative to their token count. Standard RAG methods rank by keyword match or topic similarity and therefore bury high-fact content under lexically dominant text on the same topic. After correcting a length bias through Z-score normalization within bins, FD* reranking on the HealthFC benchmark of expert-labeled health claims reaches 100 percent systematic review saturation in the top five results. It alone surfaces Cochrane evidence that cosine similarity ranks outside the top ten, with ground-truth checks confirming 25 mappings across seven supported claims. This positions factual density as a low-cost addition to health RAG pipelines.

Core claim

Factual Density (FD*) measures the proportion of verified atomic claims relative to total token count after probabilistic factuality analysis and Z-score normalization within length bins. On the HealthFC benchmark, FD*-optimized retrieval was the only condition to achieve 100 percent systematic review saturation in top-5 results, surfacing Cochrane evidence ranked outside the top ten by cosine similarity, with ground truth verification confirming 25 mappings across seven supported claims.
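
As a hedged reading of the definition above, the raw density and its binned normalization can be written as follows. The symbols d, V(d), T(d), b(d), \mu and \sigma are notation introduced here for illustration and do not come from the paper. The source uses FD* as the name of the signal throughout, so the split into a raw FD and a normalized FD* is also an assumption of this sketch.

```latex
\mathrm{FD}(d) = \frac{|V(d)|}{T(d)},
\qquad
\mathrm{FD}^{*}(d) = \frac{\mathrm{FD}(d) - \mu_{b(d)}}{\sigma_{b(d)}}
```

Here V(d) is the set of atomic claims in document d that the factuality analysis marks as verified, T(d) is the token count of d, b(d) is the length bin that d falls into, and \mu_{b} and \sigma_{b} are the mean and standard deviation of FD over the documents in bin b. The abstract does not say whether a population or sample standard deviation is used.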

What carries the argument

Factual Density (FD*), the proportion of verified atomic claims to total token count, computed via probabilistic factuality analysis before corpus ingestion and made length-independent by Z-score normalization within length bins.
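
A minimal Python sketch of this scoring step, under stated assumptions: the verified-claim counts are taken as given (the Ghost Audit analysis itself is not reproduced), the bin edges are hypothetical, and a population standard deviation is used because the source does not specify one. The function names are illustrative only.

```python
from statistics import mean, pstdev


def raw_density(verified_claims, token_count):
    """Verified atomic claims per token (0.0 for an empty document)."""
    return verified_claims / token_count if token_count else 0.0


def bin_index(token_count, edges):
    """Index of the length bin; `edges` are ascending inclusive upper bounds (hypothetical)."""
    for i, edge in enumerate(edges):
        if token_count <= edge:
            return i
    return len(edges)


def normalized_density(docs, edges):
    """Z-score each document's raw density against the other documents in its length bin.

    docs: list of (verified_claims, token_count) pairs.
    """
    raw = [raw_density(v, t) for v, t in docs]
    bins = [bin_index(t, edges) for _, t in docs]

    # Collect raw densities per bin, then compute per-bin mean and std.
    groups = {}
    for r, b in zip(raw, bins):
        groups.setdefault(b, []).append(r)
    stats = {b: (mean(rs), pstdev(rs)) for b, rs in groups.items()}

    scores = []
    for r, b in zip(raw, bins):
        mu, sigma = stats[b]
        # A bin with no spread carries no ranking information; score it as 0.
        scores.append((r - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Synthetic example: two short and two long documents, one bin edge at 200 tokens.
docs = [(5, 100), (10, 100), (8, 400), (20, 400)]
scores = normalized_density(docs, edges=[200])
```

In this toy input the long documents have lower raw density than the short ones, but after normalization each document is compared only with documents of similar length, which is the effect the length fix is meant to achieve.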

If this is right

FD* reranking surfaces Cochrane evidence that standard cosine similarity ranks beyond the top ten.
Z-score normalization within length bins removes the severe document-length confound (Pearson R = -0.8636).
Ground truth verification confirms 25 mappings across seven HealthFC-supported claims under the FD* condition.
Factual density reranking offers a low-cost intervention for factual precision in health RAG architectures.
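
The length-confound check in the list above can be sketched as a plain Pearson correlation between document length and density score. This is an illustrative sketch with synthetic numbers, not the paper's data or code; it computes only the coefficient, whereas a library routine such as `scipy.stats.pearsonr` would also return a p-value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Synthetic data: raw density falls as documents get longer.
lengths = [100, 200, 300, 400]
raw_fd = [0.08, 0.06, 0.04, 0.02]
r_before = pearson_r(lengths, raw_fd)
```

Running the same function on the normalized scores and comparing the two coefficients is one way to confirm that a fix of this kind reduced the dependence on length.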

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-scoring pipeline could be applied to other domains that maintain expert-verified claim sets, such as legal precedents or scientific abstracts.
Hybrid ranking that combines FD* with existing similarity scores may improve overall recall at negligible extra cost.
Extending the evaluation to the full n=50 query set would test whether the observed saturation advantage persists beyond the reported cases.
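
The hybrid-ranking extension mentioned above could look roughly like the following. This is a sketch of one possible blend, not something the paper evaluates: the min-max rescaling, the weight `alpha`, and the function names are all assumptions made for illustration.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rerank(doc_ids, cosine_scores, fd_scores, alpha=0.5, k=5):
    """Blend similarity and density scores and return the top-k document ids.

    alpha = 1.0 reproduces pure similarity ranking; alpha = 0.0 ranks by density alone.
    """
    cos_n = min_max(cosine_scores)
    fd_n = min_max(fd_scores)
    blended = [alpha * c + (1 - alpha) * f for c, f in zip(cos_n, fd_n)]
    ranked = sorted(zip(doc_ids, blended), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Synthetic example: document "c" is least similar but most fact-dense.
ids = ["a", "b", "c"]
cosine = [0.9, 0.8, 0.7]
density = [-1.0, 0.0, 2.0]
top = hybrid_rerank(ids, cosine, density, alpha=0.3, k=2)
```

Rescaling both signals first keeps either one from dominating purely because of its numeric range; the right value of `alpha` would have to be chosen on held-out queries.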

Load-bearing premise

The probabilistic factuality analysis produces accurate, unbiased labels for atomic claims that remain independent of the retrieval ranking task.

What would settle it

Running the full evaluation on the complete set of 50 aligned queries and checking whether FD* still achieves 100 percent saturation while cosine similarity continues to miss the same Cochrane items.
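
A check of this kind needs a concrete saturation metric. The abstract does not define "systematic review saturation" precisely, so the sketch below assumes one plausible reading: the fraction of top-k slots occupied by systematic-review documents, averaged over queries. The document ids and function names are hypothetical.

```python
def saturation_at_k(ranked_ids, systematic_review_ids, k=5):
    """Fraction of the top-k results that are systematic reviews (assumed definition)."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in systematic_review_ids) / len(top)


def mean_saturation(rankings, systematic_review_ids, k=5):
    """Average saturation over a mapping of query id -> ranked document ids."""
    values = [saturation_at_k(r, systematic_review_ids, k) for r in rankings.values()]
    return sum(values) / len(values) if values else 0.0


# Synthetic rankings for one query under two hypothetical conditions.
reviews = {"sr1", "sr2", "sr3", "sr4", "sr5"}
fd_ranking = ["sr1", "sr2", "sr3", "sr4", "sr5", "rct1"]
cos_ranking = ["rct1", "obs1", "sr1", "rct2", "obs2", "sr2"]

fd_sat = saturation_at_k(fd_ranking, reviews)
cos_sat = saturation_at_k(cos_ranking, reviews)
```

Computing this per query and per condition across the full query set would give the per-condition breakdown that a settling experiment would need.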

Original abstract

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FD* depends on unvalidated probabilistic factuality labels, which undercuts the HealthFC saturation claim.

The paper spots a real issue in medical RAG: standard cosine retrieval can bury short, fact-dense documents. The authors define FD* as verified atomic claims divided by token count, apply Z-score normalization inside length bins to remove the strong length correlation (Pearson R = -0.8636 before the fix; p = 0.0749 after), and report that only the FD* condition reaches 100% systematic review saturation in top-5 on HealthFC, pulling in Cochrane evidence that cosine similarity ranked outside the top ten.

The concrete numbers and the ground-truth mapping check (25 mappings across seven claims) are the parts that hold up. The length fix is a straightforward statistical correction that works on the data they show.

The weak spot is more than minor. The entire signal rests on the Ghost Audit probabilistic factuality step, yet the abstract supplies no inter-annotator agreement, no expert validation set, and no comparison to human medical labels. Because that same step is used both to filter the corpus and to compute the density scores, any bias or noise there propagates directly into the reported gain. The evaluation is also labeled preliminary, with full n=50 validation left for future work, and no error bars or run-to-run details appear.

This is for people already tuning RAG pipelines in health or other high-stakes domains who want to test a density reranker. A reader could pull the length-normalization idea and try it, but the lack of label validation makes the superiority claim hard to rely on. The work shows clear thinking about the retrieval objective but needs the factuality component grounded before it changes practice.

Send it to peer review only if the authors add an external validation experiment for the factuality labels; otherwise the foundation stays too thin.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard RAG retrieval suffers from an 'Expert Blindness Effect' by favoring lexically similar but low-fact-density text, and introduces Factual Density (FD*)—defined as the proportion of verified atomic claims (from probabilistic factuality analysis in the Ghost Audit pipeline) to total tokens—as a retrieval signal. After observing a strong negative length correlation (Pearson R = -0.8636) in an initial formulation, the authors apply Z-score normalization within length bins to produce a length-independent signal (post-fix p = 0.0749). On the HealthFC benchmark, FD*-optimized retrieval is reported as the only method achieving 100% systematic review saturation in top-5 results, surfacing Cochrane evidence missed by cosine similarity, with ground-truth verification of 25 mappings across seven claims; full statistical validation across n=50 queries is noted as future work.

Significance. If the central claims hold after proper validation, FD* could provide a lightweight, domain-agnostic reranking signal that improves factual precision in medical RAG without requiring changes to the underlying retriever. The reported 100% saturation outcome and the explicit contrast with cosine similarity on a concrete benchmark constitute a falsifiable prediction that, if replicated, would be of practical interest to health-AI systems. However, the current evidence base is preliminary and the significance is constrained by the absence of independent validation for the factuality labels on which FD* depends.

major comments (3)

[Abstract] Abstract and Ghost Audit pipeline description: FD* is defined using probabilistic factuality labels that are applied both to filter the corpus before ingestion and to compute the density scores yielding the 100% saturation result. No inter-annotator agreement, expert validation set, calibration details, or comparison against human medical labels is reported for this analysis. Because the performance gain and the length-normalization fix rest on these labels, the absence of external validation is load-bearing for the central claim.
[Abstract] Abstract: The length confound (Pearson R = -0.8636, p = 2.27e-07) was identified on the same data used to motivate and evaluate the Z-score normalization fix (post-fix p = 0.0749). This raises the possibility that the normalization boundaries and the reported independence are post-hoc adjustments rather than an a-priori, held-out test of the FD* signal.
[Abstract] Abstract: The 100% top-5 saturation claim and the statement that 'FD*-optimized retrieval was the only condition' to achieve it are presented without error bars, multiple-run statistics, or the full n=50 query results (explicitly deferred to future work). The ground-truth verification of 25 mappings is mentioned but not broken down by query or retrieval condition, making it impossible to assess robustness of the superiority claim.
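
The held-out test raised in the second major comment could be sketched as follows: bin edges and per-bin statistics are fitted on one split and then applied unchanged to a separate evaluation split. This is an illustrative design, not the authors' procedure; quantile-based edges, the population standard deviation, and all names are assumptions.

```python
from bisect import bisect_left
from statistics import mean, pstdev, quantiles


def fit_normalizer(fit_docs, n_bins=2):
    """Fit length-bin edges and per-bin (mean, std) on a fitting split.

    fit_docs: list of (raw_density, token_count) pairs.
    """
    lengths = [t for _, t in fit_docs]
    edges = quantiles(lengths, n=n_bins)  # n_bins - 1 cut points
    groups = {}
    for fd, t in fit_docs:
        groups.setdefault(bisect_left(edges, t), []).append(fd)
    stats = {b: (mean(v), pstdev(v)) for b, v in groups.items()}
    return edges, stats


def apply_normalizer(eval_docs, edges, stats):
    """Z-score evaluation documents using only statistics from the fitting split."""
    scores = []
    for fd, t in eval_docs:
        mu, sigma = stats.get(bisect_left(edges, t), (0.0, 0.0))
        # Bins unseen or without spread in the fitting split get a neutral score.
        scores.append((fd - mu) / sigma if sigma > 0 else 0.0)
    return scores


# Synthetic splits: four fitting documents, two evaluation documents.
fit_docs = [(0.10, 100), (0.06, 120), (0.03, 400), (0.01, 500)]
eval_docs = [(0.12, 150), (0.02, 450)]
edges, stats = fit_normalizer(fit_docs)
z_scores = apply_normalizer(eval_docs, edges, stats)
```

Because nothing from the evaluation split influences the edges or the statistics, a length-independence test on `z_scores` would not be subject to the post-hoc concern.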

minor comments (2)

[Abstract] The abstract states that full statistical validation 'remains future work due to constraints on corpus-benchmark alignment'; a brief description of those alignment constraints would help readers understand the scope of the current results.
No table or supplementary material is referenced that lists the 25 verified mappings or the seven HealthFC claims, which would allow independent inspection of the ground-truth verification step.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive feedback highlighting the preliminary nature of our study. We address each major comment point-by-point below with honest clarifications on what the current manuscript can and cannot support.

Referee: [Abstract] Abstract and Ghost Audit pipeline description: FD* is defined using probabilistic factuality labels that are applied both to filter the corpus before ingestion and to compute the density scores yielding the 100% saturation result. No inter-annotator agreement, expert validation set, calibration details, or comparison against human medical labels is reported for this analysis. Because the performance gain and the length-normalization fix rest on these labels, the absence of external validation is load-bearing for the central claim.

Authors: We agree that the manuscript does not report inter-annotator agreement, expert validation sets, or direct comparisons of the Ghost Audit probabilistic labels against human medical annotations. The pipeline is presented as an automated, lightweight preprocessing tool rather than a human-validated factuality oracle. This reliance is a genuine limitation for the central claims. We will revise the manuscript to add an explicit Limitations section discussing the probabilistic nature of the labels and the absence of external validation.
revision: yes
Referee: [Abstract] Abstract: The length confound (Pearson R = -0.8636, p = 2.27e-07) was identified on the same data used to motivate and evaluate the Z-score normalization fix (post-fix p = 0.0749). This raises the possibility that the normalization boundaries and the reported independence are post-hoc adjustments rather than an a-priori, held-out test of the FD* signal.

Authors: The length correlation was observed during exploratory analysis on the HealthFC corpus, prompting the development of the Z-score normalization within length bins as a methodological correction. The post-fix p-value reflects the outcome of that correction applied uniformly to the same benchmark. While we acknowledge the data overlap, the normalization procedure is deterministic and was not tuned to achieve a specific result on held-out data. We will add a sentence clarifying the exploratory origin of the fix but maintain that it produces a length-independent signal as reported.
revision: partial
Referee: [Abstract] Abstract: The 100% top-5 saturation claim and the statement that 'FD*-optimized retrieval was the only condition' to achieve it are presented without error bars, multiple-run statistics, or the full n=50 query results (explicitly deferred to future work). The ground-truth verification of 25 mappings is mentioned but not broken down by query or retrieval condition, making it impossible to assess robustness of the superiority claim.

Authors: We agree the 100% saturation result is presented without error bars, multiple runs, or the full n=50 statistics, and the ground-truth verification of 25 mappings lacks per-query breakdown. The manuscript already states that full statistical validation across n=50 queries is future work due to corpus-benchmark alignment constraints. We will revise the abstract and results to emphasize the preliminary character of the 100% figure, remove any implication of definitive superiority, and note the limited scope of the 25-mapping verification.
revision: yes

standing simulated objections not resolved

Independent expert validation or inter-annotator agreement for the Ghost Audit probabilistic factuality labels
Full n=50 query statistical results with error bars and per-condition breakdowns, as these are deferred to future work
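
For the error bars named in the second objection, one option that stays informative even when every observed query succeeds is a Wilson score interval on the proportion of queries reaching full saturation. The sketch below is an illustration only: treating each query as a success or failure is an assumption, and the counts passed in are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: every one of a small number of queries succeeds.
low, high = wilson_interval(7, 7)
```

With a small sample the lower bound sits well below 1 even at a perfect observed rate, which is why a 100% figure on a handful of queries needs the larger query set before it can support a superiority claim.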

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines FD* using probabilistic factuality analysis from the Ghost Audit pipeline as a preprocessing step, observes a length confound on initial formulation, applies Z-score normalization within bins, and evaluates the resulting retrieval on the external HealthFC benchmark with expert labels and ground-truth mappings. No equations or steps are shown that reduce the reported 100% saturation result or performance claims to the inputs by construction. The benchmark evaluation provides independent verification separate from the internal scoring pipeline, making the central claim self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claim depends on an unvalidated probabilistic factuality analyzer and on the assumption that Z-score normalization within length bins removes all length-related bias without introducing new selection effects. No external validation of the analyzer and no machine-checked components are referenced.

free parameters (1)

length bin boundaries for Z-score normalization
Chosen after observing the Pearson correlation on the evaluation corpus; the abstract does not state how bin edges were selected.

axioms (2)

domain assumption · Probabilistic factuality analysis produces accurate counts of verified atomic claims
Invoked to compute FD* before corpus ingestion; no validation details supplied.
domain assumption · HealthFC labels constitute reliable ground truth for medical claim support
Used to measure saturation and to confirm 25 mappings.

invented entities (2)

Factual Density (FD*) · no independent evidence
purpose: Length-independent retrieval signal based on verified-claim proportion
Newly defined in the paper; no independent evidence outside this work is cited.
Expert Blindness Effect · no independent evidence
purpose: Label for the tendency of lexical retrieval to ignore high-density factual content
Term introduced to motivate the work; no prior literature reference given.

pith-pipeline@v0.9.1-grok · 5835 in / 1779 out tokens · 30376 ms · 2026-06-28T20:37:36.766951+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Introduction Health misinformation is a documented public health risk with measurable long-term consequences for individuals and health systems (Tabatabaei Far & Ahmadi Marzaleh, 2025). As large language models become embedded in consumer-facing health applications, the reliability of the information they surface has moved from an academic concern to a cl...

2025
[2]

(2021) established RAG as the standard for grounding LLM outputs in external knowledge, proving it outperforms purely parametric models on knowledge-heavy tasks

Related Work Evaluating Factual Density in Multi-Source RAG NexusAgentics Research arXiv preprint Page 3 - 2.1 Retrieval-Augmented Generation Lewis et al. (2021) established RAG as the standard for grounding LLM outputs in external knowledge, proving it outperforms purely parametric models on knowledge-heavy tasks. Gao et al. (2023) subsequently mapped th...

2021
[3]

Because it maps real-world claims to objective truth labels, it is the appropriate benchmark for testing health-domain RAG precision

provides 750 health claims annotated for veracity by medical experts across three labels: Supported, Refuted, and No Evidence. Because it maps real-world claims to objective truth labels, it is the appropriate benchmark for testing health-domain RAG precision. HealthFC labels are withheld from the ingestion and retrieval pipeline entirely in this work, pr...

2024
[4]

A 2021 RCT found 45% efficacy in Phase 3 trials

Methodology 3.1 Corpus Construction A 600-chunk evidence hierarchy corpus was constructed from three source tiers, each representing a distinct level of medical evidence authority. All abstracts were retrieved via the NCBI Entrez API using the Biopython library, ensuring full reproducibility: any researcher with an NCBI email can execute the identical que...

2021
[5]

Conclusion This paper introduced Factual Density (FD*), a novel retrieval optimization signal for health RAG systems that measures the concentration of probabilistically verified atomic claims per token. Three experiments were conducted to validate the metric, characterize a previously undocumented retrieval failure mode, and establish a methodology for c...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.13845 2023
[6]

Dalal, Jennifer L

(pp. 8095-8107). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.709 Zakka, C., Chaurasia, A., Shad, R., Dalal, A. R., Kim, J. L., Moor, M., Alexander, K., Ashley, E., Boyd, J., Boyd, K., Hirsch, K., Langlotz, C., Nelson, J., & Hiesinger, W. (2024). Almanac: Retrieval-augmented language models for clinical medicine. NEJM AI, 1(2). https://doi.org/1...

work page doi:10.1056/aioa2300068 2024
