Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Ramayya Krishnan; Rema Padman; Yubo Li

arXiv: 2605.29084 · v1 · pith:7BKWA45Knew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.IR


Pith reviewed 2026-06-29 12:19 UTC · model grok-4.3


keywords: retrieval-augmented generation, source dependence, medical question answering, inter-source disagreement, transplant patient education, RAG evaluation, multi-source corpora


The pith

Multi-source RAG systems produce different answers to the same medical question depending on the institutional source retrieved, so evaluation must shift from single-answer correctness to inter-source relationships.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RAG benchmarks assume one gold answer, yet institutional sources such as transplant handbooks can legitimately disagree on the same patient question. The paper treats source-dependence as a distinct failure mode that single-source evaluation cannot detect. It introduces TransplantQA, a set of real questions each paired with answers grounded in multiple candidate sources, together with HERO-QA retrieval and a structured judge that labels answer pairs on a five-category taxonomy. When retrieval quality improves, the measured rate of disagreement rises sharply, indicating that earlier studies understated how common source-dependence is. The same auditing approach is presented as applicable to any multi-author corpus.

Core claim

In transplant patient education, where institutional handbooks demonstrably disagree, grounding generation in different sources produces conflicting answers to identical questions. Auditing therefore requires measuring the relationship between those answers rather than checking against a single correct response; a structured-output judge assigns each pair one of five validated labels. At scale, stronger retrieval exposes substantially more disagreement than prior estimates had indicated: those estimates understated how prevalent disagreement is, not how intense it is.

What carries the argument

A structured-output judge that assigns one of five validated labels to the relationship between answers generated from different sources.
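
A minimal sketch of what the judge's closed label set implies for an auditing harness, assuming the judge returns JSON with a `label` field. The field name, the function names, and the label strings are all hypothetical: the first three labels echo the categories named in the simulated rebuttal, and the last two are placeholders, since the page does not list the full taxonomy.

```python
import json
from itertools import combinations

# Hypothetical label set (assumption): three names follow the simulated
# rebuttal's examples; the remaining two are placeholders, not the paper's.
LABELS = {
    "full_agreement",
    "partial_agreement",
    "contradiction",
    "placeholder_label_4",
    "placeholder_label_5",
}


def parse_judgement(raw: str) -> str:
    """Parse one judge response and reject anything outside the closed label set."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("judge output must be a JSON object")
    label = obj.get("label")
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


def source_pairs(answers: dict[str, str]) -> list[tuple[str, str]]:
    """Enumerate the unordered source pairs to be judged for one question."""
    return list(combinations(sorted(answers), 2))


# Toy input: three sources answering the same question (answer text elided).
answers = {"handbook_a": "...", "handbook_b": "...", "handbook_c": "..."}
pairs = source_pairs(answers)
```

With n sources answering a question there are n(n-1)/2 pairs to judge, so the number of judge calls grows quadratically in the number of sources that address the question.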

If this is right

Evaluation of multi-source RAG must include inter-source relationship labels in addition to correctness checks.
The TransplantQA benchmark and HERO-QA strategy enable systematic auditing of source-dependence in medical QA.
The five-label taxonomy provides a reusable tool for quantifying disagreement between grounded answers.
The auditing framework transfers directly to legal and educational RAG systems that draw from multiple institutional sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG interfaces could surface multiple source-grounded answers and flag conflicts rather than synthesize a single response.
Persistent source-dependence may indicate underlying inconsistencies in the source documents themselves.
The taxonomy could be extended to measure whether disagreement correlates with downstream user confusion or harm.

Load-bearing premise

The structured-output judge reliably assigns the five validated labels to inter-source answer relationships without systematic bias or need for human verification on the TransplantQA data.

What would settle it

A human re-labeling of a random sample of TransplantQA answer pairs that shows low agreement with the judge, or an experiment in which improved retrieval does not increase the observed disagreement rate.
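
The second test can be sketched as a comparison of disagreement prevalence under two retrieval settings. The definition used here (share of questions with at least one source pair labeled as a disagreement) is an assumption for illustration, the label strings are hypothetical, and the toy data is invented, not taken from the paper.

```python
# Hypothetical label strings counted as disagreement (assumption).
DISAGREEMENT_LABELS = {"contradiction"}


def disagreement_prevalence(pair_labels: dict[str, list[str]],
                            disagreement_labels: set[str]) -> float:
    """Share of questions with at least one pair labeled as a disagreement."""
    if not pair_labels:
        raise ValueError("no questions to score")
    flagged = sum(
        1
        for labels in pair_labels.values()
        if any(label in disagreement_labels for label in labels)
    )
    return flagged / len(pair_labels)


# Invented toy data: question id -> judge labels for its source pairs.
baseline = {
    "q1": ["agree", "agree"],
    "q2": ["agree"],
    "q3": ["contradiction"],
    "q4": ["agree"],
}
improved = {
    "q1": ["agree", "contradiction"],
    "q2": ["agree"],
    "q3": ["contradiction"],
    "q4": ["contradiction", "agree"],
}

# A non-positive delta on real data would count against the paper's claim.
delta = (disagreement_prevalence(improved, DISAGREEMENT_LABELS)
         - disagreement_prevalence(baseline, DISAGREEMENT_LABELS))
```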

Figures

Figures reproduced from arXiv: 2605.29084 by Ramayya Krishnan, Rema Padman, Yubo Li.

**Figure 1.** TransplantQA construction. Patient questions are harvested from real online transplant communities and …

**Figure 2.** HERO-QA: a multi-layer retrieval system. A query is routed by handbook length: short handbooks …

**Figure 3.** Handbook × question-organ absence rate. Rows are the 102 handbooks (grouped and colour-coded by organ); columns are the six question-organ groups. Red = the handbook is silent on that organ’s questions. The block-diagonal structure reflects that organ-specific handbooks answer mainly their own-organ and general questions; rows that are pale across all columns (e.g., several Mayo Clinic, UChicago, Houston M…
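
One way to read the absence rate in Figure 3 is as the share of a handbook's answers, per question-organ group, that come back as the fixed refusal string quoted in the extracted generation prompt ("NOT ADDRESSED: ..."). The sketch below assumes that reading; the record layout, function name, and toy records are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict

# Prefix of the refusal string quoted in the extracted generation prompt.
NOT_ADDRESSED_PREFIX = "NOT ADDRESSED:"


def absence_rates(records):
    """Map (handbook, question_organ) to the share of answers that are refusals.

    records: iterable of (handbook, question_organ, answer) tuples.
    """
    silent = defaultdict(int)
    total = defaultdict(int)
    for handbook, organ, answer in records:
        key = (handbook, organ)
        total[key] += 1
        if answer.strip().startswith(NOT_ADDRESSED_PREFIX):
            silent[key] += 1
    return {key: silent[key] / total[key] for key in total}


# Invented toy records for a single hypothetical kidney handbook.
refusal = "NOT ADDRESSED: This handbook does not contain information on this topic."
records = [
    ("kidney_handbook", "kidney", "Grounded answer text."),
    ("kidney_handbook", "kidney", "Another grounded answer."),
    ("kidney_handbook", "liver", refusal),
    ("kidney_handbook", "liver", "Grounded answer text."),
]
rates = absence_rates(records)
```

A cell with rate 1.0 would correspond to a handbook that is silent on every question in that organ group.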

Original abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a useful framing and artifacts for auditing source disagreements in multi-source medical RAG, but the judge reliability is the part that still needs evidence.


The main thing to know is that this paper treats source-dependence as its own evaluation axis instead of assuming a single correct answer. They build TransplantQA from real patient questions each paired with answers drawn from multiple institutional handbooks, add HERO-QA as a hierarchical retrieval method that surfaces and audits those answers, and supply a structured judge that assigns one of five labels to the relationship between sources.

The artifacts and the domain-agnostic pitch are the parts that land. Releasing a benchmark grounded in actual conflicting handbooks, plus a retrieval strategy that makes the disagreements visible, gives people working on medical or legal RAG something concrete to test against. The observation that stronger retrieval surfaces more disagreement than single-source setups had suggested is a straightforward practical point.

The soft spot is the judge. The scale claim rests on the 5-label assignments being accurate enough to measure prevalence. The abstract calls the taxonomy validated, yet the details on human agreement, error rates, or prompt sensitivity on the actual TransplantQA instances are not shown in what is available. If the judge systematically flags disagreement because of phrasing differences or source formatting, the finding about understated prevalence becomes harder to trust. That assumption is load-bearing for the empirical contribution.

This is for researchers who build or evaluate RAG in domains where authoritative sources can legitimately differ. A reader focused on evaluation methods or safety-critical deployment would get value from the benchmark and taxonomy even if the numbers need more checking. It deserves a serious referee so the validation gaps can be addressed and the results can be stress-tested properly.

Referee Report

1 major / 2 minor

Summary. The manuscript argues that source-dependence is an overlooked failure mode in multi-source RAG systems, where the same question can yield different answers depending on the retrieved source. To address this, the authors release TransplantQA, a benchmark of patient questions grounded in multiple institutional handbooks; HERO-QA, a hierarchical retrieval strategy; and a structured-output judge using a 5-label taxonomy for inter-source relationships. They claim that, at scale, better retrieval exposes substantially more disagreement than prior estimates suggested, and that those estimates understated the prevalence of disagreement rather than its intensity. The approach is presented as domain-agnostic.

Significance. If the empirical findings hold, the work makes a valuable contribution by identifying a missing evaluation axis for RAG systems and providing concrete auditing tools. The emphasis on inter-source relationships over correctness to a single answer is a useful reframing for domains with inherent source disagreement, such as medicine. Releasing the benchmark, retrieval method, and judge supports reproducibility and extension to other fields like law and education.

major comments (1)

[Abstract and Methods (structured-output judge)] The claim that the 5-label taxonomy is 'validated' and that better retrieval reveals more disagreement rests on the judge's assignments, yet no human verification, inter-annotator agreement rates, or bias/error analysis is reported on TransplantQA instances. Without these, it is unclear whether the increase in detected disagreement is a property of the sources or an artifact of prompt sensitivity or label assignment on medical phrasing.

minor comments (2)

[Abstract] The abstract states the framework 'transfers to legal and educational RAG' but provides no concrete transfer experiment or adaptation details.
[Methods] Notation for the five labels in the taxonomy could be introduced with an explicit table or figure early in the paper to aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger validation of the structured-output judge. We address this concern directly below.


Referee: [Abstract and Methods (structured-output judge)] The claim that the 5-label taxonomy is 'validated' and that better retrieval reveals more disagreement rests on the judge's assignments, yet no human verification, inter-annotator agreement rates, or bias/error analysis is reported on TransplantQA instances. Without these, it is unclear whether the increase in detected disagreement is a property of the sources or an artifact of prompt sensitivity or label assignment on medical phrasing.

Authors: We agree that the initial manuscript does not report human verification, inter-annotator agreement, or a dedicated bias/error analysis of the judge on TransplantQA instances. The 5-label taxonomy was constructed iteratively with input from transplant clinicians to reflect observable inter-source relationships (full agreement, partial agreement, contradiction on key facts, etc.), but this design process and any associated checks were not documented with quantitative human evaluation in the submitted version. In the revision we will add a new subsection under Methods that reports: (1) a human annotation study on a random sample of 200 TransplantQA instances (two annotators with medical background), (2) inter-annotator agreement (Cohen's kappa) both on the 5-label taxonomy and on a collapsed 3-label version, and (3) an error analysis that compares judge outputs against human labels, including cases of prompt sensitivity tested across two prompt variants. This addition will allow readers to assess whether the reported increase in disagreement is driven by source content rather than judge artifacts.

Revision: yes
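
The agreement statistic named in the rebuttal can be sketched as follows. Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's label frequencies. The label strings, the 5-to-3 collapse mapping, and the toy label sequences are hypothetical, chosen only to show the 5-label and collapsed computations side by side.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 5-to-3 collapse (assumption; the page does not give the mapping).
COLLAPSE = {
    "full_agreement": "agree",
    "partial_agreement": "agree",
    "contradiction": "disagree",
    "placeholder_label_4": "other",
    "placeholder_label_5": "other",
}

# Invented toy labels for six answer pairs.
human = ["full_agreement", "partial_agreement", "contradiction",
         "contradiction", "full_agreement", "partial_agreement"]
judge = ["full_agreement", "full_agreement", "contradiction",
         "partial_agreement", "full_agreement", "partial_agreement"]

kappa_fine = cohens_kappa(human, judge)
kappa_coarse = cohens_kappa([COLLAPSE[x] for x in human],
                            [COLLAPSE[x] for x in judge])
```

Reporting both values shows how much of the judge-human disagreement comes from fine distinctions (full versus partial agreement) rather than from the agree/disagree boundary that drives the prevalence claim.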

Circularity Check

0 steps flagged

Empirical auditing framework with no derivation chain or self-referential reductions


The paper introduces TransplantQA, HERO-QA, and a structured-output judge as new empirical artifacts for auditing source-dependence in multi-source RAG. No equations, fitted parameters, predictions, or derivation steps appear in the abstract or described framework. Claims about disagreement prevalence rest on direct measurement from the released benchmark rather than any quantity defined in terms of itself or reduced by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are referenced in a way that would create circularity. This is a standard non-finding for an empirical auditing paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim depends on the domain fact that institutional medical sources contain measurable disagreements and that an automated judge can capture those relationships at scale.

axioms (2)

domain assumption: Institutional medical handbooks demonstrably disagree on answers to the same patient questions
Stated directly in the abstract as the motivating premise for the TransplantQA benchmark.
domain assumption: A 5-label taxonomy can be validated for scoring inter-source answer relationships
Abstract refers to a 'validated 5-label taxonomy' without further detail on validation process.

invented entities (3)

TransplantQA benchmark (no independent evidence)
purpose: Collection of real patient questions each grounded in multiple institutional handbooks
New artifact introduced to enable source-dependence auditing
HERO-QA hierarchical retrieval strategy (no independent evidence)
purpose: Grounds and audits each answer across sources
New retrieval method presented for the auditing task
Structured-output judge with 5-label taxonomy (no independent evidence)
purpose: Scores inter-source relationships automatically
New evaluation component introduced in the framework


Reference graph

Works this paper leans on

6 extracted references

[1]

Early access

The Faiss library. IEEE Transactions on Big Data. Early access. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92. Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What...

2021
[2]

In The Thirteenth International Conference on Learning Representations

JudgeLM: Fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations. 10 A Question sources and inclusion criteria The 1,115 released questions were drawn from an initial pool of 3,000+ candidates collected from four families of public, patient-facing sources. Table 4 reports the top-10 sourc...

2021
[3]

If the evidence answers the question, give the answer using only that evidence
[4]

If pages are unknown, cite the section heading only

Cite the supporting section heading, and page if provided. If pages are unknown, cite the section heading only
[5]

NOT ADDRESSED: This handbook does not contain information on this topic

If the evidence does not answer the question, respond exactly: "NOT ADDRESSED: This handbook does not contain information on this topic."
[6]

classification

Do not use outside medical knowledge. Do not fill gaps with general transplant advice. User: ## Handbook Context {context} ## Patient Question {question} Generation runs with greedy decoding (temperature 0), max_new_tokens=512, and <think>...</think> reasoning blocks stripped before the answer is persisted. E Judge prompt and output schema Our judge uses t...
