pith. machine review for the scientific record.

arxiv: 2605.14115 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 05:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords retrieval-augmented generation · biomedical question answering · evidence conflict · order effects · abstention · large language models · HealthContradict

The pith

Reversing the order of conflicting biomedical documents causes large language models to flip their answers in 11 to 25 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that retrieval-augmented LLMs in biomedicine are sensitive to the sequence of contradictory evidence documents even when the underlying facts stay the same. Across six models and controlled conditions that include both correct and incorrect documents, simply swapping their order lowers accuracy and changes between 11.4 and 25.2 percent of final predictions. The work also shows that adding an explicit evidence-conflict detector to the model's own confidence improves selective accuracy when the system is allowed to abstain on uncertain cases.

Core claim

Using the HealthContradict benchmark, the authors evaluate models under no-context, correct-only, incorrect-only, and two mixed conditions that contain the same pair of contradictory documents presented in opposite orders. In the order-contrast setting, accuracy falls for every model tested and 11.4 to 25.2 percent of predictions change solely because of document ordering. A conflict-aware abstention score that combines model confidence with an evidence-conflict detector raises selective accuracy over confidence-only baselines, with gains of 7.2 to 33.4 points in the incorrect-only condition and 3.6 to 14.4 points in the incorrect-first mixed condition at 75, 50, and 25 percent coverage.
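
A minimal sketch of the selective-accuracy metric behind those coverage numbers, in Python. The score array is a stand-in for whichever abstention signal is being evaluated (confidence-only or conflict-aware); the names are illustrative, not the authors' code.

    import numpy as np

    def selective_accuracy(score, correct, coverage):
        """Accuracy on the `coverage` fraction of questions the system is
        most willing to answer, i.e. those with the highest score."""
        score = np.asarray(score, dtype=float)
        correct = np.asarray(correct, dtype=bool)
        n_keep = max(1, int(round(coverage * len(score))))
        answered = np.argsort(-score)[:n_keep]  # answer the top-scoring cases
        return correct[answered].mean()

    # The reported gains compare two scoring rules at fixed coverage, e.g.:
    # for cov in (0.75, 0.50, 0.25):
    #     print(cov, selective_accuracy(cas, correct, cov)
    #               - selective_accuracy(conf, correct, cov))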

What carries the argument

The conflicting-evidence order contrast, which holds the two documents fixed and reverses only their presentation order to isolate sequence-induced prediction changes.
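
The statistic this contrast yields is easy to state in code. A minimal sketch, assuming one list of predictions per ordering of the same document pair; the names are illustrative.

    def flip_rate(preds_correct_first, preds_incorrect_first):
        """Fraction of questions whose predicted answer changes when the
        same two conflicting documents are presented in reversed order."""
        pairs = list(zip(preds_correct_first, preds_incorrect_first))
        return sum(a != b for a, b in pairs) / len(pairs)

    # The paper reports values of 0.114 to 0.252 across the six models.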

If this is right

  • Accuracy drops for every tested model when the same conflicting documents appear in reversed order.
  • Between 11.4 and 25.2 percent of model outputs change due to document ordering alone.
  • A conflict-aware abstention score raises selective accuracy by 3.6 to 33.4 points over confidence-only methods in the hardest conditions.
  • Conflicting evidence creates both an uncertainty problem and a robustness problem for biomedical retrieval-augmented systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Biomedical RAG pipelines may benefit from retrieval strategies that surface conflicting evidence early rather than ranking by relevance alone.
  • Order-invariant training or post-processing could reduce the observed prediction flips without requiring new data (a post-processing variant is sketched after this list).
  • Selective answering based on detected conflict could lower error rates in clinical decision-support tools where wrong answers carry high cost.
  • Benchmarks that only measure accuracy under helpful context will underestimate the reliability gap exposed by natural evidence disagreement.
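
One way the post-processing idea from the second bullet could look in practice, as an editorial sketch rather than anything proposed in the paper; ask_model is a hypothetical callable that takes a question and an ordered document list and returns an answer string.

    def order_stable_answer(ask_model, question, doc_a, doc_b,
                            abstain="ABSTAIN"):
        """Answer only when the prediction is invariant to document order;
        otherwise abstain rather than risk an order-induced flip."""
        ans_ab = ask_model(question, [doc_a, doc_b])
        ans_ba = ask_model(question, [doc_b, doc_a])
        return ans_ab if ans_ab == ans_ba else abstain

By construction this wrapper removes order-induced flips, trading them for abstentions on exactly the order-sensitive cases.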

Load-bearing premise

That the controlled conflicting-evidence conditions created in HealthContradict are representative of the contradictory or incomplete evidence that real-world biomedical RAG systems encounter.

What would settle it

Run the same six models on a fresh collection of real PubMed queries that naturally contain contradictory passages, present the passages in both orders, and measure whether prediction-flip rates remain in the 11 to 25 percent range.
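
As a harness, that protocol is a short loop, sketched below reusing flip_rate from the earlier sketch; ask_model and the triples of naturally contradictory passages are hypothetical stand-ins.

    def order_contrast_eval(ask_model, items):
        """items: (question, passage_a, passage_b) triples whose two
        passages naturally contradict. Returns the prediction-flip rate."""
        first = [ask_model(q, [a, b]) for q, a, b in items]
        second = [ask_model(q, [b, a]) for q, a, b in items]
        return flip_rate(first, second)

    # The claim generalizes if this stays roughly within 0.11 to 0.25.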

Figures

Figures reproduced from arXiv: 2605.14115 by Halil Kilicoglu, Mengfei Lan, Yikun Han.

Figure 1: HealthContradict example used to motivate the benchmark structure in the introduction. The displayed mixed-evidence prompt corresponds to the 'CIC' condition, where the question is paired with the correct retrieved document first and the contradictory retrieved document second.
Figure 2: Overview of the experimental framework. (A) Input example structure: each …
Figure 3: Held-out selective-accuracy gain of Conf over the no-abstention baseline in the two hardest evidence …
Figure 4: Held-out selective-accuracy lift of CAS over Conf in the two hardest evidence conditions. Cells report …
Figure 5: Mean held-out selective-accuracy lift of CAS over Conf under the train-threshold transfer protocol as the …
Figure 6: Mean held-out selective-accuracy lift of CAS over Conf under the train-threshold transfer protocol as the …
read the original abstract

Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%–25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2–33.4 points in incorrect-only ('IC') and 3.6–14.4 points in incorrect-first conflicting ('ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that retrieval-augmented LLMs for biomedical QA exhibit order sensitivity under conflicting evidence: using the HealthContradict dataset, six models show accuracy drops and 11.4–25.2% prediction flips when the same correct and incorrect documents are presented in reversed order. It further claims that a conflict-aware abstention score (model confidence plus evidence-conflict detector) yields selective-accuracy gains of 7.2–33.4 points (incorrect-only) and 3.6–14.4 points (incorrect-first) over confidence-only baselines at 75/50/25% coverage levels.

Significance. If the empirical results hold, the work is significant because it isolates order effects as a distinct robustness failure mode in biomedical RAG and supplies a concrete, observable abstention mechanism that improves reliability under conflict. The multi-model, multi-condition design and the explicit comparison of conflicting-order conditions provide a reusable evaluation template for high-stakes evidence-based QA.

major comments (3)
  1. [Dataset section] The manuscript supplies no construction details for HealthContradict (pair selection criteria, source of incorrect documents, length/specificity balancing, or validation against real retrieval failures such as partial facts or outdated citations). Because the central claim rests on these pairs being representative controlled contradictions, the observed order drops and flip rates could be artifacts of the synthetic setup rather than a general property of conflicting biomedical evidence.
  2. [Results section] The abstract reports accuracy drops and 11.4–25.2% flip rates across models but contains no statistical tests, confidence intervals, or per-model variance for the flip percentages. Without these, it is impossible to determine whether the order effect is reliably nonzero or whether the range is driven by a subset of models or questions.
  3. [Abstention method] The exact conflict detector (features, threshold, or model used) underlying the conflict-aware abstention score is not specified. This detail is load-bearing for the reported selective-accuracy gains (7.2–33.4 points in IC, 3.6–14.4 points in ICC), as the gains cannot be reproduced or compared to alternative conflict signals without it.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it stated the number of questions and documents in HealthContradict to contextualize the flip-rate percentages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the points raised identify areas where the manuscript can be strengthened through added details and statistical reporting. We address each major comment below and will incorporate the necessary revisions.

read point-by-point responses
  1. Referee: [Dataset section] The manuscript supplies no construction details for HealthContradict (pair selection criteria, source of incorrect documents, length/specificity balancing, or validation against real retrieval failures such as partial facts or outdated citations). Because the central claim rests on these pairs being representative controlled contradictions, the observed order drops and flip rates could be artifacts of the synthetic setup rather than a general property of conflicting biomedical evidence.

    Authors: We agree that expanded construction details for HealthContradict are required to substantiate the representativeness of the conflicting pairs. In the revised manuscript we will add a dedicated subsection describing the pair selection criteria (opposing claims drawn from PubMed abstracts on the same topic), sources of incorrect documents (simulating common retrieval errors such as outdated citations and partial fact matches), length and specificity balancing procedures, and validation steps including manual expert review and comparison against observed RAG failure modes in biomedical QA. Dataset statistics and representative examples will also be included. revision: yes

  2. Referee: [Results section] The abstract reports accuracy drops and 11.4–25.2% flip rates across models but contains no statistical tests, confidence intervals, or per-model variance for the flip percentages. Without these, it is impossible to determine whether the order effect is reliably nonzero or whether the range is driven by a subset of models or questions.

    Authors: We acknowledge the absence of statistical support for the reported flip rates and accuracy drops. In the revised Results section we will report per-model flip percentages with 95% bootstrap confidence intervals, per-question variance, and formal significance tests (McNemar's test for paired prediction flips and paired t-tests for accuracy differences across order conditions). These additions will demonstrate that the order effect is consistently nonzero across models rather than driven by outliers; a sketch of these tests appears after this list. revision: yes

  3. Referee: [Abstention method] The exact conflict detector (features, threshold, or model used) underlying the conflict-aware abstention score is not specified. This detail is load-bearing for the reported selective-accuracy gains (7.2–33.4 points in IC, 3.6–14.4 points in ICC), as the gains cannot be reproduced or compared to alternative conflict signals without it.

    Authors: We agree that the conflict detector implementation must be fully specified for reproducibility. In the revised manuscript we will describe the detector in detail, including the underlying model, input features (e.g., pairwise entailment and contradiction scores between evidence passages), threshold selection procedure (via cross-validation on held-out data), and the precise combination rule with model confidence (weighted sum). This will enable exact replication of the abstention scores and selective-accuracy results; a sketch of such a detector and combination rule appears after this list. revision: yes
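
For point 2, a minimal sketch of the promised tests, assuming per-question predictions under the 'CIC' and 'ICC' orderings plus gold answers; McNemar's test comes from statsmodels, and the bootstrap yields a 95% interval for the flip rate. Names and data layout are illustrative.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def order_effect_stats(preds_cic, preds_icc, gold, n_boot=10_000, seed=0):
        preds_cic, preds_icc, gold = map(np.asarray, (preds_cic, preds_icc, gold))
        a = preds_cic == gold                    # correct, correct-first order
        b = preds_icc == gold                    # correct, incorrect-first order
        table = [[np.sum(a & b), np.sum(a & ~b)],
                 [np.sum(~a & b), np.sum(~a & ~b)]]
        test = mcnemar(table, exact=True)        # paired accuracy difference
        flips = (preds_cic != preds_icc).astype(float)
        rng = np.random.default_rng(seed)
        boot = rng.choice(flips, size=(n_boot, flips.size), replace=True).mean(axis=1)
        lo, hi = np.percentile(boot, [2.5, 97.5])
        return test.pvalue, flips.mean(), (lo, hi)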
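
For point 3, a sketch of the kind of detector and combination rule described, assuming a caller-supplied NLI scorer that maps an ordered (premise, hypothesis) pair to a contradiction probability; the max aggregation and the default weight are assumptions, not the paper's specification.

    from itertools import permutations

    def conflict_score(nli_contradiction, passages):
        """Maximum pairwise contradiction probability over ordered passage
        pairs; nli_contradiction(premise, hypothesis) returns a value in [0, 1]."""
        return max(nli_contradiction(p, h)
                   for p, h in permutations(passages, 2))

    def conflict_aware_score(confidence, conflict, w=0.5):
        """Weighted sum: high confidence and low detected conflict favour
        answering; per the response, w would be tuned by cross-validation."""
        return w * confidence - (1.0 - w) * conflict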

Circularity Check

0 steps flagged

No circularity: empirical evaluation on constructed benchmark is self-contained

full rationale

The paper reports direct empirical measurements of accuracy drops (11.4–25.2% prediction flips) when the same correct/incorrect document pair is presented in reversed order within the HealthContradict conditions. These outcomes are computed from observable model generations on fixed inputs rather than from any fitted parameter renamed as a prediction. The conflict-aware abstention score is defined as a combination of model confidence and an independent evidence-conflict detector; neither component is derived from the target accuracy metric. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claims to the paper’s own inputs by construction. Self-citations, if present, are not load-bearing for the reported order-effect results. The evaluation therefore remains externally falsifiable against the benchmark and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on standard assumptions about LLM behavior under input variation and the representativeness of the constructed test conditions; no free parameters are fitted to produce the reported accuracy or abstention numbers.

axioms (2)
  • domain assumption LLM outputs are sensitive to the order of contradictory retrieved documents
    Invoked to explain the observed accuracy drops and flip rates in the order-contrast conditions
  • domain assumption A detector of evidence conflict can be combined with model confidence to improve abstention decisions
    Basis for the proposed conflict-aware abstention score
invented entities (1)
  • conflict-aware abstention score · no independent evidence
    purpose: Combine model confidence with an evidence-conflict detector to decide when to abstain
    Newly defined scoring method evaluated in the paper

pith-pipeline@v0.9.0 · 5530 in / 1342 out tokens · 43990 ms · 2026-05-15T05:02:16.584402+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1] Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.
  2. [2] Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems.
  3. [3] Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence.
  4. [4] Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics.
  5. [5] Benchmarking retrieval-augmented generation for medicine. Findings of the Association for Computational Linguistics: ACL 2024.
  6. [6] On calibration of modern neural networks. International Conference on Machine Learning, 2017.
  7. [7] Obtaining well calibrated probabilities using Bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence.
  8. [8] Selective classification for deep neural networks. Advances in Neural Information Processing Systems.
  9. [9] SelectiveNet: A deep neural network with an integrated reject option. International Conference on Machine Learning, 2019.
  10. [10] Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  11. [11] Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022.
  12. [12] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  13. [13] Llama-3-Meditron: An open-weight suite of medical LLMs based on Llama-3.1. Workshop on Large Language Models and Generative AI for Health at AAAI 2025.
  14. [14] Phi-4 technical report. arXiv preprint arXiv:2412.08905.
  15. [15] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
  16. [16] Qwen3.5-9B. 2026.
  17. [17] HealthContradict: Evaluating biomedical knowledge conflicts in language models. npj Digital Medicine.
  18. [18] Entity-based knowledge conflicts in question answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  19. [19] Large language models encode clinical knowledge. Nature, 2023.
  20. [20] Selective question answering under domain shift. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  21. [21] Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. International Conference on Learning Representations.
  22. [22] Heterogeneity in treatment effects across diverse populations. Pharmaceutical Statistics, 2021.
  23. [23] Mapping from meaning: Addressing the miscalibration of prompt-sensitive language models. Proceedings of the AAAI Conference on Artificial Intelligence.
  24. [24] Conflicting health information: a critical research need. Health Expectations, 2016.