{"paper":{"title":"When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reversing the order of conflicting biomedical documents causes large language models to flip their answers in 11 to 25 percent of cases.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Halil Kilicoglu, Mengfei Lan, Yikun Han","submitted_at":"2026-05-13T21:02:24Z","abstract_excerpt":"Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%--25.2% of predictions flip.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the controlled conflicting-evidence conditions created in HealthContradict are representative of the contradictory or incomplete evidence that real-world biomedical RAG systems encounter.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Conflicting biomedical evidence triggers order-dependent prediction flips in RAG LLMs, and a new abstention score combining confidence with conflict detection raises selective accuracy by 7-33 points in the hardest conditions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reversing the order of conflicting biomedical documents causes large language models to flip their answers in 11 to 25 percent of cases.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"df91e9b58ded7e5d919b630fc2e28ebfbe684602d3028d960e7b82ff7e1a4ad2"},"source":{"id":"2605.14115","kind":"arxiv","version":1},"verdict":{"id":"9cf2c7a3-a599-4caf-9b43-a65ad4019155","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:02:16.584402Z","strongest_claim":"In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%--25.2% of predictions flip.","one_line_summary":"Conflicting biomedical evidence triggers order-dependent prediction flips in RAG LLMs, and a new abstention score combining confidence with conflict detection raises selective accuracy by 7-33 points in the hardest conditions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the controlled conflicting-evidence conditions created in HealthContradict are representative of the contradictory or incomplete evidence that real-world biomedical RAG systems encounter.","pith_extraction_headline":"Reversing the order of conflicting biomedical documents causes large language models to flip their answers in 11 to 25 percent of cases."},"references":{"count":24,"sample":[{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"27180506-faab-44b1-9295-6bf5eb791eed","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"9e679e73-1ab3-49cd-a880-6583acefe275","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Proceedings of the AAAI Conference on Artificial Intelligence , volume=","work_id":"801fb41c-c304-4f51-b1e2-93fd05ce9da7","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Transactions of the association for computational linguistics , volume=","work_id":"35c18954-8a0c-4935-b39c-5981194920ba","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Findings of the Association for Computational Linguistics: ACL 2024 , pages=","work_id":"8f78dd6e-2cb9-4fbd-8904-8bdce7e671ac","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":24,"snapshot_sha256":"d958cc98eeda90da503ba50e11706f8c6c5d8116f163e0603da3e56ba5715949","internal_anchors":4},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}