Auditing medical multi-agent AI reveals risks of false consensus

Dehao Sui; Ewen Harrison; Haoran Sang; Junyi Gao; Lan Mi; Lei Gu; Lequan Yu; Liang Yao; Tianfan Fu; Wen Tang

arxiv: 2510.10185 · v2 · pith:ZSBJLE7Jnew · submitted 2025-10-11 · 💻 cs.CL · cs.AI · cs.MA

Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu

keywords medical, multi-agent, systems, across, cases, collaborative, consensus, evidence

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as non-interventional probes across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.
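
The per-failure percentages quoted in the abstract are prevalence rates: the share of audited cases in which a given failure mode was flagged. The sketch below shows one way such rates could be tallied. It is an illustration only, not the MedAgentAudit implementation: the `AuditedCase` record, the `prevalence` helper, the flag names, and the four toy cases are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AuditedCase:
    """One audited run. Hypothetical record, not the paper's log schema."""
    case_id: str
    # Failure-mode labels raised by an auditor for this case.
    # The label strings are illustrative paraphrases, not the paper's identifiers.
    flags: set[str] = field(default_factory=set)


def prevalence(cases: list[AuditedCase]) -> dict[str, float]:
    """Return, for each failure mode, the percentage of cases that carry it."""
    total = len(cases)
    if total == 0:
        return {}
    # Flags are stored in a set, so each case counts at most once per mode.
    counts = Counter(flag for case in cases for flag in case.flags)
    return {flag: 100.0 * n / total for flag, n in counts.items()}


# Toy data, made up for illustration; these are not results from the paper.
cases = [
    AuditedCase("c1", {"repeated_initial_view", "authority_bias"}),
    AuditedCase("c2", {"repeated_initial_view"}),
    AuditedCase("c3"),
    AuditedCase("c4", {"repeated_initial_view", "unsupported_observation"}),
]

rates = prevalence(cases)
for flag, pct in sorted(rates.items()):
    print(f"{flag}: {pct:.2f}%")
```

Because one case can carry several flags, the rates need not sum to 100%, which is consistent with the abstract reporting overlapping failure modes side by side.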

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.

Governed Reasoning for Institutional AI
cs.AI 2026-04 unverdicted novelty 5.0

Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior auth...