pith. machine review for the scientific record. sign in

arxiv: 2603.27820 · v2 · submitted 2026-03-29 · 💻 cs.CL

Recognition: no theorem link

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords clinical diagnosiscounterfactual reasoningmulti-agent systemslarge language modelsdiagnostic accuracyevidence verificationdifferential diagnosisinterpretability
0
0 comments X

The pith

Counterfactual edits to clinical findings let multi-agent LLMs test evidence support and raise diagnostic accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework where LLM agents perform counterfactual edits to clinical findings, such as removing or altering symptoms, to measure their impact on competing diagnoses. This is quantified through the Counterfactual Probability Gap, which tracks shifts in model confidence. These signals then inform multi-round discussions among specialist agents to challenge and refine diagnoses. The approach yields consistent accuracy gains over standard prompting and other multi-agent methods on three benchmarks with seven models, particularly in ambiguous cases. Human assessments confirm the outputs are more reliable and coherent.

Core claim

The authors establish that explicitly introducing counterfactual case editing to modify clinical findings and computing the resulting probability gaps allows multi-agent systems to verify evidence support for diagnoses, leading to improved accuracy and interpretability in LLM-based clinical reasoning.

What carries the argument

The Counterfactual Probability Gap, a metric that calculates how much a diagnosis's confidence changes when specific clinical findings are edited or removed, thereby identifying which evidence supports or refutes each hypothesis.

If this is right

  • Accuracy improves across three diagnostic benchmarks and seven LLMs compared to prompting and prior multi-agent baselines.
  • Gains are largest in complex and ambiguous diagnostic cases.
  • Reasoning trajectories become more interpretable through explicit evidence testing.
  • Human evaluations rate the outputs as more clinically useful, reliable, and coherent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be adapted to non-medical domains requiring hypothesis testing against alternatives, such as legal or scientific reasoning.
  • Integrating this with real-time patient data might further validate the counterfactual assumptions in practice.
  • It suggests a general strategy for reducing hallucination in LLM reasoning by enforcing evidence-grounded challenges.

Load-bearing premise

That counterfactual edits to clinical findings and the resulting probability gaps validly capture how individual pieces of evidence support or refute diagnoses in real clinical reasoning.

What would settle it

Apply the framework to clinical cases where counterfactual edits are constructed to contradict established medical knowledge and observe whether accuracy gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2603.27820 by Aniket Vashishtha, Gabriel Erion-Barner, Hao Peng, Hongyuan Mei, Simo Du, Xi Chen, Yue Guo, Zhiwen You.

Figure 1
Figure 1. Figure 1: Overview of the proposed counterfactual case editing-based multi-agent diagnostic frame [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average diagnostic accuracy of seven LLMs on three datasets, including [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average diagnostic accuracy of Llama-3.1-8B-Instruct for four diseases/specialties on three datasets, including (a) Disease-level accuracy on MIMIC, (b) Specialty-level accuracy on MedCaseReasoning, (c) Specialty-level accuracy on ER-Reason. Following Liu et al.[9], we categorize the test cases into specialties, and select the Top-4 most frequent specialties with relevant diagnoses from MedCaseReasoning an… view at source ↗
Figure 4
Figure 4. Figure 4: Multi-round discussion statistics and the impact of counterfactual case editing on diagnostic [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation study of Llama-3.1-8B-Instruct over different functional modules on Med￾CaseReasoning. The shaded area represents the 95% CI. (a) Diagnostic performance with various functional moduels added in our multi-agent diagnostic system. w/o: without; CF: counterfactual. (b) Diagnostic performance with various hyperparameters. DDx: differential diagnosis. lower error rates and higher completeness compared … view at source ↗
Figure 6
Figure 6. Figure 6: Human evaluation of clinical reasoning quality comparing our method with zero-shot CoT [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of diagnostic rationales generated by our multi-agent diagnostic framework using [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example of our counterfactual case editing approach. CF: counterfactual. DDx: differential [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance of Deepseek-R1 for the diagnosis of case presentation on three datasets. (a), Consensus rate achieved by the multi-round discussion process across datasets. (b), Average number of the discussion rounds required per case. (c), Specialist diagnosis-change rate across the three datasets. Error bars indicate the standard deviation across three random seeds. Bar graphs indicate the standard deviatio… view at source ↗
Figure 10
Figure 10. Figure 10: Distribution of the Top-10 most frequently assigned specialists by the triage agent for [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The user interface of annotation guideline and the case presentation. Each case is provided [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The user interface of side-by-side reasoning trace comparison. Users can highlight [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The user interface of error, safety, and completeness assessment. Criteria are evaluated [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The user interface of reasoning quality evaluation. Each category is evaluated on a 5-point [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: The user interface of clinical contribution evaluation. Each category is evaluated on a [PITH_FULL_IMAGE:figures/full_fig_p034_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: The user interface of annotation on reasoning traces trust assessment bias classification. [PITH_FULL_IMAGE:figures/full_fig_p035_16.png] view at source ↗
read the original abstract

Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a counterfactual multi-agent diagnostic framework for LLMs in clinical settings. It introduces counterfactual editing of clinical findings and defines a Counterfactual Probability Gap to quantify how individual findings support or refute competing diagnoses. These signals drive multi-round specialist agent discussions to challenge hypotheses and refine outputs. Experiments across three diagnostic benchmarks and seven LLMs report consistent accuracy gains over standard prompting and prior multi-agent baselines, with larger improvements in complex/ambiguous cases, plus positive human evaluations of reasoning quality.

Significance. If the gains are robust and attributable to the counterfactual mechanism rather than prompting depth, the work could meaningfully advance interpretable LLM-based clinical decision support by explicitly incorporating hypothesis-testing practices from medical training.

major comments (2)
  1. [Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.
  2. [Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.
minor comments (1)
  1. [Abstract] The abstract and method would be clearer if the three benchmarks and seven LLMs were named explicitly rather than left generic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation of the Counterfactual Probability Gap and the experimental rigor. We address each point below and commit to revisions that directly respond to the concerns while preserving the core contributions of the work.

read point-by-point responses
  1. Referee: [Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.

    Authors: We agree that explicit validation of the Counterfactual Probability Gap against clinician judgments would strengthen the central claim. The existing human evaluation focuses on overall reasoning quality, clinical usefulness, and coherence but does not isolate ratings for the gap metric itself. In revision we will add (1) a direct comparison of the gap to standard differential-diagnosis heuristics drawn from medical literature, (2) controls that test gap behavior on synthetic cases with known causal structure to check for LLM-prior artifacts, and (3) an expanded clinician rating task that specifically scores the evidential strength signals produced by the gap. These additions will be included without requiring an entirely new large-scale clinician study. revision: partial

  2. Referee: [Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.

    Authors: We acknowledge that the current manuscript lacks statistical tests, variance reporting, and targeted ablations. In the revised version we will (1) re-run all experiments across multiple random seeds and report means with standard deviations, (2) include paired statistical tests (e.g., McNemar or Wilcoxon) with p-values for all accuracy comparisons, (3) document exact baseline re-implementations including prompt templates and discussion-round counts, and (4) add an ablation that replaces the learned Counterfactual Probability Gap with either random values or a fixed-discussion-round baseline to isolate its contribution. These changes will make the evidence for the framework's gains transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces counterfactual case editing and defines the Counterfactual Probability Gap directly as a measure of confidence shifts under those edits, without any reduction to parameters fitted on the evaluation benchmarks or self-citation chains that force the result. Accuracy gains are presented as empirical outcomes across three diagnostic benchmarks and seven LLMs, independent of the framework's internal definitions. No equations, uniqueness theorems, or ansatzes are shown to collapse the claimed improvements back to the inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters or invented entities; the framework implicitly assumes that LLM probability outputs under edited inputs meaningfully reflect diagnostic support.

axioms (1)
  • domain assumption Counterfactual edits to clinical findings simulate real diagnostic hypothesis testing
    The method treats modified cases as valid probes of evidence strength without further justification in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1114 out tokens · 43501 ms · 2026-05-14T21:40:49.698575+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

  1. [1]

    National Academies Press, 2015

    John R Ball, Bryan T Miller, and Erin P Balogh.Improving diagnosis in health care. National Academies Press, 2015

  2. [2]

    Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? InJAMA Health Forum, volume 2, pages e212430–e212430

    Said A Ibrahim and Peter J Pronovost. Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? InJAMA Health Forum, volume 2, pages e212430–e212430. American Medical Association, 2021

  3. [3]

    John Wiley & Sons, 2024

    Harold C Sox, Michael C Higgins, Douglas K Owens, and Gillian Sanders Schmidler.Medical decision making. John Wiley & Sons, 2024

  4. [4]

    Evidence for anchoring bias during physician decision-making.JAMA internal medicine, 183(8):818–823, 2023

    Dan P Ly, Paul G Shekelle, and Zirui Song. Evidence for anchoring bias during physician decision-making.JAMA internal medicine, 183(8):818–823, 2023

  5. [5]

    Errors in clinical diagnosis: a narrative review.Journal of International Medical Research, 51(8):03000605231162798, 2023

    Zunaid Ismail Vally, Razia AG Khammissa, Gal Feller, Raoul Ballyram, Michaela Beetge, and Liviu Feller. Errors in clinical diagnosis: a narrative review.Journal of International Medical Research, 51(8):03000605231162798, 2023

  6. [6]

    Cognitive biases in diagnosis and decision making during anaesthesia and intensive care.BJA education, 21(11):420–425, 2021

    CS Webster, S Taylor, and JM Weller. Cognitive biases in diagnosis and decision making during anaesthesia and intensive care.BJA education, 21(11):420–425, 2021

  7. [7]

    Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

    Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

  8. [8]

    Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports.arXiv preprint arXiv:2505.11733, 2025

    Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou. Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports.arXiv preprint arXiv:2505.11733, 2025

  9. [9]

    A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

    Xiaohong Liu, Hao Liu, Guoxing Yang, Zeyu Jiang, Shuguang Cui, Zhaoze Zhang, Huan Wang, Liyuan Tao, Yongchang Sun, Zhu Song, et al. A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

  10. [10]

    Automated lay language summariza- tion of biomedical scientific reviews

    Yue Guo, Wei Qiu, Yizhong Wang, and Trevor Cohen. Automated lay language summariza- tion of biomedical scientific reviews. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 160–168, 2021. 18

  11. [11]

    Uiuc_bionlp at biolaysumm: an extract-then-summarize approach augmented with wikipedia knowledge for biomedical lay summarization

    Zhiwen You, Shruthan Radhakrishna, Shufan Ming, and Halil Kilicoglu. Uiuc_bionlp at biolaysumm: an extract-then-summarize approach augmented with wikipedia knowledge for biomedical lay summarization. InProceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 132–143, 2024

  12. [12]

    Pyhealth: A deep learning toolkit for healthcare predictive modeling

    Chaoqi Yang, Zhenbang Wu, Patrick Jiang, Zhen Lin, Junyi Gao, Benjamin Danek, Jimeng Sun, et al. Pyhealth: A deep learning toolkit for healthcare predictive modeling. InProceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), volume 2023, 2023

  13. [13]

    Reasoning-enhanced healthcare predictions with knowledge graph community retrieval

    Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. InThe Thirteenth International Conference on Learning Representations

  14. [14]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  15. [15]

    Medagents: Large language models as collaborators for zero-shot medical reasoning

    Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InFindings of the Association for Computational Linguistics: ACL 2024, pages 599–621, 2024

  16. [16]

    Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  17. [17]

    Tree-of-reasoning: Towards complex medical diagnosis via multi-agent reasoning with evidence tree

    Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, and Qing Li. Tree-of-reasoning: Towards complex medical diagnosis via multi-agent reasoning with evidence tree. InProceedings of the 33rd ACM International Conference on Multimedia, pages 1744–1753, 2025

  18. [18]

    Evalu- ation and mitigation of the limitations of large language models in clinical decision-making

    Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evalu- ation and mitigation of the limitations of large language models in clinical decision-making. Nature medicine, 30(9):2613–2622, 2024

  19. [19]

    ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

    Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F Molina, and Ahmed Alaa. Er-reason: A benchmark dataset for llm-based clinical reasoning in the emergency room.arXiv preprint arXiv:2505.22919, 2025

  20. [20]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

  21. [21]

    Medagentboard: Benchmarking multi-agent collaboration with conventional methods for diverse medical tasks.arXiv preprint arXiv:2505.12371, 2025

    Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, and Lequan Yu. Medagentboard: Benchmarking multi-agent collaboration with conventional methods for diverse medical tasks.arXiv preprint arXiv:2505.12371, 2025

  22. [22]

    Llama 3.1: Open foundation and instruction-tuned models, 2024

    Meta. Llama 3.1: Open foundation and instruction-tuned models, 2024

  23. [23]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  24. [24]

    m1: Unleash the potential of test-time scaling for medical reasoning with large language models.arXiv preprint arXiv:2504.00869.2025

    Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. m1: Unleash the potential of test-time scaling for medical reasoning with large language models.arXiv preprint arXiv:2504.00869, 2025

  25. [25]

    Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993

    Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025. 19

  26. [26]

    Verity Schaye, Louis Miller, David Kudlowitz, Jonathan Chun, Jesse Burk-Rafel, Patrick Cocks, Benedict Guzman, Yindalon Aphinyanaphongs, and Marina Marin. Development of a clinical reasoning documentation assessment tool for resident and fellow admission notes: a shared mental model for feedback.Journal of General Internal Medicine, 37(3):507–512, 2022

  27. [27]

    Reverse thinking makes llms stronger reasoners

    Justin Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. Reverse thinking makes llms stronger reasoners. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lo...

  28. [28]

    Improving the accuracy of medical diagnosis with causal machine learning.Nature communications, 11(1):3923, 2020

    Jonathan G Richens, Ciarán M Lee, and Saurabh Johri. Improving the accuracy of medical diagnosis with causal machine learning.Nature communications, 11(1):3923, 2020

  29. [29]

    Aura: A multi-modal medical agent for understanding, reasoning and annotation

    Nima Fathi, Amar Kumar, and Tal Arbel. Aura: A multi-modal medical agent for understanding, reasoning and annotation. InInternational Workshop on Agentic AI for Medicine, pages 105–114. Springer, 2025

  30. [30]

    Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

    Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, et al. Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

  31. [31]

    Propaga- tion and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks

    Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. Propaga- tion and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks. InFindings of the Association for Computational Linguistics ACL 2024, pages 12503–12525, 2024

  32. [32]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

  33. [33]

    Mimic-iv.PhysioNet

    Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. Mimic-iv.PhysioNet. Available online at: https://physionet. org/content/mimi- civ/1.0/(accessed August 23, 2021), pages 49–55, 2020

  34. [34]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  35. [35]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  36. [36]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

  37. [37]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  38. [38]

    SciPrompt: Knowledge-augmented prompting for fine-grained categorization of scientific topics

    Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludaescher, and Jana Diesner. SciPrompt: Knowledge-augmented prompting for fine-grained categorization of scientific topics. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6087–6104, Miami, Florida, USA,...

  39. [39]

    Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning.arXiv preprint arXiv:2502.19655, 2025

    Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning.arXiv preprint arXiv:2502.19655, 2025

  40. [40]

    Plainqafact: Retrieval-augmented factual consistency evaluation metric for biomedical plain language summarization.Journal of Biomedical Informatics, page 105019, 2026

    Zhiwen You and Yue Guo. Plainqafact: Retrieval-augmented factual consistency evaluation metric for biomedical plain language summarization.Journal of Biomedical Informatics, page 105019, 2026. 20

  41. [41]

    thinking

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 21 A Experiment Details A.1 Datasets We evaluate our proposed diagnostic system on three datasets: MIMIC-CDM-F...

  42. [42]

    smooth out

    Source Fidelity - Extract facts only from the supplied case presentation. a. Do NOT invent, embellish, or “smooth out” missing data. b. Paraphrase narrative prose into concise bullets where helpful, but never add new facts

  43. [43]

    Use the XML Tags Exactly as Shown <case_prompt>...</case_prompt> - the information given to students be- fore they generate a differential

  44. [44]

    Present only the facts known before a working differential was made: chief complaint, HPI, vitals, physical exam, and early investigations

    What Goes Inside<case_prompt>: a. Present only the facts known before a working differential was made: chief complaint, HPI, vitals, physical exam, and early investigations. b. Do not include references to Figure 1, Table 1, etc. directly. Summarize any imaging findings from what is given in the text. c. Present the case in the order presented in the orig...

  45. [45]

    Detect the patient’s main symptom(s) and list the key problems from the case presentation

  46. [46]

    num_agents

    Based on the patient features/symptoms and clinical notes, assign UP TO FIVE most relevant specialists from the specialist list below. Based on the case, determine what kind of experts will you recruit to better make an accurate answer. Allowed specialists: {specialist_pool} Constraints: - “num_agents” must equal the number of objects in “assigned_special...

  47. [47]

    **COPY-PASTE FIRST**: Start by copying the ENTIRE original case text verbatim

  48. [48]

    **MINIMAL EDITS ONLY**: Then modify ONLY the specific spans listed in target_evidence_group

  49. [49]

    **PRESERVE EVERYTHING ELSE**: Do NOT change, remove, summarize, or rephrase any other part of the case

  50. [50]

    **SAME LENGTH**: The edited_case MUST be approximately the same length as the original (+/−5%)

  51. [51]

    normal results

    **NO SUMMARIZATION**: NEVER replace detailed information with summaries like “normal results” or “unremarkable” Operations: - negate: Change positive finding to negative (e.g., “fever present” -> “no fever”) - remove: Delete the specific span (keep surrounding context) - replace: Substitute with different but plausible value - weaken: Reduce severity (e.g...

  52. [52]

    **Clinical reasoning quality**: Which diagnosis has the strongest evidence-based reasoning from the case presentation?

  53. [53]

    **Counterfactual hypothesis testing evidence**: Which diagnosis shows the strongest counterfactual evidence (high CPG scores, consistent hypothesis testing)?

  54. [54]

    **Specialist consensus on critical features**: Which diagnosis has the most specialists identifying the same critical features?

  55. [55]

    **Initial symptom explanation**: Which diagnosis best explains WHY the initial symptoms occurred?

  56. [56]

    **Timeline consistency**: Which diagnosis fits the temporal evolution of the case?

  57. [57]

    CRITICAL V ALIDATION:

    **Diagnostic criteria matching**: Which diagnosis best matches established clinical criteria for that condition? WARNING: If specialists converged on a high-probability diagnosis but counterfactual hypothesis testing shows weak evidence, consider alternative diagnoses even if they have lower probability. CRITICAL V ALIDATION:

  58. [58]

    If your rationale supports diagnosis A, you MUST choose diagnosis A as final_diagnosis

    Your final_diagnosis MUST be consistent with your rationale. If your rationale supports diagnosis A, you MUST choose diagnosis A as final_diagnosis

  59. [59]

    had_consensus

    **You MUST select your final_diagnosis from the top 3 differential diagnoses provided. You cannot propose a diagnosis outside this list.** Output JSON only with this schema: { “had_consensus”: true/false, “final_diagnosis”: "final chosen diagnosis label (must select from the top 3 differential diagnoses)", “winner_role”: “Role name whose reasoning most st...