arxiv: 2603.27820 · v2 · submitted 2026-03-29 · 💻 cs.CL

Recognition: no theorem link

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

Zhiwen You , Xi Chen , Aniket Vashishtha , Simo Du , Gabriel Erion-Barner , Hongyuan Mei , Hao Peng , Yue Guo

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords clinical diagnosiscounterfactual reasoningmulti-agent systemslarge language modelsdiagnostic accuracyevidence verificationdifferential diagnosisinterpretability

0 comments

The pith

Counterfactual edits to clinical findings let multi-agent LLMs test evidence support and raise diagnostic accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework where LLM agents perform counterfactual edits to clinical findings, such as removing or altering symptoms, to measure their impact on competing diagnoses. This is quantified through the Counterfactual Probability Gap, which tracks shifts in model confidence. These signals then inform multi-round discussions among specialist agents to challenge and refine diagnoses. The approach yields consistent accuracy gains over standard prompting and other multi-agent methods on three benchmarks with seven models, particularly in ambiguous cases. Human assessments confirm the outputs are more reliable and coherent.

Core claim

The authors establish that explicitly introducing counterfactual case editing to modify clinical findings and computing the resulting probability gaps allows multi-agent systems to verify evidence support for diagnoses, leading to improved accuracy and interpretability in LLM-based clinical reasoning.

What carries the argument

The Counterfactual Probability Gap, a metric that calculates how much a diagnosis's confidence changes when specific clinical findings are edited or removed, thereby identifying which evidence supports or refutes each hypothesis.

If this is right

Accuracy improves across three diagnostic benchmarks and seven LLMs compared to prompting and prior multi-agent baselines.
Gains are largest in complex and ambiguous diagnostic cases.
Reasoning trajectories become more interpretable through explicit evidence testing.
Human evaluations rate the outputs as more clinically useful, reliable, and coherent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be adapted to non-medical domains requiring hypothesis testing against alternatives, such as legal or scientific reasoning.
Integrating this with real-time patient data might further validate the counterfactual assumptions in practice.
It suggests a general strategy for reducing hallucination in LLM reasoning by enforcing evidence-grounded challenges.

Load-bearing premise

That counterfactual edits to clinical findings and the resulting probability gaps validly capture how individual pieces of evidence support or refute diagnoses in real clinical reasoning.

What would settle it

Apply the framework to clinical cases where counterfactual edits are constructed to contradict established medical knowledge and observe whether accuracy gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2603.27820 by Aniket Vashishtha, Gabriel Erion-Barner, Hao Peng, Hongyuan Mei, Simo Du, Xi Chen, Yue Guo, Zhiwen You.

**Figure 2.** Figure 2: Average diagnostic accuracy of seven LLMs on three datasets, including [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Average diagnostic accuracy of Llama-3.1-8B-Instruct for four diseases/specialties on three datasets, including (a) Disease-level accuracy on MIMIC, (b) Specialty-level accuracy on MedCaseReasoning, (c) Specialty-level accuracy on ER-Reason. Following Liu et al.[9], we categorize the test cases into specialties, and select the Top-4 most frequent specialties with relevant diagnoses from MedCaseReasoning an… view at source ↗

**Figure 4.** Figure 4: Multi-round discussion statistics and the impact of counterfactual case editing on diagnostic [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study of Llama-3.1-8B-Instruct over different functional modules on MedCaseReasoning. The shaded area represents the 95% CI. (a) Diagnostic performance with various functional moduels added in our multi-agent diagnostic system. w/o: without; CF: counterfactual. (b) Diagnostic performance with various hyperparameters. DDx: differential diagnosis. lower error rates and higher completeness compared … view at source ↗

**Figure 6.** Figure 6: Human evaluation of clinical reasoning quality comparing our method with zero-shot CoT [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Example of diagnostic rationales generated by our multi-agent diagnostic framework using [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Example of our counterfactual case editing approach. CF: counterfactual. DDx: differential [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Performance of Deepseek-R1 for the diagnosis of case presentation on three datasets. (a), Consensus rate achieved by the multi-round discussion process across datasets. (b), Average number of the discussion rounds required per case. (c), Specialist diagnosis-change rate across the three datasets. Error bars indicate the standard deviation across three random seeds. Bar graphs indicate the standard deviatio… view at source ↗

**Figure 10.** Figure 10: Distribution of the Top-10 most frequently assigned specialists by the triage agent for [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: The user interface of annotation guideline and the case presentation. Each case is provided [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: The user interface of side-by-side reasoning trace comparison. Users can highlight [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: The user interface of error, safety, and completeness assessment. Criteria are evaluated [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 14.** Figure 14: The user interface of reasoning quality evaluation. Each category is evaluated on a 5-point [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗

**Figure 15.** Figure 15: The user interface of clinical contribution evaluation. Each category is evaluated on a [PITH_FULL_IMAGE:figures/full_fig_p034_15.png] view at source ↗

**Figure 16.** Figure 16: The user interface of annotation on reasoning traces trust assessment bias classification. [PITH_FULL_IMAGE:figures/full_fig_p035_16.png] view at source ↗

read the original abstract

Clinical diagnosis is a complex reasoning process in which clinicians gather evidence, form hypotheses, and test them against alternative explanations. In medical training, this reasoning is explicitly developed through counterfactual questioning--e.g., asking how a diagnosis would change if a key symptom were absent or altered--to strengthen differential diagnosis skills. As large language model (LLM)-based systems are increasingly used for diagnostic support, ensuring the interpretability of their recommendations becomes critical. However, most existing LLM-based diagnostic agents reason over fixed clinical evidence without explicitly testing how individual findings support or weaken competing diagnoses. In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded. Our framework introduces counterfactual case editing to modify clinical findings and evaluate how these changes affect competing diagnoses. We further define the Counterfactual Probability Gap, a method that quantifies how strongly individual findings support a diagnosis by measuring confidence shifts under these edits. These counterfactual signals guide multi-round specialist discussions, enabling agents to challenge unsupported hypotheses, refine differential diagnoses, and produce more interpretable reasoning trajectories. Across three diagnostic benchmarks and seven LLMs, our method consistently improves diagnostic accuracy over prompting and prior multi-agent baselines, with the largest gains observed in complex and ambiguous cases. Human evaluation further indicates that our framework produces more clinically useful, reliable, and coherent reasoning. These results suggest that incorporating counterfactual evidence verification is an important step toward building reliable AI systems for clinical decision support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds explicit counterfactual case editing and a probability gap metric to steer multi-agent LLM diagnosis, which is a clean extension of clinical training ideas, but the claimed accuracy lifts rest on unverified assumptions about what those edits actually measure.

read the letter

The main thing to know is that this work takes the standard multi-agent setup for medical diagnosis and adds a loop where agents edit clinical findings, measure how much the model's confidence shifts, and use that gap to challenge or keep hypotheses. The abstract reports consistent gains over plain prompting and earlier multi-agent baselines across three benchmarks and seven models, with bigger lifts on ambiguous cases, plus better human ratings on reasoning quality. That framing is straightforward and the metric itself is defined directly from the edits rather than fitted post hoc, so the core mechanism is easy to follow and reproduce in principle. The clinical-training analogy gives the method a clear motivation that prior prompting work often lacks. If the numbers survive proper controls, it supplies a practical way to make LLM outputs more traceable in a high-stakes setting. The soft spot is exactly the one the stress-test note flags: the edits are generated by the same LLMs, and the probability shifts are taken as evidence of real evidential support without any expert review of whether the edited cases stay clinically realistic. In complex cases the gains are largest, which is also where model artifacts would be hardest to spot. The abstract gives no details on edit generation rules, statistical tests, or prompt-variation controls, so it is still possible the improvements come from extra rounds of discussion rather than the counterfactual signal. This is aimed at groups already running LLM agents on diagnostic tasks who want a lightweight way to add interpretability. Readers working on medical AI or multi-agent reasoning will find the method description useful even if they end up replacing the gap metric with something externally validated. I would send it to peer review because the idea is coherent, the benchmarks are standard, and the experiments are at least described at a level that referees can evaluate directly; the main questions will be about grounding and robustness rather than whether the work is worth looking at.

Referee Report

2 major / 1 minor

Summary. The paper proposes a counterfactual multi-agent diagnostic framework for LLMs in clinical settings. It introduces counterfactual editing of clinical findings and defines a Counterfactual Probability Gap to quantify how individual findings support or refute competing diagnoses. These signals drive multi-round specialist agent discussions to challenge hypotheses and refine outputs. Experiments across three diagnostic benchmarks and seven LLMs report consistent accuracy gains over standard prompting and prior multi-agent baselines, with larger improvements in complex/ambiguous cases, plus positive human evaluations of reasoning quality.

Significance. If the gains are robust and attributable to the counterfactual mechanism rather than prompting depth, the work could meaningfully advance interpretable LLM-based clinical decision support by explicitly incorporating hypothesis-testing practices from medical training.

major comments (2)

[Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.
[Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.

minor comments (1)

[Abstract] The abstract and method would be clearer if the three benchmarks and seven LLMs were named explicitly rather than left generic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the validation of the Counterfactual Probability Gap and the experimental rigor. We address each point below and commit to revisions that directly respond to the concerns while preserving the core contributions of the work.

read point-by-point responses

Referee: [Method (Counterfactual Probability Gap and multi-agent discussion)] The central claim that the Counterfactual Probability Gap validly quantifies evidential strength (and thereby improves diagnosis) rests on the untested assumption that LLM confidence shifts under counterfactual edits track real clinical causality. No section provides expert clinician validation, comparison to established differential-diagnosis metrics, or controls showing that the gaps are not artifacts of LLM priors; this directly affects whether the multi-agent guidance produces genuine refinement or merely deeper prompting.

Authors: We agree that explicit validation of the Counterfactual Probability Gap against clinician judgments would strengthen the central claim. The existing human evaluation focuses on overall reasoning quality, clinical usefulness, and coherence but does not isolate ratings for the gap metric itself. In revision we will add (1) a direct comparison of the gap to standard differential-diagnosis heuristics drawn from medical literature, (2) controls that test gap behavior on synthetic cases with known causal structure to check for LLM-prior artifacts, and (3) an expanded clinician rating task that specifically scores the evidential strength signals produced by the gap. These additions will be included without requiring an entirely new large-scale clinician study. revision: partial
Referee: [Experiments and Results] The experimental results section asserts consistent improvements across three benchmarks and seven LLMs but supplies no statistical tests, variance estimates, exact baseline re-implementations, or ablation isolating the contribution of the probability gap versus increased discussion rounds. Without these, it is impossible to determine whether the reported gains (especially the largest ones in ambiguous cases) are load-bearing evidence for the framework.

Authors: We acknowledge that the current manuscript lacks statistical tests, variance reporting, and targeted ablations. In the revised version we will (1) re-run all experiments across multiple random seeds and report means with standard deviations, (2) include paired statistical tests (e.g., McNemar or Wilcoxon) with p-values for all accuracy comparisons, (3) document exact baseline re-implementations including prompt templates and discussion-round counts, and (4) add an ablation that replaces the learned Counterfactual Probability Gap with either random values or a fixed-discussion-round baseline to isolate its contribution. These changes will make the evidence for the framework's gains transparent and reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces counterfactual case editing and defines the Counterfactual Probability Gap directly as a measure of confidence shifts under those edits, without any reduction to parameters fitted on the evaluation benchmarks or self-citation chains that force the result. Accuracy gains are presented as empirical outcomes across three diagnostic benchmarks and seven LLMs, independent of the framework's internal definitions. No equations, uniqueness theorems, or ansatzes are shown to collapse the claimed improvements back to the inputs by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters or invented entities; the framework implicitly assumes that LLM probability outputs under edited inputs meaningfully reflect diagnostic support.

axioms (1)

domain assumption Counterfactual edits to clinical findings simulate real diagnostic hypothesis testing
The method treats modified cases as valid probes of evidence strength without further justification in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1114 out tokens · 43501 ms · 2026-05-14T21:40:49.698575+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

[1]

National Academies Press, 2015

John R Ball, Bryan T Miller, and Erin P Balogh.Improving diagnosis in health care. National Academies Press, 2015

work page 2015
[2]

Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? InJAMA Health Forum, volume 2, pages e212430–e212430

Said A Ibrahim and Peter J Pronovost. Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? InJAMA Health Forum, volume 2, pages e212430–e212430. American Medical Association, 2021

work page 2021
[3]

John Wiley & Sons, 2024

Harold C Sox, Michael C Higgins, Douglas K Owens, and Gillian Sanders Schmidler.Medical decision making. John Wiley & Sons, 2024

work page 2024
[4]

Evidence for anchoring bias during physician decision-making.JAMA internal medicine, 183(8):818–823, 2023

Dan P Ly, Paul G Shekelle, and Zirui Song. Evidence for anchoring bias during physician decision-making.JAMA internal medicine, 183(8):818–823, 2023

work page 2023
[5]

Errors in clinical diagnosis: a narrative review.Journal of International Medical Research, 51(8):03000605231162798, 2023

Zunaid Ismail Vally, Razia AG Khammissa, Gal Feller, Raoul Ballyram, Michaela Beetge, and Liviu Feller. Errors in clinical diagnosis: a narrative review.Journal of International Medical Research, 51(8):03000605231162798, 2023

work page 2023
[6]

Cognitive biases in diagnosis and decision making during anaesthesia and intensive care.BJA education, 21(11):420–425, 2021

CS Webster, S Taylor, and JM Weller. Cognitive biases in diagnosis and decision making during anaesthesia and intensive care.BJA education, 21(11):420–425, 2021

work page 2021
[7]

Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

work page 2025
[8]

Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports.arXiv preprint arXiv:2505.11733, 2025

Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J Tao, Min Woo Sun, Alejandro Lozano, and James Zou. Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports.arXiv preprint arXiv:2505.11733, 2025

work page arXiv 2025
[9]

A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

Xiaohong Liu, Hao Liu, Guoxing Yang, Zeyu Jiang, Shuguang Cui, Zhaoze Zhang, Huan Wang, Liyuan Tao, Yongchang Sun, Zhu Song, et al. A generalist medical language model for disease diagnosis assistance.Nature medicine, 31(3):932–942, 2025

work page 2025
[10]

Automated lay language summariza- tion of biomedical scientific reviews

Yue Guo, Wei Qiu, Yizhong Wang, and Trevor Cohen. Automated lay language summariza- tion of biomedical scientific reviews. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 160–168, 2021. 18

work page 2021
[11]

Uiuc_bionlp at biolaysumm: an extract-then-summarize approach augmented with wikipedia knowledge for biomedical lay summarization

Zhiwen You, Shruthan Radhakrishna, Shufan Ming, and Halil Kilicoglu. Uiuc_bionlp at biolaysumm: an extract-then-summarize approach augmented with wikipedia knowledge for biomedical lay summarization. InProceedings of the 23rd Workshop on Biomedical Natural Language Processing, pages 132–143, 2024

work page 2024
[12]

Pyhealth: A deep learning toolkit for healthcare predictive modeling

Chaoqi Yang, Zhenbang Wu, Patrick Jiang, Zhen Lin, Junyi Gao, Benjamin Danek, Jimeng Sun, et al. Pyhealth: A deep learning toolkit for healthcare predictive modeling. InProceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), volume 2023, 2023

work page 2023
[13]

Reasoning-enhanced healthcare predictions with knowledge graph community retrieval

Pengcheng Jiang, Cao Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, and Jiawei Han. Reasoning-enhanced healthcare predictions with knowledge graph community retrieval. InThe Thirteenth International Conference on Learning Representations

work page
[14]

Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

work page 2022
[15]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. InFindings of the Association for Computational Linguistics: ACL 2024, pages 599–621, 2024

work page 2024
[16]

Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

work page 2024
[17]

Tree-of-reasoning: Towards complex medical diagnosis via multi-agent reasoning with evidence tree

Qi Peng, Jialin Cui, Jiayuan Xie, Yi Cai, and Qing Li. Tree-of-reasoning: Towards complex medical diagnosis via multi-agent reasoning with evidence tree. InProceedings of the 33rd ACM International Conference on Multimedia, pages 1744–1753, 2025

work page 2025
[18]

Evalu- ation and mitigation of the limitations of large language models in clinical decision-making

Paul Hager, Friederike Jungmann, Robbie Holland, Kunal Bhagat, Inga Hubrecht, Manuel Knauer, Jakob Vielhauer, Marcus Makowski, Rickmer Braren, Georgios Kaissis, et al. Evalu- ation and mitigation of the limitations of large language models in clinical decision-making. Nature medicine, 30(9):2613–2622, 2024

work page 2024
[19]

ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F Molina, and Ahmed Alaa. Er-reason: A benchmark dataset for llm-based clinical reasoning in the emergency room.arXiv preprint arXiv:2505.22919, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

work page
[21]

Medagentboard: Benchmarking multi-agent collaboration with conventional methods for diverse medical tasks.arXiv preprint arXiv:2505.12371, 2025

Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, and Lequan Yu. Medagentboard: Benchmarking multi-agent collaboration with conventional methods for diverse medical tasks.arXiv preprint arXiv:2505.12371, 2025

work page arXiv 2025
[22]

Llama 3.1: Open foundation and instruction-tuned models, 2024

Meta. Llama 3.1: Open foundation and instruction-tuned models, 2024

work page 2024
[23]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

m1: Unleash the potential of test-time scaling for medical reasoning with large language models.arXiv preprint arXiv:2504.00869.2025

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. m1: Unleash the potential of test-time scaling for medical reasoning with large language models.arXiv preprint arXiv:2504.00869, 2025

work page arXiv 2025
[25]

Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993

Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025. 19

work page arXiv 2025
[26]

Verity Schaye, Louis Miller, David Kudlowitz, Jonathan Chun, Jesse Burk-Rafel, Patrick Cocks, Benedict Guzman, Yindalon Aphinyanaphongs, and Marina Marin. Development of a clinical reasoning documentation assessment tool for resident and fellow admission notes: a shared mental model for feedback.Journal of General Internal Medicine, 37(3):507–512, 2022

work page 2022
[27]

Reverse thinking makes llms stronger reasoners

Justin Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. Reverse thinking makes llms stronger reasoners. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Lo...

work page 2025
[28]

Improving the accuracy of medical diagnosis with causal machine learning.Nature communications, 11(1):3923, 2020

Jonathan G Richens, Ciarán M Lee, and Saurabh Johri. Improving the accuracy of medical diagnosis with causal machine learning.Nature communications, 11(1):3923, 2020

work page 2020
[29]

Aura: A multi-modal medical agent for understanding, reasoning and annotation

Nima Fathi, Amar Kumar, and Tal Arbel. Aura: A multi-modal medical agent for understanding, reasoning and annotation. InInternational Workshop on Agentic AI for Medicine, pages 105–114. Springer, 2025

work page 2025
[30]

Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

Shuang Zhou, Zidu Xu, Mian Zhang, Chunpu Xu, Yawen Guo, Zaifu Zhan, Yi Fang, Sirui Ding, Jiashuo Wang, Kaishuai Xu, et al. Large language models for disease diagnosis: A scoping review.npj Artificial Intelligence, 1(1):9, 2025

work page 2025
[31]

Propaga- tion and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks

Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. Propaga- tion and pitfalls: Reasoning-based assessment of knowledge editing through counterfactual tasks. InFindings of the Association for Computational Linguistics ACL 2024, pages 12503–12525, 2024

work page 2024
[32]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

work page 2019
[33]

Mimic-iv.PhysioNet

Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. Mimic-iv.PhysioNet. Available online at: https://physionet. org/content/mimi- civ/1.0/(accessed August 23, 2021), pages 49–55, 2020

work page 2021
[34]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[35]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024

work page 2024
[36]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health.arXiv preprint arXiv:2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[38]

SciPrompt: Knowledge-augmented prompting for fine-grained categorization of scientific topics

Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludaescher, and Jana Diesner. SciPrompt: Knowledge-augmented prompting for fine-grained categorization of scientific topics. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6087–6104, Miami, Florida, USA,...

work page 2024
[39]

Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning.arXiv preprint arXiv:2502.19655, 2025

Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. Med-rlvr: Emerging medical reasoning from a 3b base model via reinforcement learning.arXiv preprint arXiv:2502.19655, 2025

work page arXiv 2025
[40]

Plainqafact: Retrieval-augmented factual consistency evaluation metric for biomedical plain language summarization.Journal of Biomedical Informatics, page 105019, 2026

Zhiwen You and Yue Guo. Plainqafact: Retrieval-augmented factual consistency evaluation metric for biomedical plain language summarization.Journal of Biomedical Informatics, page 105019, 2026. 20

work page 2026
[41]

thinking

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 21 A Experiment Details A.1 Datasets We evaluate our proposed diagnostic system on three datasets: MIMIC-CDM-F...

work page 2022
[42]

smooth out

Source Fidelity - Extract facts only from the supplied case presentation. a. Do NOT invent, embellish, or “smooth out” missing data. b. Paraphrase narrative prose into concise bullets where helpful, but never add new facts

work page
[43]

Use the XML Tags Exactly as Shown <case_prompt>...</case_prompt> - the information given to students be- fore they generate a differential

work page
[44]

Present only the facts known before a working differential was made: chief complaint, HPI, vitals, physical exam, and early investigations

What Goes Inside<case_prompt>: a. Present only the facts known before a working differential was made: chief complaint, HPI, vitals, physical exam, and early investigations. b. Do not include references to Figure 1, Table 1, etc. directly. Summarize any imaging findings from what is given in the text. c. Present the case in the order presented in the orig...

work page
[45]

Detect the patient’s main symptom(s) and list the key problems from the case presentation

work page
[46]

num_agents

Based on the patient features/symptoms and clinical notes, assign UP TO FIVE most relevant specialists from the specialist list below. Based on the case, determine what kind of experts will you recruit to better make an accurate answer. Allowed specialists: {specialist_pool} Constraints: - “num_agents” must equal the number of objects in “assigned_special...

work page
[47]

**COPY-PASTE FIRST**: Start by copying the ENTIRE original case text verbatim

work page
[48]

**MINIMAL EDITS ONLY**: Then modify ONLY the specific spans listed in target_evidence_group

work page
[49]

**PRESERVE EVERYTHING ELSE**: Do NOT change, remove, summarize, or rephrase any other part of the case

work page
[50]

**SAME LENGTH**: The edited_case MUST be approximately the same length as the original (+/−5%)

work page
[51]

normal results

**NO SUMMARIZATION**: NEVER replace detailed information with summaries like “normal results” or “unremarkable” Operations: - negate: Change positive finding to negative (e.g., “fever present” -> “no fever”) - remove: Delete the specific span (keep surrounding context) - replace: Substitute with different but plausible value - weaken: Reduce severity (e.g...

work page
[52]

**Clinical reasoning quality**: Which diagnosis has the strongest evidence-based reasoning from the case presentation?

work page
[53]

**Counterfactual hypothesis testing evidence**: Which diagnosis shows the strongest counterfactual evidence (high CPG scores, consistent hypothesis testing)?

work page
[54]

**Specialist consensus on critical features**: Which diagnosis has the most specialists identifying the same critical features?

work page
[55]

**Initial symptom explanation**: Which diagnosis best explains WHY the initial symptoms occurred?

work page
[56]

**Timeline consistency**: Which diagnosis fits the temporal evolution of the case?

work page
[57]

CRITICAL V ALIDATION:

**Diagnostic criteria matching**: Which diagnosis best matches established clinical criteria for that condition? WARNING: If specialists converged on a high-probability diagnosis but counterfactual hypothesis testing shows weak evidence, consider alternative diagnoses even if they have lower probability. CRITICAL V ALIDATION:

work page
[58]

If your rationale supports diagnosis A, you MUST choose diagnosis A as final_diagnosis

Your final_diagnosis MUST be consistent with your rationale. If your rationale supports diagnosis A, you MUST choose diagnosis A as final_diagnosis

work page
[59]

had_consensus

**You MUST select your final_diagnosis from the top 3 differential diagnoses provided. You cannot propose a diagnosis outside this list.** Output JSON only with this schema: { “had_consensus”: true/false, “final_diagnosis”: "final chosen diagnosis label (must select from the top 3 differential diagnoses)", “winner_role”: “Role name whose reasoning most st...

work page