Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

Chen Zhan; Xihe Qiu; Xiaoyu Tan; Xibing Zhuang; Gengchen Ma; Yue Zhang; Shuo Li; Peifeng Liu; Xiaoxiao Ge; Liang Liu; Lu Gan

arxiv: 2605.22047 · v1 · pith:ADHRUFSAnew · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 06:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords large language models · clinical decision support · diagnostic reasoning · interactive evaluation · evidence seeking · standardized patient simulator · medical AI · premature diagnostic closure

The pith

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% compared to full-context evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an OSCE-inspired standardized patient simulator to evaluate large language models on active diagnostic inquiry rather than static medical exams. Across 468 cases and 15 models, multi-turn interactions lower accuracy by 12.75 percent and supporting-evidence quality by 24.36 percent relative to giving the model all information upfront. These drops trace to premature diagnostic closure and inefficient questioning patterns. The findings indicate that full-context benchmarks likely overestimate how well models will perform when gathering information iteratively in clinical settings.

Core claim

Multi-turn evidence seeking in an interactive standardized patient simulator reduces diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% relative to full-context evaluation across 468 cases and 15 models, with error analyses linking the declines to premature diagnostic closure and inefficient questioning.

What carries the argument

OSCE-inspired standardized patient simulator that supports controlled multi-turn diagnostic interactions under uncertainty.

If this is right

Static full-context medical benchmarks overestimate model readiness for real clinical workflows.
Models exhibit premature closure when they must actively gather evidence rather than receive it all at once.
Interactive evaluation protocols are needed alongside static tests to assess safer clinical decision support.
Inefficient questioning strategies become visible only when models operate without complete context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that reward sustained uncertainty rather than quick answers could reduce the observed gaps.
Extending the simulator to include noisy or contradictory patient responses would test robustness beyond the current setup.
Deployment guidelines might require human oversight specifically during the evidence-gathering phase.

Load-bearing premise

The simulator captures enough of the uncertainty and dynamics of real clinical encounters for the observed performance gaps to apply outside the benchmark.

What would settle it

A direct comparison showing no accuracy or evidence-quality drop when the same models interact with real patients under matched conditions would falsify the claim that static benchmarks overestimate interactive performance.

Original abstract

Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports that LLMs lose 12.75% in diagnostic accuracy and 24.36% in supporting-evidence quality when they must ask questions interactively instead of receiving the full case at once.

The central result is straightforward: on 468 cases across 15 models, multi-turn evidence gathering hurts performance relative to full-context evaluation, with errors clustering around premature closure and weak questioning. They introduce an OSCE-inspired simulator to make this comparison controlled and reproducible, which is the main addition over earlier interactive tests. The scale and the error breakdown give the numbers some grounding and make the point about static benchmarks being too lenient easy to see. That part is useful for anyone thinking about how to evaluate clinical tools.

The soft spot is the simulator's realism. Nothing in the abstract or stress-test details shows calibration against real standardized-patient transcripts, inter-rater checks, or sensitivity to patient-model prompting. If the simulated responses are cleaner or less ambiguous than actual ones, the measured gap could be inflated. Without those checks the claim that static tests overestimate real-world readiness rests on an untested assumption.

This is for groups working on medical LLM evaluation and benchmark design. It is worth sending to peer review so the simulator can be stress-tested properly and the statistical controls can be examined in full.

Referee Report

1 major / 2 minor

Summary. The paper introduces an OSCE-inspired standardized patient simulator and a controlled benchmark to evaluate active diagnostic inquiry in LLMs. Across 468 cases and 15 models, multi-turn evidence seeking is reported to reduce diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses link the drops to premature diagnostic closure and inefficient questioning. The authors conclude that static full-context benchmarks may overestimate performance in interactive clinical settings.

Significance. If the simulator faithfully models real clinical uncertainty, the work provides concrete empirical evidence that interactive evaluation is needed to complement static benchmarks for safer clinical decision support. The scale (468 cases, 15 models) and explicit error analysis are strengths that make the protocol-comparison claim falsifiable and reproducible in principle.

major comments (1)

The central generalization—that static benchmarks overestimate interactive performance—rests on the OSCE-inspired simulator reproducing the distribution of patient answers, hedging, and information gaps seen in real encounters. No section reports calibration against human-standardized-patient transcripts, inter-rater agreement on simulator outputs, or sensitivity analysis to patient-model prompt variations. Without such checks, the measured 12.75% accuracy drop and 24.36% evidence-quality drop risk being benchmark artifacts rather than clinically meaningful differences.
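The inter-rater check requested above could be reported as Cohen's kappa between two annotators who judge whether each simulator reply is faithful to the underlying case. The following is a minimal sketch under that assumption; the label names and the ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of six simulator replies (not data from the paper).
rater_1 = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful", "faithful"]
rater_2 = ["faithful", "unfaithful", "unfaithful", "faithful", "unfaithful", "faithful"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most replies fall into a single category.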

minor comments (2)

Clarify whether the reported 12.75% and 24.36% figures are absolute percentage-point differences or relative reductions, and include confidence intervals or statistical tests for the comparisons.
The error-analysis section would benefit from explicit quantitative criteria or annotated examples used to identify 'premature diagnostic closure' and 'inefficient questioning' in model traces.
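The first minor comment turns on the difference between a percentage-point drop and a relative reduction, plus an interval estimate. A small sketch of both follows, using a paired percentile bootstrap over cases. The per-case outcomes are invented for illustration, and the choice of bootstrap is an assumption rather than the paper's method.

```python
import random


def accuracy(outcomes):
    """Fraction of correct cases; outcomes is a list of 0/1 flags."""
    return sum(outcomes) / len(outcomes)


def absolute_and_relative_drop(full_context, multi_turn):
    """Return (percentage-point drop, relative drop in percent)."""
    acc_full = accuracy(full_context)
    acc_multi = accuracy(multi_turn)
    absolute_pp = 100.0 * (acc_full - acc_multi)
    relative_pct = 100.0 * (acc_full - acc_multi) / acc_full
    return absolute_pp, relative_pct


def paired_bootstrap_ci(full_context, multi_turn, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the percentage-point drop.

    Cases are resampled as pairs so each resample keeps the same case
    under both protocols.
    """
    rng = random.Random(seed)
    n = len(full_context)
    drops = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_full = sum(full_context[i] for i in idx) / n
        acc_multi = sum(multi_turn[i] for i in idx) / n
        drops.append(100.0 * (acc_full - acc_multi))
    drops.sort()
    low = drops[int((alpha / 2) * n_resamples)]
    high = drops[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-case outcomes (1 = correct diagnosis); not data from the paper.
full_context = [1] * 8 + [0] * 2  # 80% correct
multi_turn = [1] * 6 + [0] * 4    # 60% correct
abs_pp, rel_pct = absolute_and_relative_drop(full_context, multi_turn)
ci_low, ci_high = paired_bootstrap_ci(full_context, multi_turn)
print(f"drop: {abs_pp:.1f} points absolute, {rel_pct:.1f}% relative")
print(f"95% CI for the absolute drop: [{ci_low:.1f}, {ci_high:.1f}] points")
```

In this toy example the same gap reads as 20 points in absolute terms and 25% in relative terms, which is why the report needs to state which convention the 12.75% and 24.36% figures follow.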

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: The central generalization—that static benchmarks overestimate interactive performance—rests on the OSCE-inspired simulator reproducing the distribution of patient answers, hedging, and information gaps seen in real encounters. No section reports calibration against human-standardized-patient transcripts, inter-rater agreement on simulator outputs, or sensitivity analysis to patient-model prompt variations. Without such checks, the measured 12.75% accuracy drop and 24.36% evidence-quality drop risk being benchmark artifacts rather than clinically meaningful differences.

Authors: We acknowledge that the manuscript does not report explicit calibration of simulator outputs against human-standardized-patient transcripts or inter-rater agreement statistics on those outputs. This is a substantive limitation for claims about ecological validity. In revision we will add a dedicated Limitations subsection that states this gap explicitly, describes the OSCE-inspired design choices used to approximate real encounters, and outlines planned follow-up calibration work using publicly available standardized-patient transcripts. On sensitivity analysis, internal prompt-variation checks were performed during benchmark construction; we will include a formal sensitivity table in the supplementary material showing that the reported accuracy and evidence-quality drops remain directionally consistent across reasonable prompt rephrasings. We maintain that the controlled, reproducible protocol still supplies falsifiable evidence that interactive settings differ from full-context ones, even while recognizing that stronger real-world anchoring would further support generalization.

Revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of evaluation protocols

The paper reports measured differences in diagnostic accuracy (12.75% drop) and evidence quality (24.36% drop) between multi-turn evidence-seeking and full-context evaluation on the same 468 cases across 15 models. These are straightforward experimental outcomes from running the introduced OSCE-inspired simulator benchmark; no equations, fitted parameters, or predictions reduce to inputs by construction. The protocol is self-contained as an empirical study with no load-bearing self-citations or ansatz smuggling. The simulator's external fidelity is a separate validity concern, not a circularity issue in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the simulator being a valid proxy for clinical reality and on the chosen metrics for diagnostic accuracy and evidence quality being appropriate.

axioms (1)

Domain assumption: The standardized patient simulator accurately represents real-world clinical diagnostic interactions under uncertainty.
Invoked to support generalization from benchmark results to clinical decision support.

invented entities (1)

OSCE-inspired standardized patient simulator (no independent evidence)
Purpose: Enable controlled multi-turn active diagnostic inquiry testing.
Newly constructed for this benchmark; no independent evidence provided beyond the paper's own protocol.

pith-pipeline@v0.9.0 · 5671 in / 1294 out tokens · 54160 ms · 2026-05-22T06:17:47.405063+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Appendix A Extended dataset construction

This appendix provides the exact prompt templates and examples used in the three-stage data curation pipeline described in Section 4.1. We include only implementation details that are necessary for replication and omit conceptual ...

Patient Information
  - [Sex, Age] (or “None”)
Chief Complaint
  - [Primary symptom] + [Duration] (or “None”)
History of Present Illness
  - Progression: [Chronological illness course]
  - Accompanying symptoms: [Comma-separated symptoms] (Use “None” if not available)
Past Medical History
  - [Relevant history] (or “None”)
Physical Examination
  - Vital signs: [Temperature, HR, RR, BP...] (or “None”)
  - [System-specific findings] (or “None”)
Auxiliary Examination
  - (1) Imaging test: [Findings] (or “None”)
  - (2) Laboratory tests: [Key abnormal results] (or “None”)
  - (3).....

Repeat back only the structured output. Please convert the following clinical vignette or case into a structured medical record. Do NOT include any diagnosis results. Fill “None” for missing fields. The text may be a clini...

The record must contain all 6 required sections with their respective content:
• Patient Information
• Chief Complaint
• History of Present Illness
• Past Medical History
• Physical Examination
• Auxiliary Examination
Note: Minor variations in section titles (e.g., spacing, punctuation, casing) are acceptable as long as the structure is clearly preserved.

The structured record must not fabricate any content. All included details must be:
• Explicitly stated in the original case, or
• Clearly implied with no assumptions beyond clinical description

If any section lacks source information, using "None" is acceptable.

Original Case: original text
Structured Medical Record: formatted record

Please assess strictly but reasonably. Answer only with yes (fully valid) or no (any fabrication, omission, or structural failure).

Fig. A3: Validation prompt for auditing the structured record against the original text: all...
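The structural part of the validation criteria (all six sections present, with tolerance for spacing, punctuation, and casing in titles) can be pre-checked with a few lines of code. This is a rule-based sketch only: the paper's audit is an LLM prompt that answers yes or no, and the function names here are hypothetical. It does not check for fabricated content.

```python
import re

REQUIRED_SECTIONS = [
    "Patient Information",
    "Chief Complaint",
    "History of Present Illness",
    "Past Medical History",
    "Physical Examination",
    "Auxiliary Examination",
]


def _normalize(text):
    """Lowercase and keep letters only, so spacing, punctuation, and casing are ignored."""
    return re.sub(r"[^a-z]", "", text.lower())


def missing_sections(record_text):
    """Return the required section titles that do not appear in the record."""
    normalized = _normalize(record_text)
    return [title for title in REQUIRED_SECTIONS if _normalize(title) not in normalized]


# Hypothetical record with minor title variations that should still pass.
complete_record = """PATIENT INFORMATION - Male, 44
Chief-Complaint: Chills for 3 days
History of present illness - Progression: None
Past Medical History - None
Physical  Examination - Vital signs: None
Auxiliary Examination - (1) Imaging test: None
"""
print(missing_sections(complete_record))
```

Because the check is a substring match on normalized text, a title that happens to appear inside free text would also count as present; a stricter version would match titles only at the start of a line.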
Patient Information
  - Male, 44
Chief Complaint
  - Chills for 3 days and arthralgias in the knees and hips (preceded by several days of unproductive cough and headache)
History of Present Illness
  - Progression: Unproductive cough and headache preceded chills and arthralgias. One week before presentation, he was treated with a macrolide antibiotic and an NSAID.
  - Accompanying symptoms: Cough, Headache, Chills, Arthralgias
Past Medical History
  - Smoking history (None otherwise)
Physical Examination
  - Vital signs: Temperature 38.5°C, Heart Rate 113/min, Blood Pressure 126/64 mmHg, Oxygen Saturation 98% on room air
  - Findings: No pericardial rub or crackles; epigastric tenderness
Auxiliary Examination
  - (1) Imaging test: Chest radiograph showed mild peribronchial cuffing. Transthoracic echocardiography revealed preserved LV function, a 9-mm pericardial effusion, and slight IVC dilation. Coronary CT excluded obstructive disease. Cardiac MRI demonstrated myocardial edema with multifocal subepicardial and subendocardial late gado...
Cardiovascular System
  - Includes: Acute coronary syndrome, heart failure, arrhythmias (e.g., atrial fibrillation), hypertensive emergencies, aortic dissection, pericarditis, etc.
Respiratory System
  - Includes: Asthma, COPD, pneumonia, pulmonary embolism, spontaneous pneumothorax, hemoptysis-related diseases, etc.
Gastro-Hepatobiliary System
  - Includes: Upper or lower GI bleeding, appendicitis, cholecystitis, pancreatitis, liver cirrhosis and complications, inflammatory bowel disease, abdominal pain, diarrhea, etc.
Neurological System
  - Includes: Ischemic stroke, TIA, seizures/epilepsy, subarachnoid hemorrhage, headaches, dizziness/vertigo, migraine, CNS infections, etc.
Infectious Diseases
  - Includes: Bacterial meningitis, urinary tract infections, community or hospital-acquired pneumonia, skin and soft tissue infections, early sepsis, tropical diseases (e.g., dengue, malaria), etc.
Metabolic, Renal & Genitourinary System
  - Includes: Diabetes mellitus (DKA, HHS), hypoglycemia, thyroid disorders (hyper/hypothyroidism), electrolyte disorders (e.g., hyponatremia, hyperkalemia), acute kidney injury, kidney stones, urinary tract diseases, etc.

If a disease does not fit into any of the above six categories, classify it as: Other

Please retu...

Fragment: "primary diagnosis"
- You MUST include the exact tag [Final Diagnosis] with brackets; do not rephrase, omit, or replace it.

Fig. B8: Clinician prompt for Task 1 (full context). The agent reads the complete structured record and must output [Final Diagnosis] followed by exactly three evidential items, using the mandatory tag verbatim.

Task 2: Active Evidence-Seeking Clinician Pr...

• You MUST include the exact tag [Final Diagnosis] with brackets; do not rephrase, omit, or replace it.

Note:
• To improve diagnostic efficiency, please perform tests only when necessary for diagnosis.
• You may only request one specific test per turn.
• Do NOT repeat tests or other modules.
• When confident, issue a [Final Diagnosis].
• You must complete the d...

Fragment: "This test was not performed yet"
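The Task 2 rules quoted above (one request per turn, no repeats, a mandatory [Final Diagnosis] tag) suggest a simple interaction loop. The sketch below is a toy illustration under stated assumptions, not the authors' simulator: the class and function names, the turn cap, the repeat message, and the toy case are all hypothetical. The reply for unavailable items reuses the extracted phrase "This test was not performed yet", whose exact role in the paper is not visible in the truncated text.

```python
FINAL_TAG = "[Final Diagnosis]"
NOT_PERFORMED = "This test was not performed yet"
REPEATED = "This item was already requested."  # hypothetical wording


class PatientSimulator:
    """Reveals one field of a structured record per request."""

    def __init__(self, record):
        self.record = dict(record)
        self.requested = set()

    def answer(self, item):
        if item in self.requested:
            return REPEATED
        self.requested.add(item)
        return self.record.get(item, NOT_PERFORMED)


def run_episode(clinician, simulator, max_turns=10):
    """Alternate clinician requests and simulator replies until a final diagnosis.

    max_turns is an assumed cap; the source text for the actual limit is truncated.
    """
    history = []
    for _ in range(max_turns):
        message = clinician(history)
        if FINAL_TAG in message:
            return message.split(FINAL_TAG, 1)[1].strip(), history
        history.append((message, simulator.answer(message)))
    return None, history  # no diagnosis issued within the turn budget


def scripted_clinician(history):
    """Stand-in for an LLM: asks two fixed questions, then commits."""
    plan = ["Chief Complaint", "Chest radiograph"]
    if len(history) < len(plan):
        return plan[len(history)]
    return f"{FINAL_TAG} Pneumonia"


# Toy case for illustration only; not one of the benchmark's 468 cases.
toy_record = {
    "Chief Complaint": "Fever and cough for 3 days",
    "Chest radiograph": "Right lower lobe consolidation",
}
simulator = PatientSimulator(toy_record)
diagnosis, history = run_episode(scripted_clinician, simulator)
print(diagnosis, len(history))
```

In this framing, premature diagnostic closure corresponds to a clinician that emits the final tag while diagnostically relevant fields are still unrequested, which can be read directly from the simulator's `requested` set.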

[1] [1]

Liao, Y.et al.Automatic interactive evaluation for large language models with state aware patient simulator.arXiv preprint arXiv:2403.08495(2024). 24

work page arXiv 2024

[2] [2]

& Einav, S

Idan, D. & Einav, S. Primer on large language models: an educational overview for intensivists.Critical Care29, 238 (2025)

work page 2025

[3] [3]

& Varghese, J

Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of chatgpt, google search and llama 2 for clinical decision support tasks.Nature communications15, 2050 (2024)

work page 2050

[4] [4]

Jin, D.et al.What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences11, 6421 (2021)

work page 2021

[5] [5]

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. Pubmedqa: A dataset for biomedical research question answering.arXiv preprint arXiv:1909.06146(2019)

work page arXiv 1909

[6] [6]

Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa : A large-scale multi- subject multi-choice dataset for medical domain question answering (2022). URL https://arxiv.org/abs/2203.14371. arXiv:2203.14371

work page arXiv 2022

[7] [7]

Singhal, K.et al.Toward expert-level medical question answering with large language models.Nature Medicine31, 943–950 (2025)

work page 2025

[8] [8]

Capabilities of GPT-4 on Medical Challenge Problems

Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of gpt-4 on medical challenge problems.arXiv preprint arXiv:2303.13375(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Brin, D.et al.Comparing chatgpt and gpt-4 performance in usmle soft skill assessments.Scientific Reports13, 16492 (2023)

work page 2023

[10] [10]

NPJ Digital Medicine6, 226 (2023)

Liu, F.et al.A medical multimodal large language model for future pandemics. NPJ Digital Medicine6, 226 (2023)

work page 2023

[11] [11]

Bedi, S.et al.Testing and evaluation of health care applications of large language models: a systematic review.Jama(2025)

work page 2025

[12] [12]

Coderre, S., Mandin, H., Harasym, P. H. & Fick, G. H. Diagnostic reasoning strategies and diagnostic success.Medical education37, 695–703 (2003)

work page 2003

[13] [13]

S., Shulman, L

Elstein, A. S., Shulman, L. S. & Sprafka, S. A.Medical problem solving: An analysis of clinical reasoning(Harvard University Press, 1978)

work page 1978

[14] [14]

M., Stevenson, M., Downie, W

Harden, R. M., Stevenson, M., Downie, W. W. & Wilson, G. Assessment of clinical competence using objective structured examination.Br Med J1, 447–451 (1975)

work page 1975

[15] [15]

Z., Ramachandran, S., Gaunt, K

Khan, K. Z., Ramachandran, S., Gaunt, K. & Pushkar, P. The objective struc- tured clinical examination (osce): Amee guide no. 81. part i: an historical and theoretical perspective.Medical teacher35, e1437–e1446 (2013). 25

work page 2013

[16] [16]

Barrows, H. S. An overview of the uses of standardized patients for teaching and evaluating clinical skills. aamc.Academic medicine68, 443–51 (1993)

work page 1993

[17] [17]

Yao, Z.et al.Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework.arXiv preprint arXiv:2410.01553(2024)

work page arXiv 2024

[18] [18]

Jiang, Y.et al.Medagentbench: a virtual ehr environment to benchmark medical llm agents.NEJM AI2, AIdbp2500144 (2025)

work page 2025

[19] [19]

Y., Miao, B

Williams, C. Y., Miao, B. Y., Kornblith, A. E. & Butte, A. J. Evaluating the use of large language models to provide clinical recommendations in the emergency department.Nature communications15, 8236 (2024)

work page 2024

[20]

Li, S. et al. MediQ: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems 37, 28858–28888 (2024).

[21]

Chen, S. et al. MedDialog: a large-scale medical dialogue dataset. arXiv preprint arXiv:2004.03329 (2020).

[22]

Saley, V. V., Saha, G., Das, R. J., Raghu, D. et al. MediTOD: An English dialogue dataset for medical history taking with comprehensive annotations. arXiv preprint arXiv:2410.14204 (2024).

[23]

Tsoukalas, A., Albertson, T., Tagkopoulos, I. et al. From data to optimal decision making: a data-driven, probabilistic machine learning approach to decision support for patients with sepsis. JMIR medical informatics 3, e3445 (2015).

[24]

von Kleist, H., Zamanian, A., Shpitser, I. & Ahmidi, N. Evaluation of active feature acquisition methods for time-varying feature settings. Journal of Machine Learning Research 26, 1–84 (2025).

[25]

Markus, A. F., Kors, J. A. & Rijnbeek, P. R. The role of explainability in creating trustworthy artificial intelligence for health care: a comprehensive survey of the terminology, design choices, and evaluation strategies. Journal of biomedical informatics 113, 103655 (2021).

[26]

Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 1–9 (2025).

[27]

Zhu, J., Pan, J., Liu, Y., Liu, F. & Wu, J. Ask patients with patience: Enabling LLMs for human-centric medical dialogue with grounded reasoning (2025). URL https://arxiv.org/abs/2502.07143. arXiv:2502.07143.

[28]

Walker, L. Artificial narrow intelligence-driven diagnostics: impacts, inequities, and policy imperatives in global healthcare. Ph.D. thesis, Technische Universität Wien (2024).

[29]

Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in neural information processing systems 36, 46595–46623 (2023).

[30]

Ke, Y. H. et al. Clinical and economic impact of a large language model in perioperative medicine: a randomized crossover trial. npj Digital Medicine 8, 462 (2025).

[31]

Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Large language models lack essential metacognition for reliable medical reasoning. Nature communications 16, 642 (2025).

[32]

Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine 8, 263 (2025).

[33]

Luo, M.-J. et al. A large language model digital patient system enhances ophthalmology history taking skills. NPJ Digital Medicine 8, 502 (2025).

[34]

Hurst, A. et al. GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024).

[35]

Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nature medicine 31, 932–942 (2025).

[36]

Wu, K. et al. MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports (2025). URL https://arxiv.org/abs/2505.11733. arXiv:2505.11733.

[37]

Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Scientific data 3, 1–9 (2016).

[38]

DeepSeek-AI, A. L. et al. DeepSeek-V3 technical report (2024). URL https://arxiv.org/abs/2412.19437.

[39]

Team, G. et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).

[40]

Comanici, G. et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).

[41]

Guo, D. et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).

[42]

Yang, A. et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

[43]

GLM, T. et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024).

[44]

Wang, B. et al. Baichuan-M1: Pushing the medical capability of large language models. arXiv preprint arXiv:2502.12671 (2025).

[45]

Grattafiori, A. et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).

[46]

Hui, B. et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024).

Appendix A Extended dataset construction

This appendix provides the exact prompt templates and examples used in the three-stage data curation pipeline described in Section 4.1. We include only implementation details that are necessary for replication and omit conceptual ...

Patient Information
- [Sex, Age] (or “None”)

Chief Complaint
- [Primary symptom] + [Duration] (or “None”)

History of Present Illness
- Progression: [Chronological illness course]
- Accompanying symptoms: [Comma-separated symptoms] (Use “None” if not available)

Past Medical History
- [Relevant history] (or “None”)

Physical Examination
- Vital signs: [Temperature, HR, RR, BP...] (or “None”)
- [System-specific findings] (or “None”)

Auxiliary Examination
- (1) Imaging test: [Findings] (or “None”)
- (2) Laboratory tests: [Key abnormal results] (or “None”)
- (3).....

Repeat back only the structured output.

Please convert the following clinical vignette or case into a structured medical record. Do NOT include any diagnosis results. Fill “None” for missing fields. The text may be a clini...

The record must contain all 6 required sections with their respective content:
• Patient Information
• Chief Complaint
• History of Present Illness
• Past Medical History
• Physical Examination
• Auxiliary Examination
Note: Minor variations in section titles (e.g., spacing, punctuation, casing) are acceptable as long as the structure is clearly preserved.

The structured record must not fabricate any content. All included details must be:
• Explicitly stated in the original case, or
• Clearly implied with no assumptions beyond clinical description

If any section lacks source information, using "None" is acceptable.

Original Case: original text
Structured Medical Record: formatted record

Please assess strictly but reasonably. Answer only with yes (fully valid) or no (any fabrication, omission, or structural failure).

Fig. A3: Validation prompt for auditing the structured record against the original text: all...
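The structural half of the validation prompt above (all six sections present, with tolerance for spacing, punctuation, and casing in the titles) can also be checked mechanically. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `missing_sections` and `_normalize` are hypothetical, and the fabrication check, which the prompt delegates to a model, is not covered.

```python
import re

# Section titles taken from the validation prompt.
REQUIRED_SECTIONS = [
    "Patient Information",
    "Chief Complaint",
    "History of Present Illness",
    "Past Medical History",
    "Physical Examination",
    "Auxiliary Examination",
]


def _normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def missing_sections(record: str) -> list[str]:
    """Return the required section titles that do not appear in the record.

    Hypothetical helper: titles are matched after normalization, so
    differences in spacing, punctuation, and casing are ignored.
    """
    flat = _normalize(record)
    return [title for title in REQUIRED_SECTIONS if _normalize(title) not in flat]
```

An empty return value means the record passes the structural check. Because matching is done on the flattened text, a title that only appears inside free text would also count as present, so this is a coarse pre-filter rather than a replacement for the model-based audit.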

Patient Information
- Male, 44

Chief Complaint
- Chills for 3 days and arthralgias in the knees and hips (preceded by several days of unproductive cough and headache)

History of Present Illness
- Progression: Unproductive cough and headache preceded chills and arthralgias. One week before presentation, he was treated with a macrolide antibiotic and an NSAID.
- Accompanying symptoms: Cough, Headache, Chills, Arthralgias

Past Medical History
- Smoking history (None otherwise)

Physical Examination
- Vital signs: Temperature 38.5°C, Heart Rate 113/min, Blood Pressure 126/64 mmHg, Oxygen Saturation 98% on room air
- Findings: No pericardial rub or crackles; epigastric tenderness

Auxiliary Examination
- (1) Imaging test: Chest radiograph showed mild peribronchial cuffing. Transthoracic echocardiography revealed preserved LV function, a 9-mm pericardial effusion, and slight IVC dilation. Coronary CT excluded obstructive disease. Cardiac MRI demonstrated myocardial edema with multifocal subepicardial and subendocardial late gado...

Cardiovascular System
- Includes: Acute coronary syndrome, heart failure, arrhythmias (e.g., atrial fibrillation), hypertensive emergencies, aortic dissection, pericarditis, etc.

Respiratory System
- Includes: Asthma, COPD, pneumonia, pulmonary embolism, spontaneous pneumothorax, hemoptysis-related diseases, etc.

Gastro-Hepatobiliary System
- Includes: Upper or lower GI bleeding, appendicitis, cholecystitis, pancreatitis, liver cirrhosis and complications, inflammatory bowel disease, abdominal pain, diarrhea, etc.

Neurological System
- Includes: Ischemic stroke, TIA, seizures/epilepsy, subarachnoid hemorrhage, headaches, dizziness/vertigo, migraine, CNS infections, etc.

Infectious Diseases
- Includes: Bacterial meningitis, urinary tract infections, community or hospital-acquired pneumonia, skin and soft tissue infections, early sepsis, tropical diseases (e.g., dengue, malaria), etc.

Metabolic, Renal & Genitourinary System
- Includes: Diabetes mellitus (DKA, HHS), hypoglycemia, thyroid disorders (hyper/hypothyroidism), electrolyte disorders (e.g., hyponatremia, hyperkalemia), acute kidney injury, kidney stones, urinary tract diseases, etc.

If a disease does not fit into any of the above six categories, classify it as: Other

Please retu...

- You MUST include the exact tag [Final Diagnosis] with brackets; do not rephrase, omit, or replace it.

Fig. B8: Clinician prompt for Task 1 (full context). The agent reads the complete structured record and must output [Final Diagnosis] followed by exactly three evidential items, using the mandatory tag verbatim.

Task 2: Active Evidence-Seeking Clinician Pr...


This test was not performed yet

• You MUST include the exact tag [Final Diagnosis] with brackets; do not rephrase, omit, or replace it.

Note:
• To improve diagnostic efficiency, please perform tests only when necessary for diagnosis.
• You may only request one specific test per turn.
• Do NOT repeat tests or other modules.
• When confident, issue a [Final Diagnosis].
• You must complete the d...
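Both clinician prompts require the literal tag [Final Diagnosis], so a harness scoring model replies would plausibly begin by locating that tag. The sketch below illustrates one way to do this. The function name `extract_final_diagnosis` and the choice to return everything after the first tag occurrence are assumptions, not details taken from the paper.

```python
from typing import Optional

# The exact tag mandated by the clinician prompts, brackets included.
FINAL_TAG = "[Final Diagnosis]"


def extract_final_diagnosis(reply: str) -> Optional[str]:
    """Return the text after the first exact tag, or None if the tag is absent.

    Hypothetical helper: a rephrased tag such as "Final Diagnosis:" is
    treated as missing, mirroring the prompt's "do not rephrase" rule.
    """
    _, found, tail = reply.partition(FINAL_TAG)
    if not found:
        return None
    return tail.strip()
```

A `None` result marks a turn in which the model has not yet committed to a diagnosis, or has broken the tag format.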