pith. sign in

arxiv: 2605.30637 · v1 · pith:KTOG3O2Snew · submitted 2026-05-28 · 💻 cs.AI

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Pith reviewed 2026-06-29 06:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords EHR benchmarkLLM evaluationclinical decision makingelectronic health recordsautomated benchmarkdiagnosis treatment prognosisknowledge base verification
0
0 comments X

The pith

EHRBench builds nearly 1 million EHR-grounded QA items via an automated LLM-KB pipeline to evaluate LLMs on diagnosis, treatment, and prognosis tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EHRBench as a benchmark that generates nearly one million question-answer pairs directly from real patient electronic health records to test how LLMs handle clinical decision making. The construction uses an interaction pipeline where an LLM turns encounter trajectories into templates, which are then verified and enriched against a knowledge base to remove unreliable content. This produces items for three tasks that require inference under incomplete evidence: inferring diagnoses, choosing treatments, and predicting outcomes. A sympathetic reader would care because prior LLM tests in medicine have often used synthetic or small-scale data that do not match the demands of actual clinical workflows.

Core claim

EHRBench is constructed through an EHR-LLM-KB interaction pipeline. A specialized LLM converts encounter-level EHR trajectories into structured templates and deterministically instantiates them into QA items. Systematic KB-based verification and enrichment filters hallucinated or ambiguous relations. This yields 960067 QA items spanning diagnosis, treatment, and prognosis. Benchmarking more than 30 LLMs produces consistent capability trends across settings.

What carries the argument

The EHR-LLM-KB interaction pipeline that automates template generation from real EHR trajectories and applies knowledge-base verification to ensure reliability at scale.

If this is right

  • LLMs can be tested at scale on practical clinical decision tasks that require substantive biomedical inference from real patient records.
  • The three tasks together cover diagnosis, treatment selection, and prognosis prediction under incomplete evidence.
  • Consistent performance trends across more than thirty models validate that the benchmark produces stable rankings.
  • The automated pipeline removes the need for manual curation while maintaining claimed clinical reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline structure could be reused with updated knowledge bases to refresh the benchmark as medical guidelines evolve.
  • Models that perform well on EHRBench might still require separate testing on live clinical workflows where patient context changes over time.
  • The large item count makes it feasible to create task-specific subsets for targeted model improvement or error analysis.

Load-bearing premise

The knowledge-base verification step reliably catches and removes hallucinated or ambiguous clinical relations produced by the LLM template generator without needing manual review.

What would settle it

Independent expert review of a random sample of several hundred QA items revealing a substantial fraction that contain clinically incorrect, unverifiable, or ambiguous content.

Figures

Figures reproduced from arXiv: 2605.30637 by Carl Yang, Guanchen Wu, Hejie Cui, Jiaying Lu, Kai Shu, Keqi Han, Xiao Hu, Yunpeng Xiao, Yuzhang Xie, Ziyang Zhang.

Figure 1
Figure 1. Figure 1: Overview of EHRBench. EHRBench automatically [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Construction pipeline of EHRBench. Starting from raw structured EHRs, we preprocess and normalize encounter-level [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall accuracy versus throughput (M Tokens/h) for all evaluated models. Throughput is computed as total tokens [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall performance and token cost of reasoning model configurations on EHRBench. Each point corresponds to one [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗
read the original abstract

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill the gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB(knowledge-base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EHRBench, a benchmark of 960,067 QA items spanning diagnosis, treatment, and prognosis tasks, constructed via an automated EHR-LLM-KB pipeline that converts EHR trajectories into templates, instantiates them deterministically, and applies KB verification/enrichment to filter hallucinations. It then benchmarks >30 LLMs on the resulting dataset and reports performance trends.

Significance. If the reliability of the KB filtering step holds, EHRBench would offer a valuable large-scale, EHR-grounded resource for evaluating LLMs on clinically relevant inference tasks, filling a gap in scalable CDM benchmarks. The scale and multi-task coverage are strengths, but the absence of validation data for the core reliability mechanism substantially reduces the immediate utility of the contribution.

major comments (1)
  1. [Abstract and Section 3 (pipeline)] Abstract and pipeline description (Section 3): The central claim that 'systematic KB-based verification and enrichment' successfully filters hallucinated or ambiguous relations to produce clinically reliable QA items without manual review is unsupported by any quantitative evidence. No error-rate measurements, precision/recall of the KB step, clinician-rated samples of pre- and post-filter items, or comparison to a human gold standard are reported. This directly undermines the 'reliable' qualifier in the title and abstract, as undetected errors from the LLM template generator can propagate into the final ~1M items.
minor comments (2)
  1. [Section 3] The manuscript should clarify the exact criteria and implementation details of the KB verification rules (e.g., which relations are checked and how enrichment is performed) to allow reproducibility.
  2. [Results section] Figure or table presenting the benchmark statistics (e.g., breakdown by task and source) would benefit from explicit error bars or confidence intervals if any sampling was involved in construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale and scope of EHRBench. We address the central concern regarding validation of the KB filtering step below.

read point-by-point responses
  1. Referee: Abstract and pipeline description (Section 3): The central claim that 'systematic KB-based verification and enrichment' successfully filters hallucinated or ambiguous relations to produce clinically reliable QA items without manual review is unsupported by any quantitative evidence. No error-rate measurements, precision/recall of the KB step, clinician-rated samples of pre- and post-filter items, or comparison to a human gold standard are reported. This directly undermines the 'reliable' qualifier in the title and abstract, as undetected errors from the LLM template generator can propagate into the final ~1M items.

    Authors: We agree that the manuscript does not report direct quantitative validation (e.g., precision/recall, clinician ratings, or pre/post-filter error rates) for the KB verification step, and that this weakens the strength of the reliability claim. The current argument rests on the pipeline design (deterministic instantiation plus KB grounding), but lacks empirical measurement of its filtering efficacy. In the revised manuscript we will add a dedicated evaluation subsection in Section 3 that reports: (i) the fraction of items filtered by the KB step, (ii) clinician review of a 500-item random sample of pre- and post-filter QA pairs with inter-rater agreement and error-rate reduction statistics, and (iii) a comparison against a small human-annotated gold set. These additions will either substantiate or appropriately qualify the 'reliable' descriptor. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark pipeline uses external EHR data and independent KB without self-referential reduction.

full rationale

The paper constructs EHRBench via an EHR-LLM-KB pipeline that converts external patient trajectories into templates and applies KB verification for filtering. No equations, fitted parameters, or predictions are defined in terms of the output QA items themselves. The reliability claim rests on the pipeline's design rather than any self-citation chain or ansatz that reduces the result to its inputs by construction. The construction draws from independent sources (EHR records and KB), making the derivation self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the LLM-KB pipeline produces clinically accurate QA items at scale; no free parameters or invented entities are introduced, but the domain assumption about verification reliability is load-bearing.

axioms (1)
  • domain assumption The EHR-LLM-KB interaction pipeline produces reliable QA items without significant hallucination or ambiguity after verification.
    Invoked to justify the automated construction of the benchmark at scale without manual curation.

pith-pipeline@v0.9.1-grok · 5851 in / 1268 out tokens · 25217 ms · 2026-06-29T06:38:53.112496+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

133 extracted references · 49 canonical work pages · 20 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al . 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774(2023)

  2. [2]

    Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL Aerts, Jakob Nikolas Kather, Daniel Truhn, and Keno Bressem. 2025. Longhealth: A question answering benchmark with long clinical documents.Journal of Healthcare Informatics Research(2025), 1–17

  3. [3]

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, et al. 2025. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925(2025)

  4. [4]

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, et al . 2025. Healthbench: Evaluat- ing large language models towards improved human health.arXiv preprint arXiv:2505.08775(2025)

  5. [5]

    Yaara Artsi, Vera Sorin, Eli Konen, Benjamin S Glicksberg, Girish Nadkarni, and Eyal Klang. 2024. Large language models for generating medical examinations: systematic review.BMC medical education24, 1 (2024), 354

  6. [6]

    Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, et al. 2023. Ehrxqa: A multi-modal question answering dataset for elec- tronic health records with chest x-ray images.Advances in Neural Information Processing Systems36 (2023), 3867–3880

  7. [7]

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patino, et al. 2025. SmolLM3: smol, multilingual, long-context reasoner.Hugging Face Blog(2025)

  8. [8]

    Jayetri Bardhan, Anthony Colas, Kirk Roberts, and Daisy Zhe Wang. 2022. Drugehrqa: A question answering dataset on structured and unstructured electronic health records for medicine related queries.arXiv preprint arXiv:2205.01290(2022)

  9. [9]

    Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Wenshan Han, Zhiyong Lu, and Zhe He. 2025. Preliminary anal- ysis of the impact of lab results on large language model generated differential diagnoses.npj Digital Medicine8, 1 (2025), 166

  10. [10]

    Olivier Bodenreider. 2004. The unified medical language system (UMLS): in- tegrating biomedical terminology.Nucleic acids research32, suppl_1 (2004), D267–D270

  11. [11]

    Joeran S Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J Fransen, Constant R Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, et al. 2025. The DRAGON benchmark for clinical NLP.npj Digital Medicine8, 1 (2025), 289

  12. [12]

    Kathi Canese and Sarah Weis. 2013. PubMed: the bibliographic database.The NCBI handbook2, 1 (2013), 2013

  13. [13]

    Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. 2024. Huatuogpt-o1, towards medical complex reasoning with llms.arXiv preprint arXiv:2412.18925(2024)

  14. [14]

    Christopher Chiu, Silviu Pitis, and Mihaela van der Schaar. 2025. Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models.arXiv preprint arXiv:2510.10278(2025)

  15. [15]

    Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. 2024. Med42-v2: A Suite of Clinical LLMs. https://arxiv. org/abs/2408.06142

  16. [16]

    Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam H Shah. 2025. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records.npj Digital Medicine8, 1 (2025), 577

  17. [17]

    Amin Dada, Osman Koraş, Marie Bauer, Amanda Butler, Kaleb Smith, Jens Kleesiek, and Julian Friedrich. 2025. Medisumqa: Patient-oriented question- answer generation from discharge letters. InProceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health). 124–136

  18. [18]

    Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, and Paul Pu Liang. 2025. Climb: Data foundations for large scale multimodal clinical foundation models.arXiv preprint arXiv:2503.07667(2025)

  19. [19]

    Felix J Dorfner, Amin Dada, Felix Busch, Marcus R Makowski, Tianyu Han, Daniel Truhn, et al. 2024. Biomedical large languages models seem not to be supe- rior to generalist models on unseen medical data.arXiv preprint arXiv:2408.13833 (2024)

  20. [20]

    Felix J Dorfner, Amin Dada, Felix Busch, Marcus R Makowski, Tianyu Han, Daniel Truhn, et al. 2025. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks.Journal of the American Medical Informatics Association32, 6 (2025), 1015–1024

  21. [21]

    Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, and Tong Ruan. 2025. Medodyssey: A medical domain benchmark for long context evalu- ation up to 200k tokens. InFindings of the Association for Computational Lin- guistics: NAACL 2025. 32–56

  22. [22]

    Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Wang Siyuan, Zhongyu Wei, and Fei Huang. 2025. Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. InProceedings of the 31st International Conference on Computational Linguistics. 10183–10213

  23. [23]

    Scott L Fleming, Alejandro Lozano, William J Haberkorn, Jenelle A Jindal, Ed- uardo Reis, Rahul Thapa, Louis Blankemeier, Julian Z Genkins, Ethan Steinberg, Ashwin Nayak, et al . 2024. Medalign: A clinician-generated dataset for in- struction following with electronic medical records. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 3...

  24. [24]

    Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, et al

  25. [25]

    Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793(2024)

  26. [26]

    Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, and Yang Liu. 2025. The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents’ Inquiry Capability.arXiv preprint arXiv:2509.24958(2025)

  27. [27]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  28. [28]

    Chenlu Guo, Nuo Xu, Yi Chang, and Yuan Wu. 2024. Chbench: A chinese dataset for evaluating health in large language models.arXiv preprint arXiv:2409.15766 (2024)

  29. [29]

    Keqi Han, Songlin Zhao, Yao Su, Xiang Li, Yixuan Yuan, Lifang He, and Carl Yang. 2026. Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration.arXiv preprint arXiv:2605.09366(2026)

  30. [30]

    Tessa Han, Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Medsafetybench: Evaluating and improving the medical safety of large language models.Advances in Neural Information Processing Systems37 (2024), 33423– 33454

  31. [31]

    Vinyas Harish, Felipe Morgado, Ariel D Stern, and Sunit Das. 2021. Artificial intelligence and clinical decision making: the new nature of medical uncertainty. Academic Medicine96, 1 (2021), 31–36

  32. [32]

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. 2024. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22170–22183

  33. [33]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, et al. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems43, 2 (2025), 1–55

  34. [34]

    Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. 2025. m1: Unleash the potential of test-time scaling for medical reasoning with large language models.arXiv preprint arXiv:2504.00869(2025)

  35. [35]

    Liesbeth Hunik, Asma Chaabouni, Twan van Laarhoven, Tim C Olde Hartman, Ralph TH Leijenaar, Jochen WL Cals, Annemarie A Uijen, and Henk J Schers

  36. [36]

    Diagnostic Prediction Models for Primary Care, Based on AI and Electronic Health Records: Systematic Review.JMIR Medical Informatics13, 1 (2025), e62862

  37. [37]

    Mingyi Jia, Junwen Duan, Yan Song, and Jianxin Wang. 2025. medikal: Inte- grating knowledge graphs as assistants of llms for enhanced clinical diagnosis on emrs. InProceedings of the 31st International Conference on Computational Linguistics. 9278–9298

  38. [38]

    AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de Las Casas, et al. 2023. Mistral 7B. arXiv 2023.arXiv preprint arXiv:2310.06825(2023)

  39. [39]

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences11, 14 (2021), 6421

  40. [40]

    Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al

  41. [41]

    MIMIC-IV, a freely accessible electronic health record dataset.Scientific data10, 1 (2023), 1

  42. [42]

    Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data3, 1 (2016), 1–9

  43. [43]

    Shreya Johri, Jaehwan Jeong, Benjamin A Tran, Daniel I Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou, and Pranav Rajpurkar. 2024. CRAFT-MD: A conversational evaluation framework for comprehensive assess- ment of clinical LLMs. InAAAI 2024 Spring Symposium on Clinical Foundation Models

  44. [44]

    Nikhil Khandekar, Qiao Jin, Guangzhi Xiong, Soren Dunn, Serina Applebaum, Zain Anwar, Maame Sarfo-Gyamfi, Conrad Safranek, Abid Anwar, Andrew Zhang, et al. 2024. Medcalc-bench: Evaluating large language models for medical calculations.Advances in Neural Information Processing Systems37 (2024), 84730– 84745

  45. [45]

    Halil Kilicoglu, Dongwook Shin, Marcelo Fiszman, Graciela Rosemblat, and Thomas C Rindflesch. 2012. SemMedDB: a PubMed-scale repository of biomedi- cal semantic predications.Bioinformatics28, 23 (2012), 3158–3160

  46. [46]

    Michelle Kang Kim, Carol Rouphael, John McMichael, Nicole Welch, and Srini- vasan Dasarathy. 2023. Challenges in and opportunities for electronic health record-based data analysis and interpretation.Gut and liver18, 2 (2023), 201

  47. [47]

    Yunsoo Kim, Jinge Wu, Yusuf Abdulle, and Honghan Wu. 2024. MedExQA: Medical question answering benchmark with multiple explanations.arXiv EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs KDD 2026, August 9–13, 2026, Jeju Island, Republic of Korea. preprint arXiv:2406.06331(2024)

  48. [48]

    Rachel Knevel and Katherine P Liao. 2023. From real-world electronic health record data to real-world results using artificial intelligence.Annals of the Rheumatic Diseases82, 3 (2023), 306–311

  49. [49]

    Yogesh Kumar, Apeksha Koul, Ruchi Singla, and Muhammad Fazal Ijaz. 2023. Artificial intelligence in disease diagnosis: a systematic literature review, syn- thesizing framework and future research agenda.Journal of ambient intelligence and humanized computing14, 7 (2023), 8459–8486

  50. [50]

    Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwang Kim, Jeewon Yang, Seunghyun Won, and Edward Choi. 2024. Ehrnoteqa: An llm benchmark for real-world clinical practice using discharge summaries. Advances in Neural Information Processing Systems37 (2024), 124575–124611

  51. [51]

    Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, and Yang Liu. 2025. Doctor- r1: Mastering clinical inquiry with experiential agentic reinforcement learning. arXiv preprint arXiv:2510.04284(2025)

  52. [52]

    Gyubok Lee, Hyeonji Hwang, Seongsu Bae, Yeonsu Kwon, Woncheol Shin, Seongjun Yang, Minjoon Seo, Jong-Yeup Kim, and Edward Choi. 2022. Ehrsql: A practical text-to-sql benchmark for electronic health records.Advances in Neural Information Processing Systems35 (2022), 15589–15601

  53. [53]

    Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. 2024. Agent hospital: A simu- lacrum of hospital with evolvable medical agents.arXiv preprint arXiv:2405.02957 (2024)

  54. [54]

    Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei W Koh, and Yulia Tsvetkov. 2024. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning.Advances in Neural Information Processing Systems37 (2024), 28858–28888

  55. [55]

    Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, et al . 2025. EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis.arXiv preprint arXiv:2510.25628(2025)

  56. [56]

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, et al . 2026. Ministral 3.arXiv preprint arXiv:2601.08584(2026)

  57. [57]

    Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, et al . 2024. Large language models in the clinic: a comprehensive benchmark.arXiv preprint arXiv:2405.00716(2024)

  58. [58]

    Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, and David A Clifton. 2025. RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction.arXiv preprint arXiv:2503.03802 (2025)

  59. [59]

    Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, and Michael Lyu. 2024. Medchain: Bridging the gap between llm agents and clinical practice through interactive sequential benchmarking.arXiv preprint arXiv:2412.01605(2024)

  60. [60]

    Jie Liu, Wenxuan Wang, Su Yihang, Jingyuan Huang, Yudi Zhang, Cheng-Yi Li, Wenting Chen, Xiaohan Xing, Kao-Jung Chang, Linlin Shen, et al . 2025. Asclepius: A spectrum evaluation benchmark for medical multi-modal large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 24181–24201

  61. [61]

    Ruoyu Liu, Kui Xue, Xiaofan Zhang, and Shaoting Zhang. 2025. Interactive evaluation for medical llms via task-oriented dialogue system. InProceedings of the 31st International Conference on Computational Linguistics. 4871–4896

  62. [62]

    Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. 2024. On llms-driven synthetic data generation, curation, and evaluation: A survey.arXiv preprint arXiv:2406.15126(2024)

  63. [63]

    Meng Lu, Yuzhang Xie, Zhenyu Bi, Shuxiang Cao, and Xuan Wang. 2025. CROSSAGENTIE: Cross-Type and Cross-Task Multi-Agent LLM Collaboration for Zero-Shot Information Extraction. InFindings of the Association for Compu- tational Linguistics: ACL 2025. 13953–13977

  64. [64]

    Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, and Dan Roth. 2024. Expertqa: Expert-curated questions and attributed answers. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 3025–3045

  65. [65]

    Izet Masic. 2022. Medical decision making-an overview.Acta Informatica Medica 30, 3 (2022), 230

  66. [66]

    Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F Molina, and Ahmed Alaa. 2025. ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room.arXiv preprint arXiv:2505.22919(2025)

  67. [67]

    Benjamin Molinet, Santiago Marro, Elena Cabrio, and Serena Villata. 2024. Ex- planatory argumentation in natural language for correct and incorrect medical diagnoses.Journal of Biomedical Semantics15, 1 (2024), 8

  68. [68]

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. 2024. Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10862–10878

  69. [69]

    Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, et al. 2025. Sequential Diagnosis with Language Models. arXiv preprint arXiv:2506.22405(2025)

  70. [70]

    Kimberly J O’malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. 2005. Measuring diagnoses: ICD code accuracy.Health services research40, 5p2 (2005), 1620–1639

  71. [71]

    David Oniani, Xizhi Wu, Shyam Visweswaran, Sumit Kapoor, Shravan Koora- gayalu, Katelyn Polanska, and Yanshan Wang. 2024. Enhancing large language models for clinical decision support by incorporating clinical practice guidelines. In2024 IEEE 12th International Conference on Healthcare Informatics (ICHI). IEEE, 694–702

  72. [72]

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. InConference on health, inference, and learning. PMLR, 248–260

  73. [73]

    Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrqa: A large corpus for question answering on electronic medical records.arXiv preprint arXiv:1809.00732(2018)

  74. [74]

    Maria Panagioti, Kanza Khan, Richard N Keers, Aseel Abuzour, Denham Phipps, Evangelos Kontopantelis, Peter Bower, Stephen Campbell, Razaan Haneef, An- thony J Avery, et al. 2019. Prevalence, severity, and nature of preventable patient harm across medical care settings: systematic review and meta-analysis.bmj 366 (2019)

  75. [75]

    Junwoo Park, Youngwoo Cho, Haneol Lee, Jaegul Choo, and Edward Choi. 2021. Knowledge graph-based question answering with electronic health records. In Machine Learning for Healthcare Conference. PMLR, 36–53

  76. [76]

    Own-point-of-view

    Thierry Pelaccia, Jacques Tardif, Emmanuel Triby, and Bernard Charlin. 2017. A Novel Approach to Study Medical Decision Making in the Clinical Setting: The “Own-point-of-view” Perspective.Academic Emergency Medicine24, 7 (2017), 785–795

  77. [77]

    Oriel Perets, Ofir Ben Shoham, Nir Grinberg, and Nadav Rappoport. 2025. CUP- Case: Clinically Uncommon Patient Cases and Diagnoses Dataset. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 28293–28301

  78. [78]

    Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2025. Quantify- ing the reasoning abilities of llms on real-world clinical cases.arXiv preprint arXiv:2503.04691(2025)

  79. [79]

    Preethi Raghavan, Jennifer J Liang, Diwakar Mahajan, Rachita Chandra, and Peter Szolovits. 2021. emrkbqa: A clinical knowledge-base question answering dataset. Association for Computational Linguistics

  80. [80]

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. 2024. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments.arXiv preprint arXiv:2405.07960(2024)

Showing first 80 references.