AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Bryan YP Yan; Guangxin Dai; Huizi Yu; Jiahui Niu; Jingxian He; Kent CY So; Lizhou Fan; Wenkong Wang; Xiang Li; Xin Ma

arxiv: 2606.17474 · v1 · pith:N624Z4RVnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Pith reviewed 2026-06-27 01:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language models · clinical consultation · electronic health records · multi-turn evaluation · diagnostic reasoning · medical AI · knowledge graphs · patient safety

The pith

Final-answer accuracy alone cannot show if LLMs are ready for clinical consultations because models often fail in gathering information and handling uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AIPatient Arena, an evaluation framework that turns electronic health record data into patient-specific knowledge graphs to run realistic multi-turn clinical consultations. It scores large language models on eight dimensions of clinical competence including questioning skills, ethical conduct, information coverage, and diagnostic reasoning, using data from 437 patients plus two validation sets. The tests find solid results in medical interviewing and professional behavior but clear shortfalls in dealing with ambiguous answers, covering all relevant history, and reaching correct diagnoses. These outcomes show that checking only the model's final answer misses repeated interaction problems that arise in actual care. The framework offers a workflow-based way to test medical LLMs before any real-world use.

Core claim

AIPatient Arena applied to LLMs on primary and out-of-distribution patient cohorts shows mean scores of 4.43-4.99/5 in questioning skills, 4.38-4.93/5 in ethical conduct, and 3.80-4.72/5 in explanation clarity, with lower scores of 3.19-4.21/5 in information integration, 3.13-3.78/5 in medication safety, 2.57-3.32/5 in handling ambiguous responses, 2.08-3.02/5 in information coverage, and 2.63-3.55/5 in diagnostic accuracy and reasoning, plus recurrent failures such as repetitive questions and omitted history, establishing that final-answer accuracy alone is insufficient for evaluating clinical readiness.
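
As a reading aid, the reported ranges can be tabulated and ordered. The sketch below uses only the ranges quoted in the core claim; ordering by range midpoint is an illustrative choice made here, not a statistic from the paper, and the names `REPORTED_RANGES`, `midpoint`, and `ranked` are hypothetical.

```python
# Reported mean-score ranges (min, max) on a 5-point scale, copied from the core claim.
REPORTED_RANGES = {
    "QS": (4.43, 4.99),  # questioning skills
    "ET": (4.38, 4.93),  # ethical conduct
    "EX": (3.80, 4.72),  # explanation clarity
    "II": (3.19, 4.21),  # information integration
    "MS": (3.13, 3.78),  # medication safety
    "HR": (2.57, 3.32),  # handling ambiguous responses
    "IC": (2.08, 3.02),  # information coverage
    "Dx": (2.63, 3.55),  # diagnostic accuracy and reasoning
}


def midpoint(bounds):
    """Midpoint of a (low, high) range; an illustrative summary only."""
    low, high = bounds
    return (low + high) / 2


# Order dimensions from highest to lowest range midpoint.
ranked = sorted(REPORTED_RANGES, key=lambda d: midpoint(REPORTED_RANGES[d]), reverse=True)

for dim in ranked:
    low, high = REPORTED_RANGES[dim]
    print(f"{dim}: {low:.2f}-{high:.2f} (midpoint {midpoint((low, high)):.2f})")
```

Under this ordering, questioning skills sit at the top and information coverage at the bottom, which matches the split between structured and dynamic tasks described in the claim.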

What carries the argument

AIPatient Arena, the EHR-grounded framework that builds patient-specific knowledge graphs from electronic health records to support and score multi-turn physician-patient interactions across eight clinical competence dimensions.
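
To make the mechanism concrete, here is a minimal sketch of how a patient-specific knowledge graph could back a simulated patient. Everything in it is an assumption for illustration: the triple layout, the `PatientGraph` class, the relation names, the fallback reply, and the toy clinical values are hypothetical and are not taken from the paper, whose implementation is not described in the text available here.

```python
from dataclasses import dataclass, field


@dataclass
class PatientGraph:
    """Hypothetical patient graph: (patient, relation, value) triples from one record."""
    triples: list = field(default_factory=list)
    asked: set = field(default_factory=set)  # relations the model has asked about so far

    def add(self, relation, value):
        self.triples.append(("patient", relation, value))

    def answer(self, relation):
        """Reply for one relation, or a vague reply when nothing is recorded."""
        self.asked.add(relation)
        values = [v for (_, r, v) in self.triples if r == relation]
        if not values:
            return "I'm not sure."  # stands in for an ambiguous patient response
        return ", ".join(values)

    def coverage(self):
        """Fraction of distinct recorded relations that the model asked about."""
        recorded = {r for (_, r, _) in self.triples}
        return len(recorded & self.asked) / len(recorded) if recorded else 0.0


# Toy values, invented for illustration; not drawn from any cohort.
graph = PatientGraph()
graph.add("symptom", "chest pain")
graph.add("symptom", "shortness of breath")
graph.add("past_medical_history", "hypertension")
graph.add("medication", "lisinopril")

print(graph.answer("symptom"))
print(graph.answer("allergy"))
print(graph.coverage())
```

In this toy setup, a model that never asks about past medical history or medication leaves most of the graph unelicited, which is the kind of gap an information-coverage score would be expected to penalize.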

If this is right

Models perform well on structured tasks like questioning and ethical conduct but show persistent weaknesses on dynamic tasks like ambiguity handling and diagnosis.
Richer conversational context improves diagnostic reasoning scores but produces only limited gains in treatment planning and medication justification.
Recurrent process failures such as omission of past medical history and repetitive questioning appear across tested models and cohorts.
Workflow-oriented, multi-turn evaluation is required to assess clinical utility beyond static or single-turn benchmarks.
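
One of the process failures named above, repetitive questioning, can be illustrated with a crude detector. This is a sketch under stated assumptions: the paper's own detection method is not given in this text, and the helper names `normalize` and `repeated_questions`, along with the exact-match-after-normalization rule, are hypothetical simplifications that would miss paraphrased repeats.

```python
import re


def normalize(question):
    """Lowercase and drop punctuation so trivially reworded repeats compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def repeated_questions(physician_turns):
    """Return the turns whose normalized text already appeared earlier in the dialogue."""
    seen = set()
    repeats = []
    for turn in physician_turns:
        key = normalize(turn)
        if key in seen:
            repeats.append(turn)
        seen.add(key)
    return repeats


# Toy transcript, invented for illustration.
turns = [
    "Do you have any chest pain?",
    "Are you taking any medications?",
    "Do you have any chest pain",
]
print(repeated_questions(turns))
```

A single final diagnosis would not reveal this behavior; it only shows up when the whole sequence of turns is inspected.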

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-construction method could be reused to create training environments that target specific weaknesses like uncertainty handling before any deployment testing.
The identified gaps suggest value in hybrid setups that pair LLMs with separate verification tools for medication safety and information completeness.
Similar arena-style evaluations might apply to other interactive professional domains that require sequential information gathering under uncertainty.

Load-bearing premise

The patient-specific knowledge graphs built from EHR data accurately represent the sequential, uncertain, and interactive nature of real clinical consultations.

What would settle it

A direct comparison of AIPatient Arena scores against performance in supervised real-patient trials or live clinical simulations where the same models interact with actual patients.

Original abstract

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AIPatient Arena adds a multi-turn EHR-KG setup for scoring clinical LLM consultations on eight dimensions and flags real weaknesses in ambiguity handling, but the simulation itself lacks external validation.

The main takeaway is that this paper builds AIPatient Arena to run multi-turn doctor-patient simulations from real EHR data turned into patient knowledge graphs, then scores LLMs across eight competence areas on cohorts of 437 plus two smaller validation sets. It reports solid numbers on questioning skills and ethics but lower ones on handling ambiguous responses, information coverage, and diagnostic reasoning, plus some interaction failure patterns like repetitive questions.

What is actually new is the specific end-to-end workflow framing with those eight dimensions tied to EHR-derived graphs rather than static single-turn medical QA. The scale and the out-of-distribution cohorts are useful, and the concrete mean scores plus listed failure modes give a clearer picture than pure accuracy metrics.

The soft spot is exactly the one in the stress-test note: no evidence is given that the knowledge graphs reproduce the uncertain, incomplete, back-and-forth character of actual consultations. Without expert review of graph fidelity or comparison to real transcripts, the low scores on HR, IC, and Dx could stem from an overly deterministic simulation instead of model shortcomings. Scoring rubrics, inter-rater reliability, and how ambiguous cases were handled are also not detailed enough in the abstract to fully support the claims.

This is for groups working on clinical LLM evaluation and safety checks who already care about moving past single-turn tests. A reader looking for process-oriented benchmarks will get value from the framework and the observed failure modes. It deserves a serious referee because the core idea addresses a real gap even if the validation steps need more work.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AIPatient Arena, an EHR-grounded evaluation framework that converts patient EHR data into knowledge graphs to support multi-turn simulated physician-patient interactions. It evaluates LLMs across eight clinical competence dimensions on a primary cohort of 437 patients plus two out-of-distribution cohorts (119 and 67 patients), reporting high performance in questioning skills (QS: 4.43-4.99/5), ethical conduct (ET: 4.38-4.93/5), and explanations (EX: 3.80-4.72/5), moderate results in information integration and medication safety, and lower scores in handling ambiguity (HR: 2.57-3.32/5), information coverage (IC: 2.08-3.02/5), and diagnostic reasoning (Dx: 2.63-3.55/5). The work identifies recurrent failures such as repetitive questioning and inadequate uncertainty handling, and concludes that final-answer accuracy alone is insufficient for assessing clinical readiness.

Significance. If the patient-specific knowledge graphs validly instantiate real consultation dynamics, the framework supplies a concrete, workflow-oriented evaluation method that surfaces interaction-level weaknesses missed by static benchmarks. The use of defined cohorts, explicit dimension-wise mean scores, and enumeration of recurrent failure modes constitutes a useful empirical contribution for pre-deployment assessment of medical LLMs.

major comments (2)

[Framework description and methods (abstract and § on AIPatient Arena)] The central claim—that process-based evaluation via AIPatient Arena demonstrates clinically meaningful weaknesses beyond final-answer accuracy—depends on the assumption that EHR-derived knowledge graphs faithfully reproduce the sequential, uncertain, and interactive character of real consultations. The manuscript provides no external validation of this assumption (e.g., expert review of graph completeness against source EHR notes or side-by-side comparison of simulated versus real multi-turn transcripts on the same patients).
[Evaluation methodology and results (abstract and cohort evaluation sections)] The reported mean scores across the eight dimensions (particularly the low scores on HR, IC, and Dx) are presented without accompanying details on scoring rubrics, operationalization of ambiguous responses, or inter-rater reliability, leaving the quantitative empirical claims under-supported.
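
For concreteness, one agreement statistic that could be reported for rubric scores is Cohen's kappa between two raters. The sketch below is an assumption about what such a check might look like, not a description of the authors' protocol: the manuscript text available here does not say which statistic, how many raters, or whether raters were human or model-based. The function name `cohens_kappa` and the toy scores are hypothetical, and the unweighted form is used for brevity even though ordinal 1-5 scales are often analyzed with a weighted variant.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy scores on a 1-5 rubric, invented for illustration only.
scores_a = [3, 2, 4, 3, 2, 3]
scores_b = [3, 2, 3, 3, 2, 4]
print(round(cohens_kappa(scores_a, scores_b), 3))  # prints 0.455
```

Here the raters agree on 4 of 6 items, but 14/36 agreement is expected by chance given their score distributions, so kappa is 10/22, noticeably lower than the raw agreement rate.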

minor comments (1)

[Abstract] The abstract states performance ranges but does not indicate how many LLMs were tested or which specific models correspond to the reported score ranges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the work.

Referee: [Framework description and methods (abstract and § on AIPatient Arena)] The central claim—that process-based evaluation via AIPatient Arena demonstrates clinically meaningful weaknesses beyond final-answer accuracy—depends on the assumption that EHR-derived knowledge graphs faithfully reproduce the sequential, uncertain, and interactive character of real consultations. The manuscript provides no external validation of this assumption (e.g., expert review of graph completeness against source EHR notes or side-by-side comparison of simulated versus real multi-turn transcripts on the same patients).

Authors: We acknowledge that the manuscript does not include external validation such as expert review of graph completeness or direct comparisons with real multi-turn transcripts. The framework derives patient-specific knowledge graphs directly from EHR data to instantiate the consultation context, which provides grounding beyond synthetic patients. However, we agree this leaves the fidelity assumption under-supported for the central claim. In revision, we will add an explicit limitations subsection discussing the assumption and outlining plans for future expert validation studies. We maintain that the EHR-grounded approach offers a meaningful advance over static benchmarks, but accept that additional validation evidence is required.
revision: yes

Referee: [Evaluation methodology and results (abstract and cohort evaluation sections)] The reported mean scores across the eight dimensions (particularly the low scores on HR, IC, and Dx) are presented without accompanying details on scoring rubrics, operationalization of ambiguous responses, or inter-rater reliability, leaving the quantitative empirical claims under-supported.

Authors: The eight dimensions are operationalized with specific rubrics in the methods section, and ambiguous responses were defined as patient statements requiring clarification or lacking sufficient detail for diagnosis. We recognize that these details are not presented with sufficient explicitness or supporting reliability metrics in the current version. In the revision, we will expand the methods to include the complete rubrics (as supplementary material if needed), clarify the operationalization of ambiguous cases, and report any inter-rater reliability statistics from the evaluation protocol. This will better substantiate the reported dimension scores.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with direct observations

The paper is a purely empirical evaluation study. It constructs patient-specific knowledge graphs from EHR data and reports observed performance scores across eight dimensions on fixed cohorts. No equations, fitted parameters, predictions derived from subsets of data, or self-citation chains appear in the provided text. All reported means (e.g., QS 4.43-4.99, HR 2.57-3.32) are presented as direct measurements from the framework rather than outputs forced by construction. The central claim that process-based metrics reveal weaknesses beyond final-answer accuracy rests on the evaluation results themselves, not on any reduction to inputs or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new evaluation framework without additional free parameters, mathematical axioms, or invented physical entities beyond the framework itself.

invented entities (1)

AIPatient Arena framework · no independent evidence
purpose: To enable EHR-grounded multi-turn evaluation of LLMs across eight clinical competence dimensions
Newly proposed evaluation system whose validity rests on the untested assumption that the constructed knowledge graphs faithfully model real consultations.

pith-pipeline@v0.9.1-grok · 5941 in / 1136 out tokens · 35543 ms · 2026-06-27T01:29:32.947745+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking
cs.AI 2026-06 unverdicted novelty 5.0

A provenance-aware multi-agent framework integrates community posts and regulatory records for nine antidepressants into a traceable Neo4j knowledge graph using standard medical vocabularies.

Reference graph

Works this paper leans on

26 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Together, these results suggest that current LLMs may reproduce the outward structure of clinical consultations while still exhibiting important limitations in information synthesis, ambiguity resolution, and clinically reliable decision-making. These weaknesses are broadly consistent with recent efforts to move medical LLM evaluation beyond static examin...

2001
[2]

To derive an overall consultation performance score, dimension-level scores were aggregated using a predefined weighted average. Diagnostic reasoning and medication safety and justification were assigned the highest weights (0.25 and 0.20, respectively) because of their direct relevance to clinical decision-making and patient safety. Medical interview que...

2025
[3]

External validation To evaluate the robustness and generalizability of AIPatient Arena beyond the primary CCQA evaluation setting, we conducted external validation on two independent cohorts representing different degrees of distribution shift relative to CCQA. 29 The first external cohort was derived from the publicly available PMC-Patients dataset, whic...

2024
[4]

Zhang, T. et al. Automatic prompt design via particle swarm optimization driven LLM for efficient medical information extraction. Swarm Evol. Comput. 95, 101922 (2025)

2025
[5]

Chen, K. et al. MDTeamGPT: A self-evolving LLM-based multi-agent framework for multi-Disciplinary Team medical consultation. arXiv [cs.AI] (2025)

2025
[6]

Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 6, e2343689 (2023)

2023
[7]

Goodman, K. E. et al. Identification of long-term care facility residence from admission notes using large language models. JAMA Netw. Open 8, e2512032 (2025)

2025
[8]

Goodman, K. E., Yi, P. H. and Morgan, D. J. AI-generated clinical summaries require more than accuracy. JAMA 331, 637–638 (2024)

2024
[9]

Bedi, S. et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nat. Med. 32, 943–951 (2026)

2026
[10]

Arora, R. K. et al. HealthBench: Evaluating large language models towards improved human health. arXiv [cs.CL] (2025) doi:10.48550/arXiv.2505.08775

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.08775 2025
[11]

Asgari, E. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digit. Med. 8, 274 (2025)

2025
[12]

Feldman, M. J. et al. Dedicated AI expert system vs generative AI with large language model for clinical diagnoses. JAMA Netw. Open 8, e2512994 (2025)

2025
[13]

Vrdoljak, J. et al. Evaluating large language and large reasoning models as decision support tools in emergency internal medicine. Comput. Biol. Med. 192, 110351 (2025)

2025
[14]

Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med. 31, 77–86 (2025)

2025
[15]

Yu, H. et al. Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical education. Commun. Med. (Lond.) (2025) doi:10.1038/s43856-025-01283-x

work page doi:10.1038/s43856-025-01283-x 2025
[16]

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016)

2016
[17]

Singh, A. et al. OpenAI GPT-5 System Card. arXiv [cs.CL] (2025) doi:10.48550/arXiv.2601.03267

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.03267 2025
[18]

GPT-4o System Card

OpenAI et al. GPT-4o System Card. arXiv [cs.CL] (2024)

2024
[19]

Claude Sonnet 4.6 System Card

Anthropic. Claude Sonnet 4.6 System Card. https://www.anthropic.com/claude-sonnet-4-6-system-card (2026)

2026
[20]

https://www.anthropic.com/claude-4-system-card (2025)

2025
[21]

DeepSeek-V3 Technical Report

DeepSeek-AI et al. DeepSeek-V3 Technical Report. arXiv [cs.CL] (2024)

2024
[22]

Yang, A. et al. Qwen3 Technical Report. arXiv [cs.CL] (2025)

2025
[23]

Sellergren, A. et al. MedGemma Technical Report. arXiv [cs.AI] (2025)

2025
[24]

Chen, J. et al. Towards medical complex reasoning with LLMs through medical verifiable problems. in Findings of the Association for Computational Linguistics: ACL 2025 14552–14573 (Association for Computational Linguistics, Stroudsburg, PA, USA, 2025)

2025
[25]

Baichuan-M2: Scaling medical capability with large verifier system

M2 Team et al. Baichuan-M2: Scaling medical capability with large verifier system. arXiv [cs.LG] (2025)

2025
[26]

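Method and system for evaluating large language models in simulated clinical consultations based on electronic medical records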

Bardhan, J., Colas, A., Roberts, K. and Wang, D. Z. DrugEHRQA: A question answering dataset on structured and unstructured electronic health records for medicine related queries. arXiv [cs.AI] (2022)

Author information. Contributions: J.N.: Writing - review and editing, Writing - original draft, Validation, Software, Resources, Methodology, Investigation, ...

2022
