MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes

Hui Zhang

arxiv: 2606.28900 · v1 · pith:SAWPJNJMnew · submitted 2026-06-27 · 💻 cs.AI · cs.CL


Pith reviewed 2026-06-30 09:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords: MedEvoEval, doctor agents, continual evolution, simulated clinical episodes, longitudinal evaluation, action-gated evidence, medical decision systems, memory retention


The pith

MedEvoEval supplies an executable framework of action-gated simulated episodes to measure whether doctor agents improve, transfer skills, and retain capabilities across repeated clinical encounters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedEvoEval as a longitudinal evaluation setup that turns source cases into role-specific patient, examination, and manager views. Evidence appears only after valid agent actions, and every episode produces a structured trace connecting observations, decisions, scores, and optional memory updates. Existing medical QA benchmarks score final answers on complete inputs, while most interactive tests stay limited to single encounters, so they cannot track how experience changes behavior over time. If the framework works, researchers can run controlled tests of memory maturation, held-out transfer, and backward retention that current methods miss.
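The gating rule described above can be illustrated with a minimal Python sketch. It is not the released episode runner: the `Episode` class, its field names, and the placeholder observations are hypothetical, and only the four action names come from the paper's doctor prompt. The point is that hidden views are read only inside `step`, and every call, valid or not, leaves a record in the trace.

```python
from dataclasses import dataclass, field

# Action names as listed in the paper's doctor prompt.
VALID_ACTIONS = {"ASK", "REQUEST_TEST", "CALL_MDT", "FINALIZE"}


@dataclass
class Episode:
    """Toy action-gated episode (hypothetical): evidence stays hidden until a valid action asks for it."""
    patient_view: dict  # hidden history facts, keyed by topic
    exam_view: dict     # hidden test results, keyed by test name
    trace: list = field(default_factory=list)

    def step(self, action: str, argument: str = "") -> str:
        if action not in VALID_ACTIONS:
            observation = "invalid action"
        elif action == "ASK":
            observation = self.patient_view.get(argument, "I am not sure.")
        elif action == "REQUEST_TEST":
            observation = self.exam_view.get(argument, "test not available")
        elif action == "CALL_MDT":
            observation = "consultant note (placeholder)"
        else:  # FINALIZE
            observation = "episode closed"
        # Every step is logged, so process can be scored separately from the final answer.
        self.trace.append({"action": action, "argument": argument, "observation": observation})
        return observation


episode = Episode(
    patient_view={"fever": "Yes, for three days."},
    exam_view={"chest_xray": "Right lower lobe infiltrate."},
)
episode.step("ASK", "fever")
episode.step("REQUEST_TEST", "chest_xray")
episode.step("PEEK_LABEL")  # not a valid action, so nothing is revealed
episode.step("FINALIZE", "pneumonia")
```

After these four calls the trace holds four records, including the rejected `PEEK_LABEL` attempt, which is the kind of process detail a final-answer score would not show.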

Core claim

MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time. Each source case becomes executable outpatient episodes with action-gated evidence access; structured traces record the full sequence of observations, actions, final outputs, and manager scores, plus optional experience write-back. Experiments with the released artifact of 700 episodes demonstrate that these traces surface process costs invisible to final-answer scoring, show resource reallocation under MDT-style consultation, and enable direct longitudinal analyses of memory, transfer, update response, and retention.

What carries the argument

MedEvoEval, an executable longitudinal evaluation framework that converts cases into role-specific views, gates evidence behind valid actions, and records structured traces of observations, actions, scores, and experience updates.

If this is right

Episode traces reveal process costs such as unnecessary consultations or repeated tests that final-answer scoring conceals.
MDT-style consultation changes how agents allocate examination and manager resources within an episode.
Longitudinal runs can quantify memory maturation, held-out transfer of behaviors, response to update stages, and backward retention of prior capabilities.
The released artifact supplies schemas, runner, scoring scripts, and derivatives that let others replicate or extend the same measurements.
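As a rough illustration of the first point, the sketch below counts process costs from a trace. The record layout (`action`, `argument`) and the cost names are assumptions made for this example, not the schema of the released artifact.

```python
from collections import Counter


def process_costs(trace):
    """Summarize process-level costs from a list of step records (hypothetical layout)."""
    tests = [s["argument"] for s in trace if s["action"] == "REQUEST_TEST"]
    test_counts = Counter(tests)
    return {
        "turns": len(trace),
        "tests": len(tests),
        # A test ordered n times contributes n - 1 repeats.
        "repeated_tests": sum(n - 1 for n in test_counts.values() if n > 1),
        "consultations": sum(1 for s in trace if s["action"] == "CALL_MDT"),
    }


# Two invented traces that end in the same final diagnosis.
trace_a = [
    {"action": "ASK", "argument": "fever"},
    {"action": "REQUEST_TEST", "argument": "cbc"},
    {"action": "FINALIZE", "argument": "pneumonia"},
]
trace_b = [
    {"action": "REQUEST_TEST", "argument": "cbc"},
    {"action": "REQUEST_TEST", "argument": "cbc"},
    {"action": "CALL_MDT", "argument": ""},
    {"action": "CALL_MDT", "argument": ""},
    {"action": "FINALIZE", "argument": "pneumonia"},
]
costs_a = process_costs(trace_a)
costs_b = process_costs(trace_b)
```

Both hypothetical traces end in the same diagnosis, so answer-only scoring would tie them, while the second carries one repeated test and two consultations.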

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If agents demonstrate measurable improvement on the framework, the same trace format could be adapted to test continual learning in other sequential decision domains such as legal case handling.
The action-gated design makes it possible to insert controlled interventions at specific steps and measure their downstream effect on later episodes.
Release of the full E&D artifact lowers the barrier for comparing different memory or reflection mechanisms under identical longitudinal conditions.

Load-bearing premise

The simulated outpatient episodes with action-gated evidence access and structured traces accurately model real clinical information gathering and decision evolution without introducing artifacts that distort measurements of agent improvement or retention.

What would settle it

Run the same set of agents on MedEvoEval and on a matched set of real outpatient cases; if the ranking of agents by improvement and retention differs substantially between the two, the simulation's validity for measuring continual evolution is challenged.
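One way to operationalize "differs substantially" is a rank correlation between the two agent orderings. The sketch below implements Spearman's coefficient in plain Python. The score lists are invented placeholders, and the choice of Spearman over, say, Kendall's tau is an assumption, not something the paper specifies.

```python
def ranks(scores):
    """Rank values from 1 (lowest) upward, averaging ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-agent improvement scores; not results from the paper.
simulated = [0.12, 0.05, 0.20, 0.01]
real_world = [0.10, 0.07, 0.15, 0.02]
rho = spearman(simulated, real_world)
```

A coefficient near 1 would indicate that the simulation preserves the agent ordering. What threshold counts as a substantial difference is left to the analyst, and the function is undefined when either list is constant.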

Figures

Figures reproduced from arXiv: 2606.28900 by Hui Zhang.

**Figure 2.** MedEvoEval experiment overview. (A) Episode traces expose outcome and process metrics beyond answer-only scoring. (B) MDT-style consultation tests resource reallocation with the primary doctor fixed. (C) Mature memory is evaluated on the longitudinal stream and held-out transfer cases. (D) Update-retention diagnostics separate adaptation from backward degradation.

**Figure 3.** Role-specific case views and visibility constraints. Patient and examination information is …

**Figure 4.** Model-comparison evidence on 80 shared episodes. (A) Outcome components separate …

**Figure 5.** MDT decomposition on 100 shared episodes. (A) Outcome and utility compare no …

**Figure 6.** Stage-level memory curves. (A) Total score improves mainly after memory matures. (B) …

**Figure 7.** External transfer comparison between H0 and H3 on 100 held-out episodes.

**Figure 8.** Update and retention diagnostics. (A) Stage total score separates pre-update, update, …

**Figure 9.** Memory ablations on the 100-case mechanism subset. (A) Memory-content comparison …

Original abstract

Doctor agents are moving beyond single-turn answer generation toward evolving clinical decision systems. Within an outpatient episode, they acquire evidence, use examination and consultation resources, and decide when to finalize a diagnosis and management plan. Across episodes, their behavior may change through memory, retrieval, reflection, or other update mechanisms. Current evaluations only partially cover this setting. Fixed-input medical QA benchmarks score final answers from complete inputs, whereas many interactive benchmarks still focus on individual encounters or fixed runs, providing limited support for evaluating how episode-level decisions interact with cross-episode experience. We introduce MedEvoEval, an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views; evidence is revealed only through valid actions; and each episode records a structured trace that links observations, actions, final outputs, manager scores, and optional experience write-back. We release a runnable E&D artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, analysis code, and trajectory- and step-level derivatives. Experiments show that episode traces expose process costs hidden by final-answer scoring, show how MDT-style consultation reallocates resources, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention. Together, these results show that MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time.
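The longitudinal quantities named in the abstract can be made concrete with simple score differences. The definitions below are assumptions for illustration, since the abstract does not give the paper's formulas: backward retention is taken as the mean score change on earlier-stage cases after an update, and transfer gain as the mean difference on held-out cases between a memory-equipped run and a baseline run.

```python
from statistics import mean


def backward_retention(before, after):
    """Mean change on earlier-stage cases after an update; negative values indicate forgetting."""
    shared = before.keys() & after.keys()
    return mean(after[c] - before[c] for c in shared)


def transfer_gain(baseline_scores, memory_scores):
    """Mean held-out score of the memory-equipped run minus that of the baseline run."""
    return mean(memory_scores) - mean(baseline_scores)


# Hypothetical manager scores in [0, 1]; not results from the paper.
stage1_before_update = {"case_01": 0.80, "case_02": 0.60}
stage1_after_update = {"case_01": 0.70, "case_02": 0.60}
retention = backward_retention(stage1_before_update, stage1_after_update)
gain = transfer_gain([0.50, 0.60], [0.65, 0.75])
```

With these invented numbers, retention is slightly negative (one earlier case got worse after the update) while the held-out gain is positive, which is the kind of split between adaptation and backward degradation that the update-retention diagnostics are meant to separate.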

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MedEvoEval gives a runnable framework for tracking doctor-agent behavior across episodes with experience updates and releases the code plus 700 cases, but the simulation's match to real clinical practice is not checked.


The core contribution is a longitudinal setup where agents handle sequences of outpatient cases, gather evidence only through allowed actions, record full traces, and can write back experience between episodes. The release includes the runner, schemas, scoring scripts, and derivatives, which lets others run the same tests.

It does a few things cleanly. The traces separate final-answer scores from process costs like extra consultations or steps taken. The experiments illustrate how MDT-style consultation shifts resource use and set up checks for memory maturation, transfer to held-out cases, and retention after updates.

The main gap is validation. Nothing anchors the action-gated episodes or manager scoring to real physician sequences or expert judgments on realism. Without that, any measured improvement or retention could be driven by the particular gating rules or scoring definitions rather than transferable clinical behavior. The abstract gives no quantitative results or details on how the 700 episodes were processed, so the strength of the longitudinal findings is hard to judge from what's here.

This is aimed at groups building and iterating on clinical agents that need to handle repeated interactions. It is worth sending to referees because the framework and artifacts are concrete and fill a documented gap, even though the simulation fidelity needs direct attention in revision.

Referee Report

1 major / 0 minor

Summary. The paper introduces MedEvoEval, an executable longitudinal evaluation framework for doctor agents based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views with evidence revealed only through valid actions; episodes record structured traces linking observations, actions, final outputs, manager scores, and optional experience write-back. The authors release a runnable artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, and analysis code. Experiments illustrate that traces expose process costs hidden by final-answer scoring, show resource reallocation under MDT-style consultation, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention, thereby providing a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities.

Significance. If the simulation mechanics produce measurements that reflect transferable clinical behavior rather than artifacts, the framework fills a gap between fixed-input medical QA benchmarks and single-encounter interactive evaluations by enabling assessment of cross-episode evolution. The release of the full E&D artifact (700 episodes, runner, scoring scripts, provenance, and analysis code) is a clear strength that supports reproducibility and extension by other researchers.

major comments (1)

[Episode construction and trace recording (abstract and methods)] The central claim that episode traces support valid longitudinal analyses of improvement, transfer, and retention requires that the action-gated outpatient episodes and structured traces accurately model real clinical information gathering without introducing distorting artifacts. The manuscript describes conversion of source cases into role-specific views with evidence revealed only via valid actions and manager scoring, but reports no external anchoring such as correlation of trace statistics with real physician sequences or expert realism ratings. This is load-bearing for the claim in the abstract and the experiments section on longitudinal analyses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger grounding of the simulation's fidelity. We address the major comment below and will revise the manuscript accordingly.


Referee: [Episode construction and trace recording (abstract and methods)] The central claim that episode traces support valid longitudinal analyses of improvement, transfer, and retention requires that the action-gated outpatient episodes and structured traces accurately model real clinical information gathering without introducing distorting artifacts. The manuscript describes conversion of source cases into role-specific views with evidence revealed only via valid actions and manager scoring, but reports no external anchoring such as correlation of trace statistics with real physician sequences or expert realism ratings. This is load-bearing for the claim in the abstract and the experiments section on longitudinal analyses.

Authors: We agree that external anchoring (e.g., correlation of trace statistics with real physician sequences or expert realism ratings) is absent from the current manuscript and would strengthen claims about the absence of distorting artifacts. The framework prioritizes controlled, reproducible simulation with action-gated evidence release and manager scoring to enable longitudinal tracking; source cases are converted into role-specific views with provenance documented in the released artifact. However, this design choice means we do not claim direct ecological validity. In revision we will (1) add an explicit Limitations subsection in the Discussion that addresses simulation fidelity, the lack of real-world correlation data, and reliance on internal consistency mechanisms, and (2) qualify the abstract and experiments section to state that MedEvoEval supplies a controlled basis for evaluating agent evolution rather than asserting direct modeling of real clinical sequences.

Revision: yes

Circularity Check

0 steps flagged

Framework proposal exhibits no circularity in derivation chain


The paper introduces MedEvoEval as an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes, with released artifacts including 700 episodes, schemas, runner, and analysis code. No mathematical derivations, parameter fittings, predictions of quantities from fitted inputs, or load-bearing self-citations appear in the text. The central claim—that the framework provides a concrete basis for evaluating agent improvement, transfer, and retention—is presented as a direct consequence of the defined simulation mechanics and trace structures, without reducing to self-defined quantities or prior author results by construction. This is a standard framework proposal whose validity rests on external validation of simulation fidelity rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a simulation framework rather than new entities or parameters; the primary unverified premise is the fidelity of the simulated episodes to real clinical processes.

axioms (1)

Domain assumption: Converting source cases into role-specific patient, examination, and manager views with action-gated evidence produces traces that meaningfully reflect clinical decision dynamics.
This premise underpins the entire evaluation design described in the abstract.



Appendix excerpts

A Artifact Inventory and Responsible Use

This appendix documents the released executable evaluation artifact for MedEvoEval. The artifact contains the s...
Doctor prompt (excerpt)

Output valid JSON only.
Choose exactly one action from: ASK, REQUEST_TEST, CALL_MDT, FINALIZE.
Do not invent symptoms, examination results, laboratory values, imaging findings, or prior diagnoses.
REQUEST_TEST must use an exact test_name from the allowed catalog.
CALL_MDT may ask for diagnostic advice, but the final decision remains your responsibility.
FINALIZE only when evidence is sufficient or the turn limit requires termination. Provide diagnosis, evidence, plan, and followup.
Use retrieved memory cards as experience hints, not as patient facts.

Episode state:
- patient_opening: {patient_opening}
- allowed_tests: {allowed_tests}
- max_total_turns: {max_total_turns}
- max_tests_per_visit: {max_tests_per_visit}
- revealed_history: {revealed_history}
- returned_tests: {returned_tests}
- consultant_notes: {consultant_notes}
- retri...

Patient prompt (excerpt)

Do not diagnose yourself.
Do not reveal examination, laboratory, imaging, or manager-only fields.
If the doctor asks about a listed hidden_history_fact, answer directly.
If the doctor asks about something not specified, say that you are not sure or that you do not recall.
Keep the answer natural, brief, and patient-like.
Output valid JSON only.

hidden_patient_view: {patient_view}
doctor_question: {question}
Return: { "answer": "string", "revealed_facts": ["fact id or short fact text"], "unanswered_reason": "none|not_in_patient_view|unclear_question" }

Prompt B.3: Examination-service prompt

You are an examination-result service for a simulated outpatient episode. Return...

Match the requested test to the closest allowed catalog item.
If there is no allowed match, return returned=false and do not invent a result.
If the test is available, return the stored result verbatim or as a concise paraphrase without adding new findings.
Output valid JSON only.

allowed_tests: {allowed_tests}
exam_view: {exam_view}
requested_test: {requested_test}
Return: { "returned": true, "matched_test_name": "string", "match_type": "exact|alias|none", "result": "string", "reason": "string" }

Prompt B.4: MDT-consultant prompt

You are an MDT consultant in a simulated diagnostic episode. Provide a second ...

You cannot order tests directly.
You cannot access hidden labels or unrevealed examination results.
Recommend at most one next question if more information is needed.
If the evidence is sufficient, recommend FINALIZE and give a concise diagnosis suggestion.
Output valid JSON only.

revealed_context: {revealed_context}
attending_state: {attending_state}

B.3 Structured Output Templates

Template B.1: Doctor structured output

{ "action": "ASK|REQUEST_TEST|CALL_MDT|FINALIZE", "utterance": "string", "question": "string", "test_name": "string", "rationale": ["string"], "working_diagnoses": ["string"], "needs_mdt"...

[1] Mohammad Almansoori, Komal Kumar, and Hisham Cholakkal. Medagentsim: Self-evolving multi-agent simulations for realistic clinical interactions. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 362–372. Springer, 2025.

[2] Zhaoyang Cong, Ziyang Wang, Hao Zhang, Guowei Zheng, Keming Cao, Lina Zhao, Ruipeng Song, Jianqing Li, and Chengyu Liu. Hierarchical multi-scale feature fusion network for multi-center major depressive disorder classification with T1-weighted MRI. In 2025 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), ...

[3] Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, and Fei Huang. Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213. Association for Computational Linguistics, 2025.

[4] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.

[5] Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen. Medagentbench: A realistic virtual ehr environment to benchmark medical llm agents. arXiv preprint arXiv:2501.14654, 2025.

[6] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

[7] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.

[8] Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022.

[9] Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S. Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W. Park. Mdagents: An adaptive collaboration of llms for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024.

[10] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[11] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, and Yixin Wang. Mmedagent: Learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, 2024.

[12] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society. In Advances in Neural Information Processing Systems, volume 36, pages 51991–52008, 2023.

[13] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024.

[14] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

[15] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452, 2023.

[16] Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P. Lungren, et al. Sequential diagnosis with language models. arXiv preprint arXiv:2506.22405, 2025.

[17] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022.

[18] German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.

[19] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023. doi: 10.1145/3586183.3606763

[20] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[21] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in hugging face. In Advances in Neural Information Processing Systems, volume 36, pages 38154–38180, 2023.

[22] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.

[23] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, pages 8634–8652, 2023.

[24] Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic RAG. arXiv preprint arXiv:2501.09136, 2025.

[25] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950, 2025.

[26] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 599–621. Association for Computational Linguistics, 2024.

[27] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[28] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.

[29] Jinjie Wei, Dingkang Yang, Yanshu Li, Qingyao Xu, Zhaoyu Chen, Mingcheng Li, Yue Jiang, Xiaolu Hou, and Lihua Zhang. MedAide: Towards an omni medical aide via specialized LLM-based multi-agent collaboration. arXiv preprint arXiv:2410.12532, 2024.

[30] Chaoyi Wu, Weixiong Lin, Xiaoming Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. PMC-LLaMA: Toward building open-source language models for medicine. Journal of the American Medical Informatics Association, 31(9):1833–1843, 2024. doi: 10.1093/jamia/ocae045

[31] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024.

[32] Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. Continual learning for large language models: A survey. arXiv preprint arXiv:2402.01364, 2024.

[33] Zeyuan Yang, Peng Li, and Yang Liu. Failures pave the way: Enhancing large language models through tuning-free rule accumulation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1751–1777, 2023.

[34] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.

[35] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of retrieval-augmented generation: A survey. In CCF Conference on Big Data, pages 102–120. Springer, 2024.

[36] Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, and Hao Xu. Sensitivity-LoRA: Low-load sensitivity-based fine-tuning for large language models. arXiv preprint arXiv:2509.09119, 2025.

[37] Hao Zhang, Zhenjia Li, Runfeng Bao, Yifan Gao, Xi Xiao, Heng Zhang, Shuyang Zhang, Bo Huang, Yuhang Wu, Tianyang Wang, et al. HyperAdaLoRA: Accelerating LoRA rank allocation during training via hypernetworks without sacrificing performance. arXiv preprint arXiv:2510.02630, 2025.

[38] Hao Zhang, Mengsi Lyu, Zhuo Chen, Xingrun Xing, Yulong Ao, and Yonghua Lin. Pdtrim: Targeted pruning for prefill-decode disaggregation in inference. arXiv preprint arXiv:2509.04467, 2025.

[39] Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, and Yonghua Lin. TrimTokenator: Towards adaptive visual token pruning for large multimodal models. arXiv preprint arXiv:2509.00320, 2025.

[40] Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, and Yonghua Lin. TrimTokenator-LC: Towards adaptive visual token pruning for large multimodal models with long contexts. arXiv preprint arXiv:2512.22748, 2025.

A Artifact Inventory and Responsible Use

This appendix documents the released executable evaluation artifact for MedEvoEval. The artifact contains the s...

- Output valid JSON only
- Choose exactly one action from: ASK, REQUEST_TEST, CALL_MDT, FINALIZE
- Do not invent symptoms, examination results, laboratory values, imaging findings, or prior diagnoses
- REQUEST_TEST must use an exact test_name from the allowed catalog
- CALL_MDT may ask for diagnostic advice, but the final decision remains your responsibility
- FINALIZE only when evidence is sufficient or the turn limit requires termination. Provide diagnosis, evidence, plan, and followup
- Use retrieved memory cards as experience hints, not as patient facts.

Episode state:
- patient_opening: {patient_opening}
- allowed_tests: {allowed_tests}
- max_total_turns: {max_total_turns}
- max_tests_per_visit: {max_tests_per_visit}
- revealed_history: {revealed_history}
- returned_tests: {returned_tests}
- consultant_notes: {consultant_notes}
- retri...
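The doctor-prompt rules and episode-state fields above imply a gate that an environment could apply before revealing any evidence. The following is a minimal sketch of such a gate, not the released implementation: the `EpisodeState` class, the `turns_used` and `tests_used` counters, the reason strings, and the convention that the last available turn must be FINALIZE are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Action names are taken from the doctor prompt.
ALLOWED_ACTIONS = ("ASK", "REQUEST_TEST", "CALL_MDT", "FINALIZE")


@dataclass
class EpisodeState:
    # Field names mirror the episode-state placeholders in the prompt.
    allowed_tests: list
    max_total_turns: int
    max_tests_per_visit: int
    # Hypothetical counters; the source does not say how usage is tracked.
    turns_used: int = 0
    tests_used: int = 0


def gate_action(state, action, test_name=None):
    """Return (ok, reason) for a proposed doctor action."""
    if action not in ALLOWED_ACTIONS:
        return False, "unknown_action"
    # Assumption: on the last available turn only FINALIZE is accepted.
    if state.turns_used >= state.max_total_turns - 1 and action != "FINALIZE":
        return False, "turn_limit_requires_finalize"
    if action == "REQUEST_TEST":
        # The prompt requires an exact test_name from the allowed catalog.
        if test_name not in state.allowed_tests:
            return False, "test_not_in_catalog"
        if state.tests_used >= state.max_tests_per_visit:
            return False, "test_budget_exhausted"
    return True, "ok"
```

Under this sketch a case-mismatched name such as `"cbc"` is rejected when the catalog lists `"CBC"`, which is one reading of the exact-match rule on the doctor side.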

- Do not diagnose yourself
- Do not reveal examination, laboratory, imaging, or manager-only fields
- If the doctor asks about a listed hidden_history_fact, answer directly
- If the doctor asks about something not specified, say that you are not sure or that you do not recall
- Keep the answer natural, brief, and patient-like
- Output valid JSON only.

hidden_patient_view: {patient_view}
doctor_question: {question}

Return:
{
  "answer": "string",
  "revealed_facts": ["fact id or short fact text"],
  "unanswered_reason": "none|not_in_patient_view|unclear_question"
}

Prompt B.3: Examination-service prompt

You are an examination-result service for a simulated outpatient episode. Return...

- Match the requested test to the closest allowed catalog item
- If there is no allowed match, return returned=false and do not invent a result
- If the test is available, return the stored result verbatim or as a concise paraphrase without adding new findings
- Output valid JSON only.

allowed_tests: {allowed_tests}
exam_view: {exam_view}
requested_test: {requested_test}

Return:
{
  "returned": true,
  "matched_test_name": "string",
  "match_type": "exact|alias|none",
  "result": "string",
  "reason": "string"
}

Prompt B.4: MDT-consultant prompt

You are an MDT consultant in a simulated diagnostic episode. Provide a second ...

- You cannot order tests directly
- You cannot access hidden labels or unrevealed examination results
- Recommend at most one next question if more information is needed
- If the evidence is sufficient, recommend FINALIZE and give a concise diagnosis suggestion
- Output valid JSON only.

revealed_context: {revealed_context}
attending_state: {attending_state}

B.3 Structured Output Templates

Template B.1: Doctor structured output
{
  "action": "ASK|REQUEST_TEST|CALL_MDT|FINALIZE",
  "utterance": "string",
  "question": "string",
  "test_name": "string",
  "rationale": ["string"],
  "working_diagnoses": ["string"],
  "needs_mdt"...
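The visible part of Template B.1 can be checked mechanically before a doctor turn is accepted. The sketch below is illustrative only and covers just the fields that survive in the truncated template; `check_doctor_output` is a hypothetical helper, and treating every visible field as required regardless of the chosen action is an assumption. Fields from `needs_mdt` onward are not checked because their types are cut off in the source.

```python
import json

# Action names and field names come from the visible part of Template B.1.
ACTIONS = ("ASK", "REQUEST_TEST", "CALL_MDT", "FINALIZE")
STRING_FIELDS = ("utterance", "question", "test_name")
LIST_FIELDS = ("rationale", "working_diagnoses")


def check_doctor_output(raw):
    """Return a list of problems; an empty list means the visible fields look valid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = []
    # The prompt asks for exactly one action from the fixed set.
    if obj.get("action") not in ACTIONS:
        problems.append("action must be exactly one of " + "|".join(ACTIONS))
    # Assumption: string fields are always present, possibly empty.
    for key in STRING_FIELDS:
        if not isinstance(obj.get(key), str):
            problems.append(f"{key} must be a string")
    for key in LIST_FIELDS:
        value = obj.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key} must be a list of strings")
    return problems
```

A caller could log the returned problems into the episode trace and ask the doctor agent to retry, although how the released artifact handles malformed output is not stated in this section.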