arxiv: 2606.28900 · v1 · pith:SAWPJNJMnew · submitted 2026-06-27 · 💻 cs.AI · cs.CL

MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes

Pith reviewed 2026-06-30 09:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords MedEvoEval · doctor agents · continual evolution · simulated clinical episodes · longitudinal evaluation · action-gated evidence · medical decision systems · memory retention

The pith

MedEvoEval supplies an executable framework of action-gated simulated episodes to measure whether doctor agents improve, transfer skills, and retain capabilities across repeated clinical encounters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedEvoEval as a longitudinal evaluation setup that turns source cases into role-specific patient, examination, and manager views. Evidence appears only after valid agent actions, and every episode produces a structured trace connecting observations, decisions, scores, and optional memory updates. Existing medical QA benchmarks score final answers on complete inputs, while most interactive tests stay limited to single encounters, so they cannot track how experience changes behavior over time. If the framework works, researchers can run controlled tests of memory maturation, held-out transfer, and backward retention that current methods miss.

Core claim

MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time. Each source case becomes executable outpatient episodes with action-gated evidence access; structured traces record the full sequence of observations, actions, final outputs, and manager scores, plus optional experience write-back. Experiments with the released artifact of 700 episodes demonstrate that these traces surface process costs invisible to final-answer scoring, show resource reallocation under MDT-style consultation, and enable direct longitudinal analyses of memory, transfer, update response, and retention.

What carries the argument

MedEvoEval, an executable longitudinal evaluation framework that converts cases into role-specific views, gates evidence behind valid actions, and records structured traces of observations, actions, scores, and experience updates.
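
As a rough illustration of action gating and trace recording, the following minimal sketch shows one way such an episode loop could look. It is not the released runner. The four action names are those listed in the paper's doctor prompt; the `Episode` class, its fields, the stubbed consultant note, and the toy case are all hypothetical.

```python
from dataclasses import dataclass, field

# Action names follow the paper's doctor prompt; everything else in this
# sketch (class name, fields, case layout) is a hypothetical illustration.
ACTIONS = {"ASK", "REQUEST_TEST", "CALL_MDT", "FINALIZE"}


@dataclass
class Episode:
    patient_view: dict            # hidden history facts, keyed by topic
    exam_view: dict               # stored test results, keyed by test name
    max_turns: int = 8            # assumed turn limit
    trace: list = field(default_factory=list)
    done: bool = False

    def step(self, action: str, arg: str = "") -> str:
        """Apply one doctor action and return only the evidence it unlocks."""
        if self.done:
            raise RuntimeError("episode already finished")
        if action not in ACTIONS:
            observation = "invalid action"        # invalid actions reveal nothing
        elif action == "ASK":
            observation = self.patient_view.get(arg, "I am not sure.")
        elif action == "REQUEST_TEST":
            observation = self.exam_view.get(arg, "test not available")
        elif action == "CALL_MDT":
            observation = "consultant note (stub)"
        else:  # FINALIZE
            observation = "episode closed"
            self.done = True
        # Every step is logged, so the trace links actions to observations.
        self.trace.append({"turn": len(self.trace) + 1, "action": action,
                           "arg": arg, "observation": observation})
        if len(self.trace) >= self.max_turns:
            self.done = True                      # turn limit forces termination
        return observation


# Toy case with made-up content, for illustration only.
ep = Episode(patient_view={"fever": "Three days of fever."},
             exam_view={"cbc": "WBC elevated"})
ep.step("ASK", "fever")
ep.step("REQUEST_TEST", "cbc")
ep.step("FINALIZE", "bacterial infection")
```

The point of the design is that the agent never sees `patient_view` or `exam_view` directly: each fact reaches it only as the return value of a valid action, and the same call that reveals it also writes the trace entry.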

If this is right

  • Episode traces reveal process costs such as unnecessary consultations or repeated tests that final-answer scoring conceals.
  • MDT-style consultation changes how agents allocate examination and manager resources within an episode.
  • Longitudinal runs can quantify memory maturation, held-out transfer of behaviors, response to update stages, and backward retention of prior capabilities.
  • The released artifact supplies schemas, runner, scoring scripts, and derivatives that let others replicate or extend the same measurements.
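
To make the first and third implications concrete, here is a hedged sketch of how process costs and backward retention might be computed from traces and per-case scores. The function names, the trace field names (`action`, `arg`), and the definition of retention as a mean score change on earlier cases are assumptions for illustration, not the paper's scoring scripts. The trace and scores are placeholder values, not results.

```python
from collections import Counter


def process_costs(trace):
    """Count process-level costs that a final-answer score would not show."""
    tests = [step["arg"] for step in trace if step["action"] == "REQUEST_TEST"]
    # A test ordered n times contributes n - 1 repeats.
    repeated = sum(n - 1 for n in Counter(tests).values() if n > 1)
    return {
        "turns": len(trace),
        "tests": len(tests),
        "repeated_tests": repeated,
        "mdt_calls": sum(1 for step in trace if step["action"] == "CALL_MDT"),
    }


def backward_retention(before, after):
    """Mean score change on earlier cases after an update.

    Negative values indicate that earlier capabilities degraded.
    """
    shared = sorted(set(before) & set(after))
    if not shared:
        raise ValueError("no shared cases to compare")
    return sum(after[case] - before[case] for case in shared) / len(shared)


# Placeholder trace and scores, for illustration only.
trace = [
    {"action": "ASK", "arg": "fever"},
    {"action": "REQUEST_TEST", "arg": "cbc"},
    {"action": "REQUEST_TEST", "arg": "cbc"},
    {"action": "CALL_MDT", "arg": ""},
    {"action": "FINALIZE", "arg": "bacterial infection"},
]
costs = process_costs(trace)
delta = backward_retention({"c1": 0.75, "c2": 0.5}, {"c1": 0.5, "c2": 0.5})
```

In this toy run the final diagnosis could be scored as correct while `costs` still reports one repeated test and one consultation, which is the kind of gap between outcome and process that the first bullet describes.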

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agents demonstrate measurable improvement on the framework, the same trace format could be adapted to test continual learning in other sequential decision domains such as legal case handling.
  • The action-gated design makes it possible to insert controlled interventions at specific steps and measure their downstream effect on later episodes.
  • Release of the full E&D artifact lowers the barrier for comparing different memory or reflection mechanisms under identical longitudinal conditions.

Load-bearing premise

The simulated outpatient episodes with action-gated evidence access and structured traces accurately model real clinical information gathering and decision evolution without introducing artifacts that distort measurements of agent improvement or retention.

What would settle it

Run the same set of agents on MedEvoEval and on a matched set of real outpatient cases; if the ranking of agents by improvement and retention differs substantially between the two, the simulation's validity for measuring continual evolution is challenged.
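
One way to quantify whether the two rankings "differ substantially" is a rank-agreement statistic. The sketch below computes Kendall's tau-a over agents scored in both settings. The choice of statistic, the function name, and the agent names are assumptions, and the scores are placeholder numbers rather than results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by agent name."""
    agents = sorted(set(scores_a) & set(scores_b))
    if len(agents) < 2:
        raise ValueError("need at least two shared agents")
    concordant = discordant = 0
    for x, y in combinations(agents, 2):
        # Positive product: both settings order this pair the same way.
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(agents) * (len(agents) - 1) // 2
    return (concordant - discordant) / pairs


# Placeholder improvement scores, for illustration only.
simulated = {"agent_a": 0.12, "agent_b": 0.05, "agent_c": -0.02}
real = {"agent_a": 0.08, "agent_b": 0.10, "agent_c": -0.01}
tau = kendall_tau(simulated, real)
```

A tau near 1 would mean the simulation orders agents much as the real cases do; a value near 0 or below would support the concern that the simulated ranking does not carry over. With only a few agents the statistic is coarse, so any threshold for "substantially" would need to be set in advance.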

Figures

Figures reproduced from arXiv: 2606.28900 by Hui Zhang.

Figure 1: Overview of MedEvoEval. Source cases are transformed into role-specific episode views, …

Figure 2: MedEvoEval experiment overview. (A) Episode traces expose outcome and process metrics beyond answer-only scoring. (B) MDT-style consultation tests resource reallocation with the primary doctor fixed. (C) Mature memory is evaluated on the longitudinal stream and held-out transfer cases. (D) Update-retention diagnostics separate adaptation from backward degradation.

Figure 3: Role-specific case views and visibility constraints. Patient and examination information is …

Figure 4: Model-comparison evidence on 80 shared episodes. (A) Outcome components separate …

Figure 5: MDT decomposition on 100 shared episodes. (A) Outcome and utility compare no …

Figure 6: Stage-level memory curves. (A) Total score improves mainly after memory matures. (B) …

Figure 7: External transfer comparison between H0 and H3 on 100 held-out episodes.

Figure 8: Update and retention diagnostics. (A) Stage total score separates pre-update, update, …

Figure 9: Memory ablations on the 100-case mechanism subset. (A) Memory-content comparison …

Original abstract

Doctor agents are moving beyond single-turn answer generation toward evolving clinical decision systems. Within an outpatient episode, they acquire evidence, use examination and consultation resources, and decide when to finalize a diagnosis and management plan. Across episodes, their behavior may change through memory, retrieval, reflection, or other update mechanisms. Current evaluations only partially cover this setting. Fixed-input medical QA benchmarks score final answers from complete inputs, whereas many interactive benchmarks still focus on individual encounters or fixed runs, providing limited support for evaluating how episode-level decisions interact with cross-episode experience. We introduce MedEvoEval, an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views; evidence is revealed only through valid actions; and each episode records a structured trace that links observations, actions, final outputs, manager scores, and optional experience write-back. We release a runnable E&D artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, analysis code, and trajectory- and step-level derivatives. Experiments show that episode traces expose process costs hidden by final-answer scoring, show how MDT-style consultation reallocates resources, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention. Together, these results show that MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MedEvoEval, an executable longitudinal evaluation framework for doctor agents based on action-gated simulated outpatient episodes. Each source case is converted into role-specific patient, examination, and manager views with evidence revealed only through valid actions; episodes record structured traces linking observations, actions, final outputs, manager scores, and optional experience write-back. The authors release a runnable artifact with 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, configurations, example logs, and analysis code. Experiments illustrate that traces expose process costs hidden by final-answer scoring, show resource reallocation under MDT-style consultation, and support longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention, thereby providing a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities.

Significance. If the simulation mechanics produce measurements that reflect transferable clinical behavior rather than artifacts, the framework fills a gap between fixed-input medical QA benchmarks and single-encounter interactive evaluations by enabling assessment of cross-episode evolution. The release of the full E&D artifact (700 episodes, runner, scoring scripts, provenance, and analysis code) is a clear strength that supports reproducibility and extension by other researchers.

major comments (1)
  1. [Episode construction and trace recording (abstract and methods)] The central claim that episode traces support valid longitudinal analyses of improvement, transfer, and retention requires that the action-gated outpatient episodes and structured traces accurately model real clinical information gathering without introducing distorting artifacts. The manuscript describes conversion of source cases into role-specific views with evidence revealed only via valid actions and manager scoring, but reports no external anchoring such as correlation of trace statistics with real physician sequences or expert realism ratings. This is load-bearing for the claim in the abstract and the experiments section on longitudinal analyses.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger grounding of the simulation's fidelity. We address the major comment below and will revise the manuscript accordingly.

  1. Referee: [Episode construction and trace recording (abstract and methods)] The central claim that episode traces support valid longitudinal analyses of improvement, transfer, and retention requires that the action-gated outpatient episodes and structured traces accurately model real clinical information gathering without introducing distorting artifacts. The manuscript describes conversion of source cases into role-specific views with evidence revealed only via valid actions and manager scoring, but reports no external anchoring such as correlation of trace statistics with real physician sequences or expert realism ratings. This is load-bearing for the claim in the abstract and the experiments section on longitudinal analyses.

    Authors: We agree that external anchoring (e.g., correlation of trace statistics with real physician sequences or expert realism ratings) is absent from the current manuscript and would strengthen claims about the absence of distorting artifacts. The framework prioritizes controlled, reproducible simulation with action-gated evidence release and manager scoring to enable longitudinal tracking; source cases are converted into role-specific views with provenance documented in the released artifact. However, this design choice means we do not claim direct ecological validity. In revision we will (1) add an explicit Limitations subsection in the Discussion that addresses simulation fidelity, the lack of real-world correlation data, and reliance on internal consistency mechanisms, and (2) qualify the abstract and experiments section to state that MedEvoEval supplies a controlled basis for evaluating agent evolution rather than asserting direct modeling of real clinical sequences.

    revision: yes

Circularity Check

0 steps flagged

Framework proposal exhibits no circularity in derivation chain

The paper introduces MedEvoEval as an executable longitudinal evaluation framework based on action-gated simulated outpatient episodes, with released artifacts including 700 episodes, schemas, runner, and analysis code. No mathematical derivations, parameter fittings, predictions of quantities from fitted inputs, or load-bearing self-citations appear in the text. The central claim—that the framework provides a concrete basis for evaluating agent improvement, transfer, and retention—is presented as a direct consequence of the defined simulation mechanics and trace structures, without reducing to self-defined quantities or prior author results by construction. This is a standard framework proposal whose validity rests on external validation of simulation fidelity rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a simulation framework rather than new entities or parameters; the primary unverified premise is the fidelity of the simulated episodes to real clinical processes.

axioms (1)
  • domain assumption: Converting source cases into role-specific patient, examination, and manager views with action-gated evidence produces traces that meaningfully reflect clinical decision dynamics.
    This premise underpins the entire evaluation design described in the abstract.

pith-pipeline@v0.9.1-grok · 5794 in / 1263 out tokens · 43810 ms · 2026-06-30T09:30:47.780432+00:00 · methodology
