pith. sign in

arxiv: 2606.11702 · v1 · pith:YGPZAUOInew · submitted 2026-06-10 · 💻 cs.CV · cs.AI· cs.CL

MedCTA: A Benchmark for Clinical Tool Agents

Pith reviewed 2026-06-27 10:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords MedCTAclinical tool agentsmultimodal modelsagent reliabilitymedical AItool use benchmarkprocess-aware evaluation
0
0 comments X

The pith

Even frontier multimodal models remain brittle in multi-step clinical tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedCTA, a benchmark of 107 clinician-verified clinical tasks that require agents to retrieve tools, acquire evidence, and integrate information across radiology images, pathology slides, and reports. It evaluates 18 open- and closed-source models and shows that autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment. Gold-standard tool routing produces large gains yet still leaves substantial shortfalls. The work establishes that strong perception capabilities in backbone models do not produce reliable agentic behavior in clinical workflows.

Core claim

MedCTA shows that autonomous clinical tool agents fail primarily through protocol violations, early termination, and wrong tool selection, while even perfect tool routing yields large but incomplete improvements in trajectory fidelity and outcome quality.

What carries the argument

MedCTA benchmark of 107 tasks with clinician-verified executable trajectories over 5 tools, together with process-aware metrics for tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality.

If this is right

  • Strong perception alone is insufficient for reliable multi-step medical tool use.
  • Protocol adherence and correct tool recruitment are primary bottlenecks in clinical agent rollouts.
  • Gold-standard routing improves performance substantially but does not achieve full reliability.
  • Process-aware evaluation is required to diagnose agent failures beyond final-answer accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended to measure how explicit planning modules reduce protocol failures.
  • Hybrid perception-plus-reasoning systems may be needed before clinical deployment becomes feasible.
  • Similar task sets could be created for other high-stakes domains such as legal or financial tool use.

Load-bearing premise

The 107 clinician-verified tasks and trajectories form a representative sample of real clinical workflows.

What would settle it

A model that sustains high trajectory fidelity and outcome quality across the full set of 107 tasks in fully autonomous rollouts without gold-standard routing would falsify the brittleness finding.

Figures

Figures reproduced from arXiv: 2606.11702 by Bernard Ghanem, Fida Mohammad Thoker, Hyewon Jeong, Tajamul Ashraf.

Figure 1
Figure 1. Figure 1: MedCTA curation pipeline. ➊ Query construction: Annotators expand expert exemplars into structured, executable, and tool-aware clinical queries, followed by clinical verification for correctness and relevance. ➋ Tool chain construction: GPT-generated tool trajectories are refined and corrected by annotators. Technical verification ensures executability and format compliance, while clinical experts validate… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of GPT-generated vs. human-annotated trajectories. Blue: technical fixes (tool compliance, stability, redundancy removal); Green: clinical corrections (sound reasoning, evidence grounding). The annotated version is clinically coherent, avoiding unnecessary steps while reaching the same diagnosis. query ( [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Correlation with final accuracy (Gacc). Step-level metrics show moderate correlation, while clinical reasoning metrics (especially Scomp and Facc) strongly correlate with final performance. This indicates that controller instability prevents models from reaching stable reasoning. intermediate execution. Reasoning metrics correlate most with final answer quality, while rollout diagnostics show controller fa… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative failure analysis on sentinel-node metastasis detection. The gold trajectory grounds reasoning in extracted visual evidence via tool execution. GPT-5.4 fails due to premature answering, while Qwen3.5-9B exhibits semantic drift by ignoring OCR evidence and relying on unrelated priors. This shows that accurate perception alone does not ensure grounded reasoning. occur first at protocol/API handlin… view at source ↗
Figure 5
Figure 5. Figure 5: Annotation protocol for validating executable tool-based reasoning trajectories in [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Clinical validation protocol for assessing medical correctness and reasoning quality in [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Stepwise construction of an executable and clinically validated trajectory for anatomical [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Stepwise construction of an executable and clinically validated trajectory for identifying [PITH_FULL_IMAGE:figures/full_fig_p031_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: ReAct-style execution prompt used in the [PITH_FULL_IMAGE:figures/full_fig_p032_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template used to synthesize a single executable ground-truth trajectory for each [PITH_FULL_IMAGE:figures/full_fig_p033_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Anatomical and modality coverage of MedCTA. The benchmark spans diverse body regions and clinical inputs, including eye, brain, oral cavity, lungs, heart, liver, kidney, intestine, gallbladder, pelvis, blood, and tissue-level pathology. Each example shows a step-implicit clinical query and its target answer, illustrating that MedCTA evaluates broad cross-specialty medical tool use rather than a single org… view at source ↗
Figure 12
Figure 12. Figure 12: Rollout diagnostics across models. (Left) Proprietary models show lower API/protocol failures and stronger oracle answerability. (Right) Failures are dominated by protocol/API handling and tool recruitment [PITH_FULL_IMAGE:figures/full_fig_p036_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Pooled conditional advantage analysis. The strongest positive descriptive signals come from schema-clean matched arguments and the absence of API/format errors. These are descriptive associations, not causal interventions. H.4 Trajectory length, prefix survival, and horizon effects [PITH_FULL_IMAGE:figures/full_fig_p037_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Trajectory-length diagnostics. Left: prefix-conditioned answer accuracy. Middle: answer accuracy as a function of gold tool-step count. Right: prefix survival by step index. Together these plots show that performance drops with horizon and collapses rapidly after the first action. H.5 Tool-level bottlenecks and localized grounding [PITH_FULL_IMAGE:figures/full_fig_p038_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: shows that autonomous tool matching is concentrated on relatively easy global-perception tools, especially ImageDescription. The weakest alignment is on RegionAttributeDescription, which requires localized visual grounding and is rarely reached correctly [PITH_FULL_IMAGE:figures/full_fig_p038_15.png] view at source ↗
Figure 17
Figure 17. Figure 17: Agentic reasoning example for anatomical variant identification 42 [PITH_FULL_IMAGE:figures/full_fig_p042_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Agentic reasoning example for imaging plane identification 43 [PITH_FULL_IMAGE:figures/full_fig_p043_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Agentic reasoning example for epithelial cell and stain identification 44 [PITH_FULL_IMAGE:figures/full_fig_p044_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Agentic reasoning example for numerical extraction and computation 45 [PITH_FULL_IMAGE:figures/full_fig_p045_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Agentic reasoning example for ultrastructural cell type identification 46 [PITH_FULL_IMAGE:figures/full_fig_p046_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Agentic reasoning example for thoracic CT abnormality detection 47 [PITH_FULL_IMAGE:figures/full_fig_p047_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Agentic reasoning example for identifying cell groups in a disease plot 48 [PITH_FULL_IMAGE:figures/full_fig_p048_23.png] view at source ↗
read the original abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces MedCTA, a benchmark for clinical tool agents consisting of 107 clinician-verified tasks grounded in realistic multimodal clinical inputs (radiology images, pathology slides, reports). Tasks include executable trajectories over 5 deployed tools and support process-aware metrics on tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. Evaluation of 18 open- and closed-source multimodal models shows that frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but incomplete gains. The work concludes that strong backbone perception does not translate to reliable agentic behavior and releases the dataset publicly.

Significance. If the task construction, clinician verification, and metric definitions are robust, MedCTA provides a valuable diagnostic testbed that moves beyond single-turn perception or QA benchmarks to expose failures in planning, tool recruitment, and rollout reliability in clinical settings. The public release of tasks and evaluation suite supports reproducibility and community progress on trustworthy medical agents. The reported performance gap between autonomous and gold-standard routing is a concrete, actionable finding for model developers.

major comments (2)
  1. [§3] §3 (MedCTA composition): The claim that the 107 clinician-verified tasks form a representative sample for diagnosing agent reliability across real clinical workflows requires explicit reporting of selection criteria, exclusion rules, and inter-clinician agreement statistics; without these, the central brittleness result cannot be fully audited for selection bias or coverage.
  2. [§4.3] §4.3 (evaluation protocol): The distinction between autonomous rollouts and gold-standard tool routing is central to the main claim, yet the manuscript does not specify how gold-standard trajectories were constructed or whether they were held out from model prompting; this detail is load-bearing for interpreting the size of the reported gains.
minor comments (3)
  1. [Results] Table 1 or equivalent results table: clarify whether the 18 models include any fine-tuned variants or only zero-shot prompting, as this affects interpretation of the brittleness findings.
  2. [Figure 2] Figure 2 (failure mode breakdown): axis labels and legend should explicitly define 'protocol failure' and 'premature stopping' to match the textual definitions.
  3. [Methods] The abstract states 'step-implicit tasks' but the methods section should add a short example trajectory to illustrate what 'step-implicit' means operationally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MedCTA's potential value. We address each major comment below and will revise the manuscript accordingly to improve clarity and transparency.

read point-by-point responses
  1. Referee: [§3] §3 (MedCTA composition): The claim that the 107 clinician-verified tasks form a representative sample for diagnosing agent reliability across real clinical workflows requires explicit reporting of selection criteria, exclusion rules, and inter-clinician agreement statistics; without these, the central brittleness result cannot be fully audited for selection bias or coverage.

    Authors: We agree that explicit documentation of these elements is necessary for full auditability. In the revised manuscript we will expand §3 with: (i) the precise selection criteria used to sample the 107 tasks from real clinical workflows, (ii) the exclusion rules applied during curation, and (iii) inter-clinician agreement statistics (including Cohen’s kappa) obtained during verification. These additions will allow readers to evaluate coverage and potential selection bias directly. revision: yes

  2. Referee: [§4.3] §4.3 (evaluation protocol): The distinction between autonomous rollouts and gold-standard tool routing is central to the main claim, yet the manuscript does not specify how gold-standard trajectories were constructed or whether they were held out from model prompting; this detail is load-bearing for interpreting the size of the reported gains.

    Authors: We acknowledge the omission and will clarify the protocol. Gold-standard trajectories were authored by an independent panel of clinicians according to the task specifications; they were never supplied in the prompts used for autonomous model rollouts and serve exclusively as the reference for the gold-standard routing baseline. We will insert this description into §4.3 so that the magnitude of the reported gains can be interpreted unambiguously. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a new benchmark (MedCTA) with 107 clinician-verified tasks and executable trajectories over 5 tools, then reports empirical results from running 18 models under autonomous and gold-standard routing conditions. No equations, parameter fits, or derivations appear in the provided text. The central claims rest on new task collection and external model evaluations rather than any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard practices of clinician validation and tool-interface definition.

pith-pipeline@v0.9.1-grok · 5764 in / 1128 out tokens · 19521 ms · 2026-06-27T10:29:48.806435+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

93 extracted references · 16 linked inside Pith

  1. [1]

    Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

  4. [4]

    Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876, 2025

    Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, et al. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876, 2025

  5. [5]

    Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  6. [6]

    Healthadmin- bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

    Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, et al. Healthadmin- bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026. 10

  7. [7]

    Vqa-med: Overview of the medical visual question answering task at imageclef 2019

    Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. InProceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019

  8. [8]

    Castro, Anton Schwaighofer, Stephanie L

    Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie L. Hyland, Maria T. Wetscherek, Tristan Naumann, Harsha Nori, Neeraj Ahuja, et al. Making the most of text semantics to improve biomedical vision–language processing, 2022

  9. [9]

    Clinical reasoning of a generative artificial intelligence model compared with physicians.JAMA internal medicine, 184(5):581–583, 2024

    Stephanie Cabral, Daniel Restrepo, Zahir Kanjee, Philip Wilson, Byron Crowe, Raja-Elie Abdulnour, and Adam Rodman. Clinical reasoning of a generative artificial intelligence model compared with physicians.JAMA internal medicine, 184(5):581–583, 2024

  10. [10]

    Langchain, October 2022

    Harrison Chase. Langchain, October 2022

  11. [11]

    Review of medical image quality assessment

    Li Sze Chow and Raveendran Paramesran. Review of medical image quality assessment. Biomedical signal processing and control, 27:145–154, 2016

  12. [12]

    Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

  13. [13]

    Medmo: Grounding and understanding multimodal large language model for medical images.arXiv preprint arXiv:2602.06965, 2026

    Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, and Imran Razzak. Medmo: Grounding and understanding multimodal large language model for medical images.arXiv preprint arXiv:2602.06965, 2026

  14. [14]

    Duncan and Nicholas Ayache

    James S. Duncan and Nicholas Ayache. Medical image analysis: Progress over two decades and the challenges ahead.IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):85–106, 2000

  15. [15]

    Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

    Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, and Bo Wang. Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

  16. [16]

    Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator

    Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, and Jingren Zhou. Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. InProceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213. Association for Computational Linguistics, January 2025

  17. [17]

    Autogpt, 2023

    Significant Gravitas. Autogpt, 2023

  18. [18]

    Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

  19. [19]

    Omnimed- vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimed- vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024

  20. [20]

    Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

  21. [21]

    Medagentbench: a virtual ehr environment to benchmark medical llm agents

    Yixing Jiang, Kameron C Black, Gloria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents. Nejm Ai, 2(9):AIdbp2500144, 2025

  22. [22]

    Black, Gloria Geng, Danny Park, James Zou, Andrew Y

    Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y . Ng, and Jonathan H. Chen. A virtual EHR environment to benchmark medical LLM agents.NEJM AI, 2025

  23. [23]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world GitHub issues?, 2023. 11

  24. [24]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

  25. [25]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  26. [26]

    Behaviorsft: Behavioral token conditioning for health agents across the proactivity spectrum

    Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene W Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, et al. Behaviorsft: Behavioral token conditioning for health agents across the proactivity spectrum. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 9789–9817, 2025

  27. [27]

    Tiered agentic oversight: A hierarchical multi-agent system for healthcare safety.arXiv preprint arXiv:2506.12482, 2025

    Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, et al. Tiered agentic oversight: A hierarchical multi-agent system for healthcare safety.arXiv preprint arXiv:2506.12482, 2025

  28. [28]

    Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  29. [29]

    A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

  30. [30]

    Mmedagent: Learning to use medical tools with multi-modal agent

    Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, 2024

  31. [31]

    Modelscope-agent: Building your cus- tomizable agent system with open-source large language models

    Chenliang Li, He Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, et al. Modelscope-agent: Building your cus- tomizable agent system with open-source large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 566–578, 2023

  32. [32]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day, 2023

  33. [33]

    A comprehensive benchmark for tool-augmented llms

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li API-bank. A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, 2023

  34. [34]

    Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov

    Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning, 2024

  35. [35]

    Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

  36. [36]

    Llava-plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024

  37. [37]

    Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024. 12

  38. [38]

    Chameleon: Plug-and-play compositional reasoning with large language models, 2023

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models, 2023

  39. [39]

    Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

    Mingyu Derek Ma, Chenchen Ye, et al. Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

  40. [40]

    Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20)

    AI Meta. Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20). There is no corresponding record for this reference, 2024

  41. [41]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  42. [42]

    Huang, Yadong Wu, T

    Michael Moor, Shraey B. Huang, Yadong Wu, T. Kapur, et al. Med-flamingo: a multimodal medical few-shot learner. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 225 ofProceedings of Machine Learning Research, 2023

  43. [43]

    Babyagi, 2023

    Yohei Nakajima. Babyagi, 2023

  44. [44]

    Webgpt: Browser-assisted question-answering with human feedback, 2021

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2021

  45. [45]

    OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

  46. [46]

    Webcpm: Interactive web search for chinese long-form question answering

    Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8968–8988, 2023

  47. [47]

    Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024

  48. [48]

    Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

  49. [49]

    Deep learning in medical image analysis

    Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19(1):221–248, 2017. 13

  50. [50]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024

  51. [51]

    Fleming-vl: Towards universal medical visual reasoning with multimodal llms.arXiv preprint arXiv:2511.00916, 2025

    Yan Shu, Chi Liu, Robin Chen, Derek Li, and Bryan Dai. Fleming-vl: Towards universal medical visual reasoning with multimodal llms.arXiv preprint arXiv:2511.00916, 2025

  52. [52]

    Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

  53. [53]

    Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023

    Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023

  54. [54]

    Lesion guided explainable few weak-shot medical report generation

    Jinghan Sun, Dong Wei, Liansheng Wang, and Yefeng Zheng. Lesion guided explainable few weak-shot medical report generation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 615–625. Springer, 2022

  55. [55]

    Docagent: An agentic framework for multi-modal long-context document understanding

    Lin Sun et al. Docagent: An agentic framework for multi-modal long-context document understanding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  56. [56]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  57. [57]

    Enhance llm agents with versatile tool apis

    AgentLego Developer Team. Enhance llm agents with versatile tool apis. https://github. com/InternLM/agentlego, 2023

  58. [58]

    Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  59. [59]

    Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents

    Lagent Developer Team. Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents. https://github.com/ InternLM/lagent, 2023

  60. [60]

    Mllm-tool: A multimodal large language model for tool agent learning.arXiv preprint arXiv:2401.10727, 4, 2024

    C Wang, W Luo, Q Chen, H Mai, J Guo, S Dong, XM Xuan, Z Li, L Ma, and S Gao. Mllm-tool: A multimodal large language model for tool agent learning.arXiv preprint arXiv:2401.10727, 4, 2024

  61. [61]

    Gta: A benchmark for general tool agents

    Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: A benchmark for general tool agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 75749–75790. Curran Associates, Inc., 2024

  62. [62]

    Medclip: Contrastive learning from unpaired medical images and text, 2022

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text, 2022

  63. [63]

    Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow.arXiv preprint arXiv:2503.18968, 2025

    Ziyue Wang, Junde Wu, Linghan Cai, Chang Han Low, Xihong Yang, Qiaxuan Li, and Yueming Jin. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow.arXiv preprint arXiv:2503.18968, 2025

  64. [64]

    Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data, 2023

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data, 2023

  65. [65]

    Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

  66. [66]

    Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023. 14

  67. [67]

    Large multimodal agents: A survey.arXiv preprint arXiv:2402.15116, 2024

    Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey.arXiv preprint arXiv:2402.15116, 2024

  68. [68]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972, 2024

  69. [69]

    A comprehensive survey of ai agents in healthcare.TechRxiv, 2025

    Gelei Xu, Xueyang Li, Yixiong Chen, Yuying Duan, Shuqing Wu, Alexander Yu, Ching-Hao Chiu, Juntong Ni, Ningzhi Tang, Toby Jia-Jun Li, et al. A comprehensive survey of ai agents in healthcare.TechRxiv, 2025

  70. [70]

    Medagentgym: Training llm agents for code-based medical reasoning at scale

    Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, et al. Medagentgym: Training llm agents for code-based medical reasoning at scale. InThe Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance, 2025

  71. [71]

    Lingshu: A generalist foun- dation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Cheng- hao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foun- dation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

  72. [72]

    Livemedbench: A contamination-free medical benchmark for llms with automated rubric evaluation.arXiv preprint arXiv:2602.10367, 2026

    Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, and Lichao Sun. Livemedbench: A contamination-free medical benchmark for llms with automated rubric evaluation.arXiv preprint arXiv:2602.10367, 2026

  73. [73]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

  74. [74]

    Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

  75. [75]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  76. [76]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  77. [77]

    Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework.arXiv preprint arXiv:2410.01553, 2024

    Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, et al. Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework.arXiv preprint arXiv:2410.01553, 2024

  78. [78]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  79. [79]

    Appagent: Multimodal agents as smartphone users, 2023

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023

  80. [80]

    Pmc-vqa: Visual instruction tuning for medical visual question answering, 2024.URL https://arxiv

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2024.URL https://arxiv. org/abs/2305.10415, 40, 2024

Showing first 80 references.