MedCTA: A Benchmark for Clinical Tool Agents

Bernard Ghanem; Fida Mohammad Thoker; Hyewon Jeong; Tajamul Ashraf

arXiv: 2606.11702 · v1 · pith:YGPZAUOInew · submitted 2026-06-10 · 💻 cs.CV · cs.AI · cs.CL

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

Pith reviewed 2026-06-27 10:29 UTC · model grok-4.3

keywords: MedCTA · clinical tool agents · multimodal models · agent reliability · medical AI · tool use benchmark · process-aware evaluation

The pith

Even frontier multimodal models remain brittle in multi-step clinical tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedCTA, a benchmark of 107 clinician-verified clinical tasks that require agents to retrieve tools, acquire evidence, and integrate information across radiology images, pathology slides, and reports. It evaluates 18 open- and closed-source models and shows that autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment. Gold-standard tool routing produces large gains yet still leaves substantial shortfalls. The work establishes that strong perception capabilities in backbone models do not produce reliable agentic behavior in clinical workflows.

Core claim

MedCTA shows that autonomous clinical tool agents fail primarily through protocol violations, early termination, and wrong tool selection, while even perfect tool routing yields large but incomplete improvements in trajectory fidelity and outcome quality.
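
As a rough illustration of the three failure modes named in the core claim, the sketch below sorts a rollout into one bucket by comparing its tool-call sequence with a gold trajectory. The bucket definitions, their precedence order, and the `classify_rollout` helper are assumptions made for illustration and may not match the paper's own definitions. `ImageDescription` and `RegionAttributeDescription` are the only tool names taken from the figure captions; the remaining tools are not listed on this page.

```python
from collections import Counter

# Two tool names quoted in the figure captions; the other tools are not listed here.
KNOWN_TOOLS = {"ImageDescription", "RegionAttributeDescription"}


def classify_rollout(pred_tools, gold_tools, known_tools=KNOWN_TOOLS):
    """Assign one illustrative failure label to a rollout.

    pred_tools: tool names in call order; None marks a step that could not be parsed.
    gold_tools: tool names of the reference trajectory.
    """
    # Assumed definition: any unparseable or unknown call is a protocol failure.
    for name in pred_tools:
        if name is None or name not in known_tools:
            return "protocol_failure"
    # Assumed definition: the first position that disagrees with gold is a wrong tool.
    for pred, gold in zip(pred_tools, gold_tools):
        if pred != gold:
            return "wrong_tool"
    # Every emitted call matched, but the rollout ended before the gold trajectory did.
    if len(pred_tools) < len(gold_tools):
        return "premature_stop"
    if len(pred_tools) > len(gold_tools):
        return "extra_steps"
    return "match"


gold = ["ImageDescription", "RegionAttributeDescription"]
rollouts = [
    [None, "ImageDescription"],                          # malformed first call
    ["ImageDescription"],                                # stops after one step
    ["RegionAttributeDescription", "ImageDescription"],  # wrong first tool
    ["ImageDescription", "RegionAttributeDescription"],  # follows gold
]
counts = Counter(classify_rollout(r, gold) for r in rollouts)
print(counts)
```

Returning a single label per rollout keeps the tally simple; a fuller diagnostic would likely record every failure observed along the trajectory.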

What carries the argument

MedCTA benchmark of 107 tasks with clinician-verified executable trajectories over 5 tools, together with process-aware metrics for tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality.
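
To make the idea of process-aware metrics concrete, here is a minimal sketch of three trajectory-level scores computed against a gold trajectory. The formulas are assumptions chosen for illustration (position-wise tool match, required-key checks, and a longest-common-subsequence ratio); this page does not give the paper's exact definitions. The `ToolCall` type, the argument names `image` and `region`, and the `REQUIRED_ARGS` table are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical required-argument table; the real tool schemas are not given here.
REQUIRED_ARGS = {
    "ImageDescription": {"image"},
    "RegionAttributeDescription": {"image", "region"},
}


def tool_selection_accuracy(pred, gold):
    """Fraction of gold steps whose tool name is matched at the same position."""
    if not gold:
        return 1.0
    hits = sum(1 for p, g in zip(pred, gold) if p.tool == g.tool)
    return hits / len(gold)


def argument_validity(pred, required_args=REQUIRED_ARGS):
    """Fraction of predicted calls that name a known tool and supply all required keys."""
    if not pred:
        return 0.0
    valid = sum(
        1 for call in pred
        if call.tool in required_args and required_args[call.tool] <= set(call.args)
    )
    return valid / len(pred)


def trajectory_fidelity(pred, gold):
    """Longest common subsequence of tool names, divided by the longer trajectory length."""
    a = [c.tool for c in pred]
    b = [c.tool for c in gold]
    if not a and not b:
        return 1.0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / max(len(a), len(b))


gold_traj = [
    ToolCall("ImageDescription", {"image": "slide.png"}),
    ToolCall("RegionAttributeDescription", {"image": "slide.png", "region": [0, 0, 10, 10]}),
]
# A rollout that answers after the first tool call.
pred_traj = [ToolCall("ImageDescription", {"image": "slide.png"})]

print(tool_selection_accuracy(pred_traj, gold_traj))  # 0.5
print(argument_validity(pred_traj))                   # 1.0
print(trajectory_fidelity(pred_traj, gold_traj))      # 0.5
```

In this example the single call is well-formed, so argument validity is perfect while the selection and fidelity scores expose the missing second step. That is the kind of gap a final-answer metric alone would hide.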

If this is right

Strong perception alone is insufficient for reliable multi-step medical tool use.
Protocol adherence and correct tool recruitment are primary bottlenecks in clinical agent rollouts.
Gold-standard routing improves performance substantially but does not achieve full reliability.
Process-aware evaluation is required to diagnose agent failures beyond final-answer accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to measure how explicit planning modules reduce protocol failures.
Hybrid perception-plus-reasoning systems may be needed before clinical deployment becomes feasible.
Similar task sets could be created for other high-stakes domains such as legal or financial tool use.

Load-bearing premise

The 107 clinician-verified tasks and trajectories form a representative sample of real clinical workflows.

What would settle it

A model that sustains high trajectory fidelity and outcome quality across the full set of 107 tasks in fully autonomous rollouts without gold-standard routing would falsify the brittleness finding.

Figures

Figures reproduced from arXiv: 2606.11702 by Bernard Ghanem, Fida Mohammad Thoker, Hyewon Jeong, Tajamul Ashraf.

**Figure 1.** MedCTA curation pipeline. ➊ Query construction: Annotators expand expert exemplars into structured, executable, and tool-aware clinical queries, followed by clinical verification for correctness and relevance. ➋ Tool chain construction: GPT-generated tool trajectories are refined and corrected by annotators. Technical verification ensures executability and format compliance, while clinical experts validate…

**Figure 2.** Comparison of GPT-generated vs. human-annotated trajectories. Blue: technical fixes (tool compliance, stability, redundancy removal); Green: clinical corrections (sound reasoning, evidence grounding). The annotated version is clinically coherent, avoiding unnecessary steps while reaching the same diagnosis.

**Figure 3.** Correlation with final accuracy (Gacc). Step-level metrics show moderate correlation, while clinical reasoning metrics (especially Scomp and Facc) strongly correlate with final performance. This indicates that controller instability prevents models from reaching stable reasoning.

**Figure 4.** Qualitative failure analysis on sentinel-node metastasis detection. The gold trajectory grounds reasoning in extracted visual evidence via tool execution. GPT-5.4 fails due to premature answering, while Qwen3.5-9B exhibits semantic drift by ignoring OCR evidence and relying on unrelated priors. This shows that accurate perception alone does not ensure grounded reasoning.

**Figure 5.** Annotation protocol for validating executable tool-based reasoning trajectories in…

**Figure 6.** Clinical validation protocol for assessing medical correctness and reasoning quality in…

**Figure 7.** Stepwise construction of an executable and clinically validated trajectory for anatomical…

**Figure 8.** Stepwise construction of an executable and clinically validated trajectory for identifying…

**Figure 9.** ReAct-style execution prompt used in the…

**Figure 10.** Prompt template used to synthesize a single executable ground-truth trajectory for each…

**Figure 11.** Anatomical and modality coverage of MedCTA. The benchmark spans diverse body regions and clinical inputs, including eye, brain, oral cavity, lungs, heart, liver, kidney, intestine, gallbladder, pelvis, blood, and tissue-level pathology. Each example shows a step-implicit clinical query and its target answer, illustrating that MedCTA evaluates broad cross-specialty medical tool use rather than a single org…

**Figure 12.** Rollout diagnostics across models. (Left) Proprietary models show lower API/protocol failures and stronger oracle answerability. (Right) Failures are dominated by protocol/API handling and tool recruitment.

**Figure 13.** Pooled conditional advantage analysis. The strongest positive descriptive signals come from schema-clean matched arguments and the absence of API/format errors. These are descriptive associations, not causal interventions.

**Figure 14.** Trajectory-length diagnostics. Left: prefix-conditioned answer accuracy. Middle: answer accuracy as a function of gold tool-step count. Right: prefix survival by step index. Together these plots show that performance drops with horizon and collapses rapidly after the first action.

**Figure 15.** Autonomous tool matching is concentrated on relatively easy global-perception tools, especially ImageDescription. The weakest alignment is on RegionAttributeDescription, which requires localized visual grounding and is rarely reached correctly.

**Figure 17.** Agentic reasoning example for anatomical variant identification.

**Figure 18.** Agentic reasoning example for imaging plane identification.

**Figure 19.** Agentic reasoning example for epithelial cell and stain identification.

**Figure 20.** Agentic reasoning example for numerical extraction and computation.

**Figure 21.** Agentic reasoning example for ultrastructural cell type identification.

**Figure 22.** Agentic reasoning example for thoracic CT abnormality detection.

**Figure 23.** Agentic reasoning example for identifying cell groups in a disease plot.
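
The "prefix survival by step index" diagnostic named in the Figure 14 caption can be sketched as follows. The definition used here is an assumption: a rollout "survives" to step k if its first k tool names equal the first k gold tool names, and the rate is taken over tasks whose gold trajectory has at least k steps. The `prefix_survival` helper and the toy rollouts are hypothetical, not data from the paper.

```python
def prefix_survival(rollouts, max_steps):
    """Return a list whose entry k-1 is the share of rollouts still matching gold after k steps.

    rollouts: list of (pred_tools, gold_tools) pairs of tool-name lists.
    Entries are None when no gold trajectory is at least k steps long.
    """
    curve = []
    for k in range(1, max_steps + 1):
        eligible = [(p, g) for p, g in rollouts if len(g) >= k]
        if not eligible:
            curve.append(None)
            continue
        # A rollout shorter than k cannot match the gold prefix of length k.
        alive = sum(1 for p, g in eligible if p[:k] == g[:k])
        curve.append(alive / len(eligible))
    return curve


gold_steps = ["ImageDescription", "RegionAttributeDescription"]
toy_rollouts = [
    (["ImageDescription", "RegionAttributeDescription"], gold_steps),  # full match
    (["ImageDescription"], gold_steps),                                # stops early
    (["RegionAttributeDescription"], gold_steps),                      # wrong first tool
    ([], gold_steps),                                                  # no tool call
]
print(prefix_survival(toy_rollouts, max_steps=2))  # [0.5, 0.25]
```

A curve that falls steeply between the first and second entries would correspond to the caption's observation that performance collapses soon after the first action.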

Original abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MedCTA is a new benchmark of 107 clinician-verified tasks showing that frontier models still fail at reliable multi-step clinical tool use. It comes with useful process metrics and released data.

The main takeaway is that this paper moves evaluation from single-turn perception or QA to full agent trajectories on realistic multimodal clinical inputs. The 107 tasks come with executable trajectories over five tools and support metrics on tool selection, argument validity, trajectory fidelity, and outcome quality.

The paper does the basics right. The authors run 18 open- and closed-source models, separate autonomous rollouts from gold-standard routing, and document clear patterns: protocol failures, premature stopping, and wrong tool recruitment dominate when agents are left on their own. Gold routing closes much of the gap but leaves residual errors. Releasing the dataset and evaluation suite is the right step.

The evidence looks proportionate. The distinction between autonomous and routed performance is straightforward and matches what the abstract reports. No circular derivations or heavy self-citation load.

The soft spot is coverage. The claim that these tasks diagnose broader clinical workflows rests on clinician verification, but the abstract gives limited detail on selection criteria, exclusion rules, or inter-clinician agreement. Five tools is a sensible starting scope, yet it narrows what the benchmark can say about larger toolsets.

This is for researchers working on medical agents or building domain-specific benchmarks. Anyone testing reliability in high-stakes settings will get concrete failure cases to work from.

It deserves peer review. The core setup is timely, the model runs are concrete, and the gaps they flag are actionable.

Referee Report

2 major / 3 minor

Summary. The paper introduces MedCTA, a benchmark for clinical tool agents consisting of 107 clinician-verified tasks grounded in realistic multimodal clinical inputs (radiology images, pathology slides, reports). Tasks include executable trajectories over 5 deployed tools and support process-aware metrics on tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. Evaluation of 18 open- and closed-source multimodal models shows that frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but incomplete gains. The work concludes that strong backbone perception does not translate to reliable agentic behavior and releases the dataset publicly.

Significance. If the task construction, clinician verification, and metric definitions are robust, MedCTA provides a valuable diagnostic testbed that moves beyond single-turn perception or QA benchmarks to expose failures in planning, tool recruitment, and rollout reliability in clinical settings. The public release of tasks and evaluation suite supports reproducibility and community progress on trustworthy medical agents. The reported performance gap between autonomous and gold-standard routing is a concrete, actionable finding for model developers.

major comments (2)

[§3] §3 (MedCTA composition): The claim that the 107 clinician-verified tasks form a representative sample for diagnosing agent reliability across real clinical workflows requires explicit reporting of selection criteria, exclusion rules, and inter-clinician agreement statistics; without these, the central brittleness result cannot be fully audited for selection bias or coverage.
[§4.3] §4.3 (evaluation protocol): The distinction between autonomous rollouts and gold-standard tool routing is central to the main claim, yet the manuscript does not specify how gold-standard trajectories were constructed or whether they were held out from model prompting; this detail is load-bearing for interpreting the size of the reported gains.

minor comments (3)

[Results] Table 1 or equivalent results table: clarify whether the 18 models include any fine-tuned variants or only zero-shot prompting, as this affects interpretation of the brittleness findings.
[Figure 2] Figure 2 (failure mode breakdown): axis labels and legend should explicitly define 'protocol failure' and 'premature stopping' to match the textual definitions.
[Methods] The abstract states 'step-implicit tasks' but the methods section should add a short example trajectory to illustrate what 'step-implicit' means operationally.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MedCTA's potential value. We address each major comment below and will revise the manuscript accordingly to improve clarity and transparency.

Referee: [§3] §3 (MedCTA composition): The claim that the 107 clinician-verified tasks form a representative sample for diagnosing agent reliability across real clinical workflows requires explicit reporting of selection criteria, exclusion rules, and inter-clinician agreement statistics; without these, the central brittleness result cannot be fully audited for selection bias or coverage.

Authors: We agree that explicit documentation of these elements is necessary for full auditability. In the revised manuscript we will expand §3 with: (i) the precise selection criteria used to sample the 107 tasks from real clinical workflows, (ii) the exclusion rules applied during curation, and (iii) inter-clinician agreement statistics (including Cohen’s kappa) obtained during verification. These additions will allow readers to evaluate coverage and potential selection bias directly.

revision: yes

Referee: [§4.3] §4.3 (evaluation protocol): The distinction between autonomous rollouts and gold-standard tool routing is central to the main claim, yet the manuscript does not specify how gold-standard trajectories were constructed or whether they were held out from model prompting; this detail is load-bearing for interpreting the size of the reported gains.

Authors: We acknowledge the omission and will clarify the protocol. Gold-standard trajectories were authored by an independent panel of clinicians according to the task specifications; they were never supplied in the prompts used for autonomous model rollouts and serve exclusively as the reference for the gold-standard routing baseline. We will insert this description into §4.3 so that the magnitude of the reported gains can be interpreted unambiguously.

revision: yes
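
For readers unfamiliar with the agreement statistic promised in the first response, the sketch below computes Cohen's kappa for two raters from scratch using the standard formula, kappa = (p_o - p_e) / (1 - p_e). The labels and the two rating lists are invented for illustration and are not data from the paper; how the authors would actually code clinician verdicts is not stated.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two clinicians on four candidate trajectories.
clinician_1 = ["valid", "valid", "invalid", "invalid"]
clinician_2 = ["valid", "invalid", "invalid", "invalid"]
print(cohens_kappa(clinician_1, clinician_2))  # 0.5
```

Here the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is p_e = 0.5, which gives kappa = 0.5. This shows why raw agreement alone overstates reliability.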

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new benchmark (MedCTA) with 107 clinician-verified tasks and executable trajectories over 5 tools, then reports empirical results from running 18 models under autonomous and gold-standard routing conditions. No equations, parameter fits, or derivations appear in the provided text. The central claims rest on new task collection and external model evaluations rather than any self-referential definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard practices of clinician validation and tool-interface definition.

Reference graph

Works this paper leans on

93 extracted references · 16 linked inside Pith

[1]

Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

Pith/arXiv arXiv 2024
[2]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[3]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 2024

2024
[4]

Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876, 2025

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, et al. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876, 2025

Pith/arXiv arXiv 2025
[5]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[6]

Healthadmin- bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, et al. Healthadmin- bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026. 10

Pith/arXiv arXiv 2026
[7]

Vqa-med: Overview of the medical visual question answering task at imageclef 2019

Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. InProceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019

2019
[8]

Castro, Anton Schwaighofer, Stephanie L

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie L. Hyland, Maria T. Wetscherek, Tristan Naumann, Harsha Nori, Neeraj Ahuja, et al. Making the most of text semantics to improve biomedical vision–language processing, 2022

2022
[9]

Clinical reasoning of a generative artificial intelligence model compared with physicians.JAMA internal medicine, 184(5):581–583, 2024

Stephanie Cabral, Daniel Restrepo, Zahir Kanjee, Philip Wilson, Byron Crowe, Raja-Elie Abdulnour, and Adam Rodman. Clinical reasoning of a generative artificial intelligence model compared with physicians.JAMA internal medicine, 184(5):581–583, 2024

2024
[10]

Langchain, October 2022

Harrison Chase. Langchain, October 2022

2022
[11]

Review of medical image quality assessment

Li Sze Chow and Raveendran Paramesran. Review of medical image quality assessment. Biomedical signal processing and control, 27:145–154, 2016

2016
[12]

Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023

2023
[13]

Medmo: Grounding and understanding multimodal large language model for medical images.arXiv preprint arXiv:2602.06965, 2026

Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, and Imran Razzak. Medmo: Grounding and understanding multimodal large language model for medical images.arXiv preprint arXiv:2602.06965, 2026

arXiv 2026
[14]

Duncan and Nicholas Ayache

James S. Duncan and Nicholas Ayache. Medical image analysis: Progress over two decades and the challenges ahead.IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):85–106, 2000

2000
[15]

Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, and Bo Wang. Medrax: Medical reasoning agent for chest x-ray.arXiv preprint arXiv:2502.02673, 2025

arXiv 2025
[16]

Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator

Zhihao Fan, Lai Wei, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, and Jingren Zhou. Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator. InProceedings of the 31st International Conference on Computational Linguistics, pages 10183–10213. Association for Computational Linguistics, January 2025

2025
[17]

Autogpt, 2023

Significant Gravitas. Autogpt, 2023

2023
[18]

Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

Pith/arXiv arXiv 2003
[19]

Omnimed- vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimed- vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024

2024
[20]

Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

Pith/arXiv arXiv 2023
[21]

Medagentbench: a virtual ehr environment to benchmark medical llm agents

Yixing Jiang, Kameron C Black, Gloria Geng, Danny Park, James Zou, Andrew Y Ng, and Jonathan H Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents. Nejm Ai, 2(9):AIdbp2500144, 2025

2025
[22]

Black, Gloria Geng, Danny Park, James Zou, Andrew Y

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y . Ng, and Jonathan H. Chen. A virtual EHR environment to benchmark medical LLM agents.NEJM AI, 2025

2025
[23]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world GitHub issues?, 2023. 11

2023
[24]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

2021
[25]

Pubmedqa: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

2019
[26]

Behaviorsft: Behavioral token conditioning for health agents across the proactivity spectrum

Yubin Kim, Zhiyuan Hu, Hyewon Jeong, Eugene W Park, Shuyue Stella Li, Chanwoo Park, Shiyun Xiong, MingYu Lu, Hyeonhoon Lee, Xin Liu, et al. Behaviorsft: Behavioral token conditioning for health agents across the proactivity spectrum. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 9789–9817, 2025

2025
[27]

Tiered agentic oversight: A hierarchical multi-agent system for healthcare safety.arXiv preprint arXiv:2506.12482, 2025

Yubin Kim, Hyewon Jeong, Chanwoo Park, Eugene Park, Haipeng Zhang, Xin Liu, Hyeonhoon Lee, Daniel McDuff, Marzyeh Ghassemi, Cynthia Breazeal, et al. Tiered agentic oversight: A hierarchical multi-agent system for healthcare safety.arXiv preprint arXiv:2506.12482, 2025

arXiv 2025
[28]

Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

2024
[29]

A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

2018
[30]

Mmedagent: Learning to use medical tools with multi-modal agent

Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, 2024

2024
[31]

Modelscope-agent: Building your cus- tomizable agent system with open-source large language models

Chenliang Li, He Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, et al. Modelscope-agent: Building your cus- tomizable agent system with open-source large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 566–578, 2023

2023
[32]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day, 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day, 2023

2023
[33]

A comprehensive benchmark for tool-augmented llms

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li API-bank. A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, 2023

2023
[34]

Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov

Shuyue Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S. Ilgen, Emma Pierson, Pang Wei Koh, and Yulia Tsvetkov. Mediq: Question-asking LLMs and a benchmark for reliable interactive clinical reasoning, 2024

2024
[35]

Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

2021
[36]

Llava-plus: Learning to use tools for creating multimodal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024

2024
[37]

Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024. 12

Pith/arXiv arXiv 2024
[38]

Chameleon: Plug-and-play compositional reasoning with large language models, 2023

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models, 2023

2023
[39]

Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

Mingyu Derek Ma, Chenchen Ye, et al. Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

2024
[40]

Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20)

AI Meta. Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20). There is no corresponding record for this reference, 2024

2024
[41]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023
[42]

Michael Moor, Shraey B. Huang, Yadong Wu, T. Kapur, et al. Med-flamingo: a multimodal medical few-shot learner. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 225 of Proceedings of Machine Learning Research, 2023.
[43]

Yohei Nakajima. Babyagi, 2023.
[44]

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2021.
[45]

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

2025
[46]

Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8968–8988, 2023.
[47]

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024.
[48]

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.
[49]

Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19(1):221–248, 2017.
[50]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.
[51]

Yan Shu, Chi Liu, Robin Chen, Derek Li, and Bryan Dai. Fleming-vl: Towards universal medical visual reasoning with multimodal llms. arXiv preprint arXiv:2511.00916, 2025.
[52]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2026.
[53]

Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis. arXiv preprint arXiv:2306.06624, 2023.
[54]

Jinghan Sun, Dong Wei, Liansheng Wang, and Yefeng Zheng. Lesion guided explainable few weak-shot medical report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 615–625. Springer, 2022.
[55]

Lin Sun et al. Docagent: An agentic framework for multi-modal long-context document understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
[56]

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[57]

AgentLego Developer Team. Enhance llm agents with versatile tool apis. https://github.com/InternLM/agentlego, 2023.
[58]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[59]

Lagent Developer Team. Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents. https://github.com/InternLM/lagent, 2023.
[60]

C Wang, W Luo, Q Chen, H Mai, J Guo, S Dong, XM Xuan, Z Li, L Ma, and S Gao. Mllm-tool: A multimodal large language model for tool agent learning. arXiv preprint arXiv:2401.10727, 4, 2024.
[61]

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: A benchmark for general tool agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 75749–75790. Curran Associates, Inc., 2024.
[62]

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text, 2022.
[63]

Ziyue Wang, Junde Wu, Linghan Cai, Chang Han Low, Xihong Yang, Qiaxuan Li, and Yueming Jin. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. arXiv preprint arXiv:2503.18968, 2025.
[64]

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data, 2023.
[65]

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
[66]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023.
[67]

Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.
[68]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.
[69]

Gelei Xu, Xueyang Li, Yixiong Chen, Yuying Duan, Shuqing Wu, Alexander Yu, Ching-Hao Chiu, Juntong Ni, Ningzhi Tang, Toby Jia-Jun Li, et al. A comprehensive survey of ai agents in healthcare. TechRxiv, 2025.
[70]

Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, et al. Medagentgym: Training llm agents for code-based medical reasoning at scale. In The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance, 2025.
[71]

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025.
[72]

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, and Lichao Sun. Livemedbench: A contamination-free medical benchmark for llms with automated rubric evaluation. arXiv preprint arXiv:2602.10367, 2026.
[73]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024.
[74]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
[75]

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
[76]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
[77]

Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, et al. Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework. arXiv preprint arXiv:2410.01553, 2024.
[78]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

2024
[79]

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023.
[80]

Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2024. URL https://arxiv.org/abs/2305.10415, 40, 2024.

[36] [36]

Llava-plus: Learning to use tools for creating multimodal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024

2024

[37] [37]

Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024. 12

Pith/arXiv arXiv 2024

[38] [38]

Chameleon: Plug-and-play compositional reasoning with large language models, 2023

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models, 2023

2023

[39] [39]

Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

Mingyu Derek Ma, Chenchen Ye, et al. Clibench: A multifaceted and multigranular evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions, 2024

2024

[40] [40]

Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20)

AI Meta. Introducing meta llama 3: The most capable openly available llm to date.Meta AI Blog (accessed 2024–04–20). There is no corresponding record for this reference, 2024

2024

[41] [41]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023

[42] [42]

Huang, Yadong Wu, T

Michael Moor, Shraey B. Huang, Yadong Wu, T. Kapur, et al. Med-flamingo: a multimodal medical few-shot learner. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 225 ofProceedings of Machine Learning Research, 2023

2023

[43] [43]

Babyagi, 2023

Yohei Nakajima. Babyagi, 2023

2023

[44] [44]

Webgpt: Browser-assisted question-answering with human feedback, 2021

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2021

2021

[45] [45]

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

2025

[46] [46]

Webcpm: Interactive web search for chinese long-form question answering

Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8968–8988, 2023

2023

[47] [47]

Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024

2024

[48] [48]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025

[49] [49]

Deep learning in medical image analysis

Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19(1):221–248, 2017. 13

2017

[50] [50]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024

2024

[51] [51]

Fleming-vl: Towards universal medical visual reasoning with multimodal llms.arXiv preprint arXiv:2511.00916, 2025

Yan Shu, Chi Liu, Robin Chen, Derek Li, and Bryan Dai. Fleming-vl: Towards universal medical visual reasoning with multimodal llms.arXiv preprint arXiv:2511.00916, 2025

arXiv 2025

[52] [52]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2026

Pith/arXiv arXiv 2026

[53] [53]

Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023

Yifan Song, Weimin Xiong, Dawei Zhu, Cheng Li, Ke Wang, Ye Tian, and Sujian Li. Restgpt: Connecting large language models with real-world applications via restful apis.arXiv preprint arXiv:2306.06624, 2023

arXiv 2023

[54] [54]

Lesion guided explainable few weak-shot medical report generation

Jinghan Sun, Dong Wei, Liansheng Wang, and Yefeng Zheng. Lesion guided explainable few weak-shot medical report generation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 615–625. Springer, 2022

2022

[55] [55]

Docagent: An agentic framework for multi-modal long-context document understanding

Lin Sun et al. Docagent: An agentic framework for multi-modal long-context document understanding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

2025

[56] [56]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023

[57] [57]

Enhance llm agents with versatile tool apis

AgentLego Developer Team. Enhance llm agents with versatile tool apis. https://github. com/InternLM/agentlego, 2023

2023

[58] [58]

Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[59] [59]

Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents

Lagent Developer Team. Lagent: InternLM a lightweight open-source framework that allows users to efficiently build large language model(llm)-based agents. https://github.com/ InternLM/lagent, 2023

2023

[60] [60]

Mllm-tool: A multimodal large language model for tool agent learning.arXiv preprint arXiv:2401.10727, 4, 2024

C Wang, W Luo, Q Chen, H Mai, J Guo, S Dong, XM Xuan, Z Li, L Ma, and S Gao. Mllm-tool: A multimodal large language model for tool agent learning.arXiv preprint arXiv:2401.10727, 4, 2024

arXiv 2024

[61] Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: A benchmark for general tool agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 75749–75790. Curran Associates, Inc., 2024

[62] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text, 2022

[63] Ziyue Wang, Junde Wu, Linghan Cai, Chang Han Low, Xihong Yang, Qiaxuan Li, and Yueming Jin. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. arXiv preprint arXiv:2503.18968, 2025

[64] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data, 2023

[65] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

[66] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023

[67] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024

[68] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024

[69] Gelei Xu, Xueyang Li, Yixiong Chen, Yuying Duan, Shuqing Wu, Alexander Yu, Ching-Hao Chiu, Juntong Ni, Ningzhi Tang, Toby Jia-Jun Li, et al. A comprehensive survey of ai agents in healthcare. TechRxiv, 2025

[70] Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, et al. Medagentgym: Training llm agents for code-based medical reasoning at scale. In The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance, 2025

[71] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

[72] Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, and Lichao Sun. Livemedbench: A contamination-free medical benchmark for llms with automated rubric evaluation. arXiv preprint arXiv:2602.10367, 2026

[73] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

[74] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

[75] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

[76] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[77] Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, et al. Medqa-cs: Benchmarking large language models clinical skills using an ai-sce framework. arXiv preprint arXiv:2410.01553, 2024

[78] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024

[79] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023

[80] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2024. URL https://arxiv.org/abs/2305.10415, 40, 2024