AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Andriy Myronenko; Can Zhao; Daguang Xu; Dong Yang; Hardy Chen; Jiawei Mao; Junqi Liu; Pengfei Guo; Selena Song; Tianhao Qi

arxiv: 2606.01961 · v2 · pith:533A6DAVnew · submitted 2026-06-01 · 💻 cs.AI

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu , Selena Song , Yuhan Wang , Jiawei Mao , Hardy Chen , Xiaoke Huang , Tianhao Qi , Pengfei Guo

show 7 more authors

Yucheng Tang Yufan He Can Zhao Andriy Myronenko Dong Yang Daguang Xu Yuyin Zhou

This is my paper

Pith reviewed 2026-06-28 14:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords AutoMedBenchmedical AI researchautonomous agentsbenchmarkworkflow stagesagent errorsmedical imaging tasks

0 comments

The pith

A benchmark shows medical AI agents struggle more with validation than setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AutoMedBench to assess how well autonomous agents can carry out full medical AI research projects instead of just answering questions or making predictions. It structures the work into five stages and tests agents on tasks involving medical images and reports, scoring both the final result and performance at each stage. The key finding is that agents do best at the setup stage but worst at validation, and that most errors happen during verification and submission rather than understanding the task. This provides a way to see inside the research process and identify specific weaknesses in current agent systems.

Core claim

AutoMedBench organizes agent runs on medical imaging tasks into Plan, Setup, Validate, Inference, and Submit stages, with stage-level scoring across thousands of runs showing Validate as the weakest on average and Setup the strongest, while verification and submission failures account for 37.7% and 38.1% of errors.

What carries the argument

The five-stage workflow that breaks down medical AI research into sequential steps for granular evaluation of agent behavior.

If this is right

Agents are stronger at creating executable code than at checking its correctness.
Verification and submission are the main sources of failure in agent workflows.
Error-free runs achieve much higher scores than those with errors.
Task understanding errors occur in less than 1% of cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving self-verification capabilities in agents could address the main performance gap.
The benchmark could be adapted to test agents in non-medical scientific research areas.
Comparing Lite and Standard tiers may show how much human guidance agents still require.

Load-bearing premise

The chosen five-stage workflow and medical imaging tasks represent the key challenges in actual medical AI research without missing important elements or adding unrealistic limits.

What would settle it

Finding that in repeated experiments with new agents or tasks, the validation stage is no longer the weakest or that error distributions change significantly would challenge the claim.

Figures

Figures reproduced from arXiv: 2606.01961 by Andriy Myronenko, Can Zhao, Daguang Xu, Dong Yang, Hardy Chen, Jiawei Mao, Junqi Liu, Pengfei Guo, Selena Song, Tianhao Qi, Xiaoke Huang, Yucheng Tang, Yufan He, Yuhan Wang, Yuyin Zhou.

**Figure 1.** Figure 1: Overall leaderboard. Overall, agentic, and task scores for the 6 evaluated agents. Agents are ranked by overall scores. The overall score averages the workflow-based agentic score and the held-out task score. Per-track leaderboards are in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: AUTOMEDBENCH: a workflow-aware benchmark for autonomous medical AI research. Left: Tasks are sourced from 20+ public challenges (e.g., KiTS19 [25]) spanning diverse modalities (CT, MRI, X-ray, ultrasound, video) and task types (segmentation, detection, VQA, report generation, and image enhancement). Each task provides a natural-language description and deliverable target, with two difficulty tiers: Lite (… view at source ↗

**Figure 3.** Figure 3: AUTOMEDBENCH scoring rubrics. The overall score is the equal-weighted average of Task Score and Agentic Score (×0.5 each). Task Score is computed deterministically from agent predictions or answers. Agentic Score combines deterministic checks and LLM judge scores across the S1–S5 workflow stages, weighted as: S1 Plan (25%), S2 Setup (15%), S3 Validate (35%), S4 Inference (15%), and S5 Submit (10%). S1, S2,… view at source ↗

**Figure 4.** Figure 4: AUTOMEDBENCH provides stage-level evaluation for medical research agents. Unlike most prior benchmarks that only measure the final output, AUTOMEDBENCH tracks the full workflow from planning to submission, making it possible to identify where agents fail during the research process. This process-aware evaluation reveals hidden failure modes, workflow weaknesses, and error patterns that are not visible fr… view at source ↗

**Figure 5.** Figure 5: Step-level workflow scoring across agents. Scores are shown for the six evaluated agents at each workflow stage: S1 Plan, S2 Setup, S3 Validate, S4 Inference, and S5 Submit. The dashed line marks the mean score across agents for each stage. Setup is the strongest stage on average, while validation is the weakest, showing that agents are better at making pipelines runnable than at checking whether those pip… view at source ↗

**Figure 6.** Figure 6: Higher cost does not reliably translate into better performance. Bars show the mean cost per run for each task track. Insets plot each agent’s track-level cost against score, with Pearson correlation r summarizing the cost–performance relationship within that track. The weak and trackdependent correlations indicate that raw spending is not the main driver of success. The API-price snapshot is listed in [… view at source ↗

**Figure 7.** Figure 7: Error codes can sharply derail a run. (a) Distribution of fired error-code types. (b) Mean overall score by the number of fired error codes in a run. Verification and submission errors dominate tagged failures. The first fired error produces a large score drop, and runs with two or more fired errors remain in a low-score regime. Fired error codes mark a performance cliff. Figure 7b shows that runs with fir… view at source ↗

**Figure 8.** Figure 8: Strong agents both avoid errors and recover from them. Left: total fired error-code counts across agents. Right: recovery rate after two or more fired error codes, defined as the percentage of such runs that still reach end-to-end completion and receive an evaluation score. Strong agents are better at recovering after multiple fired errors [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Track-wise leaderboard breakdown. Overall, agentic, and task scores are shown for each evaluated agent across segmentation, image enhancement, VQA, report generation, and lesion detection. The breakdown shows that overall rank masks task-track specialization: Opus 4.6 leads most tracks, while GLM-5 leads VQA and several agents remain competitive on detection. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗

read the original abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoMedBench gives stage-level scoring on medical agent runs but the five-stage split looks imposed rather than derived from real workflows.

read the letter

The paper's main new piece is a benchmark that splits agent execution on medical imaging tasks into Plan, Setup, Validate, Inference, and Submit, then scores each stage separately instead of only the final output. They run agents across segmentation, enhancement, VQA, report generation, and detection, with lite and standard versions of each task, and collect stage scores plus error tags over thousands of runs. The clearest observation is that Validate scores lowest on average while Setup scores highest, and that verification plus submission errors account for most of the tagged failures.

This setup does something useful by making the internal process visible. The error counts and the comparison of scores with versus without fired errors are concrete and easy to interpret. The dual difficulty tiers also let them check how much scaffolding affects the results.

The soft spot is the workflow itself. The five stages are defined by the authors, yet the abstract gives no indication that the boundaries or scoring rules were checked against actual researcher logs or that they cover steps like data collection or hyperparameter search that sit outside Validate. If the tasks force execution into these buckets, the Setup-Validate gap could simply reflect the design rather than a general limit on current agents. The stress-test concern lands here because nothing in the provided material shows the stages were validated externally.

The numbers rest on volume of runs rather than on fitted models, so circularity is not an issue. Baseline selection and exact task prompts would need checking in the full methods, but the scale helps.

This is for people building or evaluating autonomous agents in medical imaging. A reader who needs tools to diagnose where agents break in longer workflows would get direct value from the stage scores. It deserves peer review because the benchmark supplies a workable framework even if the workflow justification needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks. It organizes execution into a fixed five-stage workflow (Plan, Setup, Validate, Inference, Submit) across five imaging tracks (segmentation, enhancement, VQA, report generation, lesion detection) with Lite/Standard difficulty tiers. Using thousands of agent runs (average 33 turns each), it reports stage-level scores showing Validate as weakest and Setup as strongest on average, plus error tagging where verification (37.7%) and submission (38.1%) failures dominate while task-understanding errors are rare (0.9%).

Significance. If the imposed workflow and task definitions are representative, the stage-level and error-tagged results provide actionable diagnostics for agent development by pinpointing verification as a bottleneck. The scale of experimentation and dual scoring (final performance + per-stage) are strengths that enable reproducible comparisons. The benchmark could support future work on long-horizon medical agents if the decomposition is shown to align with actual research practice.

major comments (2)

[Abstract and §3] Abstract and §3 (Workflow Definition): The central interpretive claim—that agents are better at executable pipelines (Setup) than verifying reliability (Validate)—rests on the assumption that the S1-S5 decomposition accurately mirrors real medical-AI research without omitting steps such as data acquisition or hyperparameter search outside Validate. The manuscript provides no derivation, expert validation, or comparison to researcher logs for the stage boundaries or rubrics, raising the risk that the observed score gap is an artifact of how execution is bucketed rather than an intrinsic agent limitation.
[§4] §4 (Experimental Setup) and error analysis: The post-run error tagging reports verification and submission failures as dominant (37.7% and 38.1%), but the manuscript does not detail the tagging protocol, inter-annotator agreement, or how "fired codes" are assigned across agent traces. Without this, it is unclear whether the 48% score drop for runs with one error is robust or sensitive to tagging choices, which directly affects the reliability of the error-dominance conclusion.

minor comments (2)

[Abstract] The abstract states each run averages 33 turns but does not specify variance or distribution across Lite vs. Standard tiers; adding this would clarify the long-horizon claim.
[Results] Table or figure presenting per-stage scores should include confidence intervals or statistical tests for the Validate < Setup difference to support the "on average" claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving transparency and justification in the manuscript. We address each major comment below.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Workflow Definition): The central interpretive claim—that agents are better at executable pipelines (Setup) than verifying reliability (Validate)—rests on the assumption that the S1-S5 decomposition accurately mirrors real medical-AI research without omitting steps such as data acquisition or hyperparameter search outside Validate. The manuscript provides no derivation, expert validation, or comparison to researcher logs for the stage boundaries or rubrics, raising the risk that the observed score gap is an artifact of how execution is bucketed rather than an intrinsic agent limitation.

Authors: We agree that the manuscript lacks an explicit derivation or external validation of the S1-S5 boundaries. In the revision, we will add a dedicated subsection in §3 that provides a step-by-step rationale for each stage, grounded in common medical-AI research practices (e.g., planning precedes implementation, validation precedes inference). We will also include a limitations paragraph noting that steps such as raw data acquisition are treated as inputs to the task brief rather than agent responsibilities, and that the decomposition is a pragmatic abstraction rather than a claim of exhaustive coverage. This will make the interpretive claim more clearly scoped to the defined workflow. revision: yes
Referee: [§4] §4 (Experimental Setup) and error analysis: The post-run error tagging reports verification and submission failures as dominant (37.7% and 38.1%), but the manuscript does not detail the tagging protocol, inter-annotator agreement, or how "fired codes" are assigned across agent traces. Without this, it is unclear whether the 48% score drop for runs with one error is robust or sensitive to tagging choices, which directly affects the reliability of the error-dominance conclusion.

Authors: We concur that the error-tagging methodology requires fuller documentation. In the revised §4 we will insert a complete description of the tagging protocol, the full error code taxonomy with definitions and examples, the assignment procedure across traces, and any consistency checks performed. If tagging was conducted by a single annotator we will state this explicitly and note it as a limitation; if multiple annotators were used we will report agreement statistics. These additions will allow readers to evaluate the sensitivity of the reported error distributions and the 48% score drop. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark reporting direct measurements

full rationale

The paper defines a new benchmark (AutoMedBench) with an author-specified five-stage workflow and five imaging tasks, then reports empirical observations from thousands of agent runs on those tasks. No equations, derivations, or fitted parameters are present that reduce any claimed result to its own inputs by construction. Stage-level scores (Validate weakest, Setup strongest) and error percentages are direct aggregates from the defined evaluation rubric, not predictions or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the abstract or described content. The work is self-contained as an empirical study against its own benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces a benchmark definition and reports empirical results without fitted parameters or new physical entities; it relies on the domain assumption that the staged workflow models research processes.

axioms (1)

domain assumption The five-stage workflow accurately models medical-AI research processes.
The entire benchmark and scoring system is constructed around this division of agent execution.

pith-pipeline@v0.9.1-grok · 5875 in / 1243 out tokens · 31437 ms · 2026-06-28T14:36:15.826907+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

80 extracted references · 19 canonical work pages

[1]

AAPM low-dose CT grand challenge (LDCT-SimNICT).https://www.aapm

AAPM. AAPM low-dose CT grand challenge (LDCT-SimNICT).https://www.aapm. org/GrandChallenge/LowDoseCT/, 2016

2016
[2]

Claude Opus 4.6 system card.https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026

Anthropic. Claude Opus 4.6 system card.https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026. 14

2026
[3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015

2015
[4]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Qui ˜nonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models to- wards improved human health, 2025. URLhttps://arxiv.org/abs/2505.08775

Pith/arXiv arXiv 2025
[5]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005

2005
[6]

Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

arXiv 2025
[7]

Holistic eval- uation of large language models for medical tasks with MedHELM.Nature Medicine, 32(3): 943–951, 2026

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, et al. Holistic eval- uation of large language models for medical tasks with MedHELM.Nature Medicine, 32(3): 943–951, 2026. doi: 10.1038/s41591-025-04151-2. URLhttps://www.nature.com/ articles/s41591-025-04151-2

work page doi:10.1038/s41591-025-04151-2 2026
[8]

HealthAdmin- Bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, et al. HealthAdmin- Bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

Pith/arXiv arXiv 2026
[9]

PANTHER challenge: Public training dataset, 2025

Amparo Soeli Betancourt Tarifa, Faisal Mahmood, Uffe Bernchou, and Peter Jan Koopmans. PANTHER challenge: Public training dataset, 2025. URLhttps://doi.org/10.5281/ zenodo.15192302

2025
[10]

PanTS: Pancreatic tumor segmentation.https://huggingface.co/ datasets/BodyMaps/PanTSMini, 2024

BodyMaps. PanTS: Pancreatic tumor segmentation.https://huggingface.co/ datasets/BodyMaps/PanTSMini, 2024

2024
[11]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pp. 213–229. Springer, 2020

2020
[12]

Langlotz

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P. Langlotz. CheXpert Plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats, 2024. URLhttps://arxiv.org/abs/2405.19538

arXiv 2024
[13]

MLE-bench: Evaluating machine learning agents on machine learning engineering,

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE-bench: Evaluating machine learning agents on machine learning engineering,
[14]

URLhttps://arxiv.org/abs/2410.07095

Pith/arXiv arXiv
[15]

IEEE Transactions on Medi- cal Imaging36(8), 1597–1606 (Aug 2017).https://doi.org/10.1109/TMI.2017

Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, 2017. doi: 10.1109/TMI.2017. 2715284

work page doi:10.1109/tmi.2017 2017
[16]

Generating radiology reports via memory-driven transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven transformer. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 1439–1449, 2020

2020
[17]

Kohli, Marc B

Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. McDonald. Preparing a col- lection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016. doi: 10.1093/jamia/ocv080. URL https://doi.org/1...

work page doi:10.1093/jamia/ocv080 2016
[18]

Lee R. Dice. Measures of the amount of ecologic association between species.Ecology, 26 (3):297–302, 1945. doi: 10.2307/1932409

work page doi:10.2307/1932409 1945
[19]

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisser- man. The PASCAL visual object classes (VOC) challenge.International Journal of Computer Vision, 88(2):303–338, 2010. doi: 10.1007/s11263-009-0275-4

work page doi:10.1007/s11263-009-0275-4 2010
[20]

FeTA challenge: Fetal brain tissue annotation and segmentation.https: //fetachallenge.github.io/, 2021

FeTA Organizers. FeTA challenge: Fetal brain tissue annotation and segmentation.https: //fetachallenge.github.io/, 2021

2021
[21]

Camyla: Scaling autonomous research in medical image segmentation.arXiv preprint arXiv:2604.10696, 2026

Yifan Gao, Haoyue Li, Feng Yuan, Xin Gao, Weiran Huang, and Xiaosong Wang. Camyla: Scaling autonomous research in medical image segmentation.arXiv preprint arXiv:2604.10696, 2026

Pith/arXiv arXiv 2026
[22]

GLM-5: From vibe coding to agentic engineering, 2026

GLM-5 Team. GLM-5: From vibe coding to agentic engineering, 2026. URLhttps:// arxiv.org/abs/2602.15763

Pith/arXiv arXiv 2026
[23]

Gemini 3.1 Pro model card.https://deepmind.google/ models/model-cards/gemini-3-1-pro/, February 2026

Google DeepMind. Gemini 3.1 Pro model card.https://deepmind.google/ models/model-cards/gemini-3-1-pro/, February 2026

2026
[24]

DENTEX: Dental enumeration and diagnosis on panoramic x-rays.https://huggingface.co/datasets/ibrahimhamamci/DENTEX, 2023

Ibrahim Ethem Hamamci et al. DENTEX: Dental enumeration and diagnosis on panoramic x-rays.https://huggingface.co/datasets/ibrahimhamamci/DENTEX, 2023

2023
[25]

Xing, and Pengtao Xie

Xuehai He, Yichen Zhang, Luntian Mou, Eric P. Xing, and Pengtao Xie. PathVQA: 30000+ questions for medical visual question answering, 2020. URLhttps://arxiv.org/abs/ 2003.10286

Pith/arXiv arXiv 2020
[26]

KiTS19: Kidney tumor segmentation challenge.https://kits19

Nicholas Heller et al. KiTS19: Kidney tumor segmentation challenge.https://kits19. grand-challenge.org/, 2019

2019
[27]

Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsim- pourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, and Karan Sing- hal. Healthbench professional: Evaluating large language models on real clinician chats, 2026. URL...

Pith/arXiv arXiv 2026
[28]

MLAgentBench: Evaluating lan- guage agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating lan- guage agents on machine learning experimentation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pp. 20271–20309. PMLR, 2024. URLhttps://proceedings.mlr.press/v235/ huang24y.html

2024
[29]

Nat Methods18(2), 203–211 (Feb 2021)

Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18:203–211, 2021. doi: 10.1038/s41592-020-01008-z

work page doi:10.1038/s41592-020-01008-z 2021
[30]

RadGraph: Extracting clinical entities and relations from radiology reports.PhysioNet, 2021

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Matthew Lungren, Andrew Ng, Curtis Langlotz, and Pranav Rajpurkar. RadGraph: Extracting clinical entities and relations from radiology reports.PhysioNet, 2021. doi: 10.13026/hm87-5p47. Version 1.0.0

work page doi:10.13026/hm87-5p47 2021
[31]

DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents

Peter Jansen, Marc-Alexandre C ˆot´e, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents. InAdvances in Neural Information Processing Systems, volume 37, 2024. doi: 10.52202/079017-0324. URLh...

work page doi:10.52202/079017-0324 2024
[32]

Black, Gloria Geng, Danny Park, James Zou, Andrew Y

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y . Ng, and Jonathan H. Chen. MedAgentBench: A realistic virtual EHR environment to benchmark med- ical LLM agents, 2025. URLhttps://arxiv.org/abs/2501.14654. 16

arXiv 2025
[33]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub is- sues? InInternational Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=VTF8yNQM66

2024
[34]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421. URL https://www.mdpi.com/2076-3417/11/14/6421

work page doi:10.3390/app11146421 2021
[35]

URLhttps://doi.org/10.18653/v1/D19-1259

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Con- ference on Natural Language Processing, pp. 2567–2577, 2019. doi: 10.18653/v1/D19-1259. URLhttp...

work page doi:10.18653/v1/d19-1259 2019
[36]

Johnson et al

Alistair E.W. Johnson et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs.https://physionet.org/content/mimic-cxr/, 2019

2019
[37]

BCCD: Blood cell count and detection dataset.https://huggingface

Kerem Berke. BCCD: Blood cell count and detection dataset.https://huggingface. co/datasets/keremberke/blood-cell-object-detection, 2018

2018
[38]

Scientific Data5, 180251 (2018)

Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific Data, 5:180251, 2018. doi: 10.1038/sdata.2018.251. URLhttps://www.nature.com/ articles/sdata2018251

work page doi:10.1038/sdata.2018.251 2018
[39]

FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering.arXiv preprint arXiv:2509.19319, 2025

Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Jong Ha Lee, et al. FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering.arXiv preprint arXiv:2509.19319, 2025

arXiv 2025
[40]

AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization

Yusheng Liao, Chuan Xuan, Yutong Cai, Lina Yang, Zhe Chen, Yanfeng Wang, and Yu Wang. AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization. arXiv preprint arXiv:2601.13918, 2026

arXiv 2026
[41]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summa- rization Branches Out, pp. 74–81, 2004

2004
[42]

Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023

Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023. doi: 10.1016/j.artmed.2023.102611

work page doi:10.1016/j.artmed.2023.102611 2023
[43]

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. S ´anchez. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017. doi: 10.1016/j.media.2017.07.005

work page doi:10.1016/j.media.2017.07.005 2017
[44]

SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1650–1654,

2021
[45]

URLhttps://doi.org/10.1109/ ISBI48211.2021.9434010

doi: 10.1109/ISBI48211.2021.9434010. URLhttps://doi.org/10.1109/ ISBI48211.2021.9434010

work page doi:10.1109/isbi48211.2021.9434010 2021
[46]

PhysicianBench: Evaluating LLM agents in real-world EHR environments.arXiv preprint arXiv:2605.02240, 2026

Ruoqi Liu, Imran Q Mohiuddin, Austin J Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L Vemu, Shivam C Vedak, Kameron C Black, John L Havlik, Isaac Ogunmola, et al. PhysicianBench: Evaluating LLM agents in real-world EHR environments.arXiv preprint arXiv:2605.02240, 2026

Pith/arXiv arXiv 2026
[47]

Agentbench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. InInternational Conference on Learning Representatio...

2024
[48]

The AI scientist: Towards fully automated open-ended scientific discovery, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps: //arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024
[49]

Agentboard: An analytical evaluation board of multi-turn LLM agents, 2024

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn LLM agents, 2024. URLhttps://arxiv.org/abs/2401.13178

arXiv 2024
[50]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps: //arxiv.org/abs/2601.11868

Pith/arXiv arXiv 2026
[51]

GAIA: a benchmark for general AI assistants

Gr ´egoire Mialon, Cl´ementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id= fibxvahvs3

2024
[52]

The MiniMax-M2 series: Mini activations unleashing max real-world intelligence,

MiniMax. The MiniMax-M2 series: Mini activations unleashing max real-world intelligence,
[53]

URLhttps://arxiv.org/abs/2605.26494

Pith/arXiv arXiv
[54]

GRAZPEDWRI-DX: A pediatric wrist radiograph dataset.https:// figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193, 2022

Eszter Nagy et al. GRAZPEDWRI-DX: A pediatric wrist radiograph dataset.https:// figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193, 2022

arXiv 2022
[55]

MLGym: A new framework and benchmark for advancing AI research agents

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Mikhail Plekhanov, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Nicolaus Foer- ster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym: A new framework and b...

2025
[56]

Nguyen et al

Ha Q. Nguyen et al. VinDr-CXR: An open dataset of chest x-rays with radiologist’s annota- tions.https://vindr.ai/datasets/vindr-cxr, 2022

2022
[57]

GPT-5.4 Thinking system card.https://deploymentsafety.openai

OpenAI. GPT-5.4 Thinking system card.https://deploymentsafety.openai. com/gpt-5-4-thinking/gpt-5-4-thinking.pdf, March 2026

2026
[58]

MedMCQA: A large- scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large- scale multi-subject multi-choice dataset for medical domain question answering. InProceed- ings of the Conference on Health, Inference, and Learning, volume 174 ofProceedings of Machine Learning Research, pp. 248–260. PMLR, 2022. URLhttps://proceedings. mlr.press/v174/pal22a.html

2022
[59]

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , year =

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for auto- matic evaluation of machine translation. InProceedings of the 40th Annual Meeting of the As- sociation for Computational Linguistics, pp. 311–318, 2002. doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002
[60]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog? id=qwen3.5, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog? id=qwen3.5, February 2026

2026
[61]

AeroPath: Airway segmentation dataset.https://github.com/ raidionics/AeroPath, 2023

Raidionics. AeroPath: Airway segmentation dataset.https://github.com/ raidionics/AeroPath, 2023

2023
[62]

You Only Look Once: Unified, Real -Time Object Detection[J].IEEE, 2016.DOI:10.1109/CVPR.2016.91

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Uni- fied, real-time object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016. doi: 10.1109/CVPR.2016.91

work page doi:10.1109/cvpr.2016.91 2016
[63]

Faster R-CNN: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Processing Systems, volume 28, pp. 91–99, 2015

2015
[64]

U-Net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention, volume 9351 ofLecture Notes in Computer Science, pp. 234–241. Springer, 2015. 18

2015
[65]

Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical en- vironments.arXiv preprint arXiv:2405.07960, 2024

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical en- vironments.arXiv preprint arXiv:2405.07960, 2024

Pith/arXiv arXiv 2024
[66]

Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records

Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C Ho, Carl Yang, and May Dongmei Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22315–22339, 2024

2024
[67]

Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-Bench: Fostering the credibility of published research through a computational re- producibility agent benchmark.Transactions on Machine Learning Research, 2025. URL https://openreview.net/forum?id=BsMMc4MEGS

2025
[68]

Large language models encode clinical knowledge.Nature, 620:172–180, 2023

Karan Singhal, Shekoofeh Azizi Tu, Julia Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Peter Schuh, Karan Sareen, David Winer, Denny Wilson, et al. Large language models encode clinical knowledge.Nature, 620:172–180, 2023. doi: 10.1038/s41586-023-06291-2. URL https://www.nature.com/articles/s41586-023-06291-2

work page doi:10.1038/s41586-023-06291-2 2023
[69]

PaperBench: Evaluating AI’s ability to replicate AI research,

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research,
[70]

URLhttps://arxiv.org/abs/2504.01848

Pith/arXiv arXiv
[71]

PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology, 2023

Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology, 2023. URLhttps://arxiv.org/abs/2305.15072

arXiv 2023
[72]

MedXpertQA-MM: Multimodal expert medical question answering.https: //huggingface.co/datasets/TsinghuaC3I/MedXpertQA, 2024

TsinghuaC3I. MedXpertQA-MM: Multimodal expert medical question answering.https: //huggingface.co/datasets/TsinghuaC3I/MedXpertQA, 2024

2024
[73]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assess- ment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861

work page doi:10.1109/tip.2003.819861 2004
[74]

TotalSegmentator: Robust segmentation of 104 anatomical structures in CT.https://github.com/wasserth/TotalSegmentator, 2023

Jakob Wasserthal et al. TotalSegmentator: Robust segmentation of 104 anatomical structures in CT.https://github.com/wasserth/TotalSegmentator, 2023

2023
[75]

OSWorld: Benchmark- ing multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmark- ing multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/...

Pith/arXiv arXiv 2024
[76]

MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10:41, 2023

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10:41, 2023. doi: 10.1038/s41597-022-01721-8. URL https://www.nature.com/articles/s41597-022-01721-8

work page doi:10.1038/s41597-022-01721-8 2023
[77]

Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024. URL https://arxiv.org/abs/2407.15711

arXiv 2024
[78]

MedFrameQA: A multi-image medical VQA benchmark for clinical reasoning, 2025

Suhao Yu, Haojin Wang, Juncheng Wu, Luyang Luo, Jingshen Wang, Cihang Xie, Pranav Rajpurkar, Carl Yang, Yang Yang, Kang Wang, Yannan Yu, and Yuyin Zhou. MedFrameQA: A multi-image medical VQA benchmark for clinical reasoning, 2025. URLhttps://arxiv. org/abs/2505.16964

arXiv 2025
[79]

Muckley, Mary Bruno, Aaron De- fazio, Marc Parente, Krzysztof J

Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron De- fazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. 19 Sodick...

Pith/arXiv arXiv 2018
[80]

open-source

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realis- tic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=oKn9c6ytLx. 20 APPENDIXCONTENT...

2024

[1] [1]

AAPM low-dose CT grand challenge (LDCT-SimNICT).https://www.aapm

AAPM. AAPM low-dose CT grand challenge (LDCT-SimNICT).https://www.aapm. org/GrandChallenge/LowDoseCT/, 2016

2016

[2] [2]

Claude Opus 4.6 system card.https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026

Anthropic. Claude Opus 4.6 system card.https://www-cdn.anthropic.com/ 14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf, February 2026. 14

2026

[3] [3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015

2015

[4] [4]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Qui ˜nonero- Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models to- wards improved human health, 2025. URLhttps://arxiv.org/abs/2505.08775

Pith/arXiv arXiv 2025

[5] [5]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005

2005

[6] [6]

Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

arXiv 2025

[7] [7]

Holistic eval- uation of large language models for medical tasks with MedHELM.Nature Medicine, 32(3): 943–951, 2026

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, et al. Holistic eval- uation of large language models for medical tasks with MedHELM.Nature Medicine, 32(3): 943–951, 2026. doi: 10.1038/s41591-025-04151-2. URLhttps://www.nature.com/ articles/s41591-025-04151-2

work page doi:10.1038/s41591-025-04151-2 2026

[8] [8]

HealthAdmin- Bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, et al. HealthAdmin- Bench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

Pith/arXiv arXiv 2026

[9] [9]

PANTHER challenge: Public training dataset, 2025

Amparo Soeli Betancourt Tarifa, Faisal Mahmood, Uffe Bernchou, and Peter Jan Koopmans. PANTHER challenge: Public training dataset, 2025. URLhttps://doi.org/10.5281/ zenodo.15192302

2025

[10] [10]

PanTS: Pancreatic tumor segmentation.https://huggingface.co/ datasets/BodyMaps/PanTSMini, 2024

BodyMaps. PanTS: Pancreatic tumor segmentation.https://huggingface.co/ datasets/BodyMaps/PanTSMini, 2024

2024

[11] [11]

End-to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pp. 213–229. Springer, 2020

2020

[12] [12]

Langlotz

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P. Langlotz. CheXpert Plus: Augmenting a large chest x-ray dataset with text radiology reports, patient demographics and additional image formats, 2024. URLhttps://arxiv.org/abs/2405.19538

arXiv 2024

[13] [13]

MLE-bench: Evaluating machine learning agents on machine learning engineering,

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE-bench: Evaluating machine learning agents on machine learning engineering,

[14] [14]

URLhttps://arxiv.org/abs/2410.07095

Pith/arXiv arXiv

[15] [15]

IEEE Transactions on Medi- cal Imaging36(8), 1597–1606 (Aug 2017).https://doi.org/10.1109/TMI.2017

Hu Chen, Yi Zhang, Mannudeep K. Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Transactions on Medical Imaging, 36(12):2524–2535, 2017. doi: 10.1109/TMI.2017. 2715284

work page doi:10.1109/tmi.2017 2017

[16] [16]

Generating radiology reports via memory-driven transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven transformer. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 1439–1449, 2020

2020

[17] [17]

Kohli, Marc B

Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. McDonald. Preparing a col- lection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016. doi: 10.1093/jamia/ocv080. URL https://doi.org/1...

work page doi:10.1093/jamia/ocv080 2016

[18] [18]

Lee R. Dice. Measures of the amount of ecologic association between species.Ecology, 26 (3):297–302, 1945. doi: 10.2307/1932409

work page doi:10.2307/1932409 1945

[19] [19]

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisser- man. The PASCAL visual object classes (VOC) challenge.International Journal of Computer Vision, 88(2):303–338, 2010. doi: 10.1007/s11263-009-0275-4

work page doi:10.1007/s11263-009-0275-4 2010

[20] [20]

FeTA challenge: Fetal brain tissue annotation and segmentation.https: //fetachallenge.github.io/, 2021

FeTA Organizers. FeTA challenge: Fetal brain tissue annotation and segmentation.https: //fetachallenge.github.io/, 2021

2021

[21] [21]

Camyla: Scaling autonomous research in medical image segmentation.arXiv preprint arXiv:2604.10696, 2026

Yifan Gao, Haoyue Li, Feng Yuan, Xin Gao, Weiran Huang, and Xiaosong Wang. Camyla: Scaling autonomous research in medical image segmentation.arXiv preprint arXiv:2604.10696, 2026

Pith/arXiv arXiv 2026

[22] [22]

GLM-5: From vibe coding to agentic engineering, 2026

GLM-5 Team. GLM-5: From vibe coding to agentic engineering, 2026. URLhttps:// arxiv.org/abs/2602.15763

Pith/arXiv arXiv 2026

[23] [23]

Gemini 3.1 Pro model card.https://deepmind.google/ models/model-cards/gemini-3-1-pro/, February 2026

Google DeepMind. Gemini 3.1 Pro model card.https://deepmind.google/ models/model-cards/gemini-3-1-pro/, February 2026

2026

[24] [24]

DENTEX: Dental enumeration and diagnosis on panoramic x-rays.https://huggingface.co/datasets/ibrahimhamamci/DENTEX, 2023

Ibrahim Ethem Hamamci et al. DENTEX: Dental enumeration and diagnosis on panoramic x-rays.https://huggingface.co/datasets/ibrahimhamamci/DENTEX, 2023

2023

[25] [25]

Xing, and Pengtao Xie

Xuehai He, Yichen Zhang, Luntian Mou, Eric P. Xing, and Pengtao Xie. PathVQA: 30000+ questions for medical visual question answering, 2020. URLhttps://arxiv.org/abs/ 2003.10286

Pith/arXiv arXiv 2020

[26] [26]

KiTS19: Kidney tumor segmentation challenge.https://kits19

Nicholas Heller et al. KiTS19: Kidney tumor segmentation challenge.https://kits19. grand-challenge.org/, 2019

2019

[27] [27]

Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsim- pourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, and Karan Sing- hal. Healthbench professional: Evaluating large language models on real clinician chats, 2026. URL...

Pith/arXiv arXiv 2026

[28] [28]

MLAgentBench: Evaluating lan- guage agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating lan- guage agents on machine learning experimentation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pp. 20271–20309. PMLR, 2024. URLhttps://proceedings.mlr.press/v235/ huang24y.html

2024

[29] [29]

Nat Methods18(2), 203–211 (Feb 2021)

Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18:203–211, 2021. doi: 10.1038/s41592-020-01008-z

work page doi:10.1038/s41592-020-01008-z 2021

[30] [30]

RadGraph: Extracting clinical entities and relations from radiology reports.PhysioNet, 2021

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Matthew Lungren, Andrew Ng, Curtis Langlotz, and Pranav Rajpurkar. RadGraph: Extracting clinical entities and relations from radiology reports.PhysioNet, 2021. doi: 10.13026/hm87-5p47. Version 1.0.0

work page doi:10.13026/hm87-5p47 2021

[31] [31]

DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents

Peter Jansen, Marc-Alexandre C ˆot´e, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, and Peter Clark. DiscoveryWorld: A virtual environment for developing and evaluating automated scientific discovery agents. InAdvances in Neural Information Processing Systems, volume 37, 2024. doi: 10.52202/079017-0324. URLh...

work page doi:10.52202/079017-0324 2024

[32] [32]

Black, Gloria Geng, Danny Park, James Zou, Andrew Y

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y . Ng, and Jonathan H. Chen. MedAgentBench: A realistic virtual EHR environment to benchmark med- ical LLM agents, 2025. URLhttps://arxiv.org/abs/2501.14654. 16

arXiv 2025

[33] [33]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub is- sues? InInternational Conference on Learning Representations, 2024. URLhttps: //openreview.net/forum?id=VTF8yNQM66

2024

[34] [34]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421. URL https://www.mdpi.com/2076-3417/11/14/6421

work page doi:10.3390/app11146421 2021

[35] [35]

URLhttps://doi.org/10.18653/v1/D19-1259

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Con- ference on Natural Language Processing, pp. 2567–2577, 2019. doi: 10.18653/v1/D19-1259. URLhttp...

work page doi:10.18653/v1/d19-1259 2019

[36] [36]

Johnson et al

Alistair E.W. Johnson et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs.https://physionet.org/content/mimic-cxr/, 2019

2019

[37] [37]

BCCD: Blood cell count and detection dataset.https://huggingface

Kerem Berke. BCCD: Blood cell count and detection dataset.https://huggingface. co/datasets/keremberke/blood-cell-object-detection, 2018

2018

[38] [38]

Scientific Data5, 180251 (2018)

Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific Data, 5:180251, 2018. doi: 10.1038/sdata.2018.251. URLhttps://www.nature.com/ articles/sdata2018251

work page doi:10.1038/sdata.2018.251 2018

[39] [39]

FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering.arXiv preprint arXiv:2509.19319, 2025

Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Jong Ha Lee, et al. FHIR-AgentBench: Benchmarking LLM agents for realistic interoperable EHR question answering.arXiv preprint arXiv:2509.19319, 2025

arXiv 2025

[40] [40]

AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization

Yusheng Liao, Chuan Xuan, Yutong Cai, Lina Yang, Zhe Chen, Yanfeng Wang, and Yu Wang. AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization. arXiv preprint arXiv:2601.13918, 2026

arXiv 2026

[41] [41]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summa- rization Branches Out, pp. 74–81, 2004

2004

[42] [42]

Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023

Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey.Artificial Intelligence in Medicine, 143:102611, 2023. doi: 10.1016/j.artmed.2023.102611

work page doi:10.1016/j.artmed.2023.102611 2023

[43] [43]

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A. W. M. van der Laak, Bram van Ginneken, and Clara I. S ´anchez. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017. doi: 10.1016/j.media.2017.07.005

work page doi:10.1016/j.media.2017.07.005 2017

[44] [44]

SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1650–1654,

2021

[45] [45]

URLhttps://doi.org/10.1109/ ISBI48211.2021.9434010

doi: 10.1109/ISBI48211.2021.9434010. URLhttps://doi.org/10.1109/ ISBI48211.2021.9434010

work page doi:10.1109/isbi48211.2021.9434010 2021

[46] [46]

PhysicianBench: Evaluating LLM agents in real-world EHR environments.arXiv preprint arXiv:2605.02240, 2026

Ruoqi Liu, Imran Q Mohiuddin, Austin J Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L Vemu, Shivam C Vedak, Kameron C Black, John L Havlik, Isaac Ogunmola, et al. PhysicianBench: Evaluating LLM agents in real-world EHR environments.arXiv preprint arXiv:2605.02240, 2026

Pith/arXiv arXiv 2026

[47] [47]

Agentbench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. InInternational Conference on Learning Representatio...

2024

[48] [48]

The AI scientist: Towards fully automated open-ended scientific discovery, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps: //arxiv.org/abs/2408.06292

Pith/arXiv arXiv 2024

[49] [49]

Agentboard: An analytical evaluation board of multi-turn LLM agents, 2024

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn LLM agents, 2024. URLhttps://arxiv.org/abs/2401.13178

arXiv 2024

[50] [50]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps: //arxiv.org/abs/2601.11868

Pith/arXiv arXiv 2026

[51] [51]

GAIA: a benchmark for general AI assistants

Gr ´egoire Mialon, Cl´ementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id= fibxvahvs3

2024

[52] [52]

The MiniMax-M2 series: Mini activations unleashing max real-world intelligence,

MiniMax. The MiniMax-M2 series: Mini activations unleashing max real-world intelligence,

[53] [53]

URLhttps://arxiv.org/abs/2605.26494

Pith/arXiv arXiv

[54] [54]

GRAZPEDWRI-DX: A pediatric wrist radiograph dataset.https:// figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193, 2022

Eszter Nagy et al. GRAZPEDWRI-DX: A pediatric wrist radiograph dataset.https:// figshare.com/articles/dataset/GRAZPEDWRI-DX/14825193, 2022

arXiv 2022

[55] [55]

MLGym: A new framework and benchmark for advancing AI research agents

Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Mikhail Plekhanov, Amar Budhiraja, Despoina Magka, Vladislav V orotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Nicolaus Foer- ster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym: A new framework and b...

2025

[56] [56]

Nguyen et al

Ha Q. Nguyen et al. VinDr-CXR: An open dataset of chest x-rays with radiologist’s annota- tions.https://vindr.ai/datasets/vindr-cxr, 2022

2022

[57] [57]

GPT-5.4 Thinking system card.https://deploymentsafety.openai

OpenAI. GPT-5.4 Thinking system card.https://deploymentsafety.openai. com/gpt-5-4-thinking/gpt-5-4-thinking.pdf, March 2026

2026

[58] [58]

MedMCQA: A large- scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large- scale multi-subject multi-choice dataset for medical domain question answering. InProceed- ings of the Conference on Health, Inference, and Learning, volume 174 ofProceedings of Machine Learning Research, pp. 248–260. PMLR, 2022. URLhttps://proceedings. mlr.press/v174/pal22a.html

2022

[59] [59]

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , year =

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for auto- matic evaluation of machine translation. InProceedings of the 40th Annual Meeting of the As- sociation for Computational Linguistics, pp. 311–318, 2002. doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002

[60] [60]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog? id=qwen3.5, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog? id=qwen3.5, February 2026

2026

[61] [61]

AeroPath: Airway segmentation dataset.https://github.com/ raidionics/AeroPath, 2023

Raidionics. AeroPath: Airway segmentation dataset.https://github.com/ raidionics/AeroPath, 2023

2023

[62] [62]

You Only Look Once: Unified, Real -Time Object Detection[J].IEEE, 2016.DOI:10.1109/CVPR.2016.91

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Uni- fied, real-time object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016. doi: 10.1109/CVPR.2016.91

work page doi:10.1109/cvpr.2016.91 2016

[63] [63]

Faster R-CNN: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Processing Systems, volume 28, pp. 91–99, 2015

2015

[64] [64]

U-Net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention, volume 9351 ofLecture Notes in Computer Science, pp. 234–241. Springer, 2015. 18

2015

[65] [65]

Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical en- vironments.arXiv preprint arXiv:2405.07960, 2024

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical en- vironments.arXiv preprint arXiv:2405.07960, 2024

Pith/arXiv arXiv 2024

[66] [66]

Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records

Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C Ho, Carl Yang, and May Dongmei Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 22315–22339, 2024

2024

[67] [67]

Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, and Arvind Narayanan. CORE-Bench: Fostering the credibility of published research through a computational re- producibility agent benchmark.Transactions on Machine Learning Research, 2025. URL https://openreview.net/forum?id=BsMMc4MEGS

2025

[68] [68]

Large language models encode clinical knowledge.Nature, 620:172–180, 2023

Karan Singhal, Shekoofeh Azizi Tu, Julia Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Peter Schuh, Karan Sareen, David Winer, Denny Wilson, et al. Large language models encode clinical knowledge.Nature, 620:172–180, 2023. doi: 10.1038/s41586-023-06291-2. URL https://www.nature.com/articles/s41586-023-06291-2

work page doi:10.1038/s41586-023-06291-2 2023

[69] [69]

PaperBench: Evaluating AI’s ability to replicate AI research,

Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research,

[70] [70]

URLhttps://arxiv.org/abs/2504.01848

Pith/arXiv arXiv

[71] [71]

PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology, 2023

Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology, 2023. URLhttps://arxiv.org/abs/2305.15072

arXiv 2023

[72] [72]

MedXpertQA-MM: Multimodal expert medical question answering.https: //huggingface.co/datasets/TsinghuaC3I/MedXpertQA, 2024

TsinghuaC3I. MedXpertQA-MM: Multimodal expert medical question answering.https: //huggingface.co/datasets/TsinghuaC3I/MedXpertQA, 2024

2024

[73] [73]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assess- ment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861

work page doi:10.1109/tip.2003.819861 2004

[74] [74]

TotalSegmentator: Robust segmentation of 104 anatomical structures in CT.https://github.com/wasserth/TotalSegmentator, 2023

Jakob Wasserthal et al. TotalSegmentator: Robust segmentation of 104 anatomical structures in CT.https://github.com/wasserth/TotalSegmentator, 2023

2023

[75] [75]

OSWorld: Benchmark- ing multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmark- ing multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/...

Pith/arXiv arXiv 2024

[76] [76]

MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10:41, 2023

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. MedMNIST v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification.Scientific Data, 10:41, 2023. doi: 10.1038/s41597-022-01721-8. URL https://www.nature.com/articles/s41597-022-01721-8

work page doi:10.1038/s41597-022-01721-8 2023

[77] [77]

Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024. URL https://arxiv.org/abs/2407.15711

arXiv 2024

[78] [78]

MedFrameQA: A multi-image medical VQA benchmark for clinical reasoning, 2025

Suhao Yu, Haojin Wang, Juncheng Wu, Luyang Luo, Jingshen Wang, Cihang Xie, Pranav Rajpurkar, Carl Yang, Yang Yang, Kang Wang, Yannan Yu, and Yuyin Zhou. MedFrameQA: A multi-image medical VQA benchmark for clinical reasoning, 2025. URLhttps://arxiv. org/abs/2505.16964

arXiv 2025

[79] [79]

Muckley, Mary Bruno, Aaron De- fazio, Marc Parente, Krzysztof J

Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron De- fazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. 19 Sodick...

Pith/arXiv arXiv 2018

[80] [80]

open-source

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realis- tic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=oKn9c6ytLx. 20 APPENDIXCONTENT...

2024