arxiv: 2603.24649 · v2 · submitted 2026-03-25 · 💻 cs.CV

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan

Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical imaging agents · full-study benchmarks · evidence auditing · VLM evaluation · radiology workflows · pathology imaging · agent benchmarks

The pith

Medical agents must navigate full imaging studies and submit auditable evidence, where performance drops sharply compared to answer-only evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current medical imaging benchmarks evaluate models on isolated images or patches, which simplifies the task to visual recognition. In practice, clinicians review complete studies using imaging software, search across slices, and must document evidence that can be verified. This paper introduces MedFlowBench, a benchmark requiring agents to operate viewers like 3D Slicer on full radiology or pathology studies, return answers, and provide structured evidence checked automatically against hidden annotations. Results show that requiring correct evidence alongside answers reveals much lower performance on complex tasks. Adding analysis tools helps only when they simplify procedures, but agents still falter on managing multi-step interactions.

Core claim

The paper establishes that medical imaging agents need to produce auditable evidence from complete studies rather than just plausible answers from pre-selected images, and that MedFlowBench with MedOpenClaw reveals substantial performance gaps in evidence-supported accuracy.

What carries the argument

MedOpenClaw, a controlled runtime allowing agents to operate medical imaging viewers, combined with MedFlowBench episodes that require full study inspection and submission of verifiable evidence like key slices and regions of interest.

If this is right

Answer-only scoring overestimates agent capabilities in real workflows.
Agents require better mechanisms for choosing inputs and verifying intermediate outputs across steps.
Tool integration alone does not resolve the challenges of complex multi-step procedures.
Benchmarks must include evidence auditing to accurately assess readiness for clinical use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to testing agents in other domains requiring navigation and documentation, such as legal or financial analysis.
Developers might focus on building reliable state management for software interfaces.
If scaled, such benchmarks could inform regulatory standards for medical AI tools.

Load-bearing premise

The assumption that automated checks against withheld masks and annotations fully capture what real clinical auditing would require, and that the controlled runtime accurately represents the complexity of actual medical imaging software.

What would settle it

Observing whether agents achieve high rates of correct evidence-supported answers when tested in live clinical software environments on unseen full studies, rather than the simulated runtime.

Figures

Figures reproduced from arXiv: 2603.24649 by Benedikt Wiestler, Che Liu, Chengzhi Shen, Daguang Xu, Daniel Rueckert, Jiayuan Zhu, Jiazhen Pan, Jingpei Wu, Junde Wu, Min Xu, Weixiang Shen, Xiao Han, Yanzhu Hu, Yueming Jin, Zongyue Li.

**Figure 1.** Left: Conventional medical VQA benchmarks rely on pre-selected, diagnostically relevant 2D images as inputs. They evaluate black-box models, where neither the decision-making process nor the supporting evidence is observable. Right: In contrast, our proposed benchmark is built on our runtime, MEDOPENCLAW, which interacts with 3D Slicer through a bounded viewer interface. This setup produces a transparent r…

**Figure 2.** Representative auditable execution traces from the Brain MRI (…

Original abstract

Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.
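To make the gap between the two scoring modes concrete, here is a minimal sketch. It is not the paper's scoring code. The `EpisodeResult` type, its field names, and the four example episodes are hypothetical. Only the idea that an answer counts when it is also supported by correct evidence comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Outcome of one benchmark episode (hypothetical field names)."""
    answer_correct: bool    # task answer matched the withheld label
    evidence_correct: bool  # submitted evidence passed the automated check


def answer_only_accuracy(results):
    """Fraction of episodes with a correct answer, ignoring evidence."""
    return sum(r.answer_correct for r in results) / len(results)


def evidence_supported_accuracy(results):
    """Fraction of episodes where the answer AND the evidence are both correct."""
    return sum(r.answer_correct and r.evidence_correct for r in results) / len(results)


# Made-up episodes for illustration; these are not results from the paper.
results = [
    EpisodeResult(True, True),
    EpisodeResult(True, False),   # plausible answer, unsupported by evidence
    EpisodeResult(True, False),
    EpisodeResult(False, False),
]

print(answer_only_accuracy(results))         # 0.75
print(evidence_supported_accuracy(results))  # 0.25
```

Under this joint criterion, evidence-supported accuracy can never exceed answer-only accuracy, so any difference between the two numbers is the share of correct answers that lacked valid evidence.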

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core contribution is a benchmark that makes medical agents navigate full studies and submit checkable evidence, exposing how answer-only scores overstate real capability.

The main thing to know is that this work builds MedFlowBench and MedOpenClaw to test VLM agents on complete radiology studies and whole-slide images instead of pre-selected slices. Agents must navigate the viewer, locate relevant views, and submit structured evidence such as slices or ROIs that gets automatically verified against hidden annotations. The reported result is that requiring correct evidence causes substantial performance drops compared with answer-only scoring, and simply adding image tools does not close the gap on complex multi-step tasks. That setup is new and directly targets a real mismatch between current benchmarks and clinical practice. The controlled runtime and replayable episodes are a practical step forward for reproducible agent evaluation.

The soft spot is the automated evidence checker itself. Matching a mask or coordinate is a necessary filter, but it may still accept submissions that lack clinical context or point to the wrong lesion at the wrong scale. The abstract does not describe any human audit of the checker outputs, so it is unclear how closely the metric tracks what a radiologist would accept. If the full paper shows only technical overlap without that validation step, the performance-drop claim rests on a narrower foundation than it first appears.

This paper is aimed at groups building medical imaging agents who need harder, more auditable tests. It is worth sending to peer review because the benchmark direction is useful and the basic implementation looks workable, even if reviewers will likely press on the clinical alignment of the verification method.

Referee Report

3 major / 2 minor

Summary. The paper introduces MedOpenClaw, a controlled runtime for VLM agents to operate full medical imaging viewers (e.g., 3D Slicer, QuPath), and MedFlowBench, a benchmark requiring agents to inspect complete radiology studies or whole-slide images, return answers, and submit structured evidence (slices, coordinates, ROIs) that is automatically verified against withheld masks and annotations. The central claim is that answer-only scoring overestimates performance, with substantial drops when evidence correctness is also required, and that tool augmentation alone does not resolve difficulties in multi-step viewer navigation and state management.

Significance. If the automated evidence verification reliably proxies clinical auditability, the benchmark would expose a key limitation in current medical agents and push evaluation toward workflow realism rather than isolated recognition tasks. The work provides a reproducible runtime and falsifiable setup for testing evidence-producing agents, which is a concrete strength.

major comments (3)

§4 (Evidence Verification): The automated checker (mask/ROI overlap, coordinate matching) is presented as capturing clinical auditing needs, but no validation against human expert judgments is reported; technical correctness of a slice or bounding box does not guarantee clinical sufficiency, relevance, or completeness, directly undermining the performance-drop claim.
§5.2 (Results on complex workflows): The reported 'substantial' drops under evidence-augmented scoring lack accompanying tables with per-task breakdowns, confidence intervals, or ablation on checker thresholds; without these, it is impossible to assess whether the drops are driven by the benchmark design or by genuine agent failures.
§3.1 (MedOpenClaw runtime): The claim that the replayable environment faithfully represents real clinical software complexity rests on an untested assumption that controlled viewer state transitions match actual radiologist navigation patterns; no user-study or timing comparison is provided to support this.
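
The kind of automated check the first major comment refers to (ROI overlap plus coordinate matching) can be sketched as below. This is an illustrative assumption about how such a checker might look, not the paper's implementation. The dictionary layout, the `iou_min` and `slice_tol` defaults, and the example values are all hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_passes(submitted, reference, iou_min=0.5, slice_tol=2):
    """Hypothetical check: key slice within tolerance AND ROI overlap above threshold."""
    slice_ok = abs(submitted["slice"] - reference["slice"]) <= slice_tol
    roi_ok = box_iou(submitted["roi"], reference["roi"]) >= iou_min
    return slice_ok and roi_ok


# Made-up reference annotation and two submissions, for illustration only.
reference = {"slice": 40, "roi": (10, 10, 30, 30)}
good = {"slice": 41, "roi": (12, 12, 32, 32)}
wrong_slice = {"slice": 60, "roi": (12, 12, 32, 32)}

print(evidence_passes(good, reference))         # True
print(evidence_passes(wrong_slice, reference))  # False
```

A check of this shape is purely geometric, which is the referee's concern: a submission can pass it without being clinically sufficient or relevant.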

minor comments (2)

Figure 2 caption: the legend for tool-augmented vs. baseline agents is ambiguous about whether 'tools' include only image-analysis functions or also viewer controls.
§2 (Related Work): missing citation to prior full-study VLM benchmarks (e.g., those using MIMIC-CXR full reports) that already attempt multi-slice navigation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly where feasible to strengthen the presentation of our claims.

Referee: §4 (Evidence Verification): The automated checker (mask/ROI overlap, coordinate matching) is presented as capturing clinical auditing needs, but no validation against human expert judgments is reported; technical correctness of a slice or bounding box does not guarantee clinical sufficiency, relevance, or completeness, directly undermining the performance-drop claim.

Authors: We agree that technical verification of evidence (e.g., mask overlap or coordinate matching) is a necessary but not sufficient proxy for clinical auditability. In the revised manuscript we have added a dedicated paragraph in §4 explicitly acknowledging this limitation, clarifying that MedFlowBench evaluates verifiable evidence production rather than full clinical sufficiency, and outlining plans for future expert radiologist validation studies. The reported performance drops remain meaningful because they demonstrate that current agents frequently fail to produce even technically correct evidence on full studies.

revision: yes

Referee: §5.2 (Results on complex workflows): The reported 'substantial' drops under evidence-augmented scoring lack accompanying tables with per-task breakdowns, confidence intervals, or ablation on checker thresholds; without these, it is impossible to assess whether the drops are driven by the benchmark design or by genuine agent failures.

Authors: We have revised §5.2 to include new tables with per-task breakdowns for both answer-only and evidence-augmented scoring, along with 95% confidence intervals computed across repeated runs. An ablation on checker thresholds (varying IoU and coordinate tolerance values) has been added to the supplementary material, confirming that the magnitude of the performance drops is robust across reasonable threshold choices and is driven by agent limitations in multi-step navigation and evidence submission rather than benchmark parameterization.

revision: yes

Referee: §3.1 (MedOpenClaw runtime): The claim that the replayable environment faithfully represents real clinical software complexity rests on an untested assumption that controlled viewer state transitions match actual radiologist navigation patterns; no user-study or timing comparison is provided to support this.

Authors: MedOpenClaw is implemented directly atop the public APIs and state machines of 3D Slicer and QuPath, so agents must execute the identical sequence of viewer commands required in clinical use. While a dedicated user study comparing navigation patterns would be valuable, it lies outside the scope of the current work, whose primary contribution is a reproducible, controlled runtime for benchmarking evidence-producing agents. The environment's fidelity is evidenced by the fact that successful episodes require the same multi-step state management that real clinical workflows demand.

revision: no
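
The threshold ablation and run-level confidence intervals mentioned in the second response can be illustrated with a short sketch. The rebuttal does not say how the intervals were computed, so the percentile bootstrap here is a common choice swapped in as an assumption. The function names, the IoU values, and the per-run accuracies are hypothetical and are not taken from the paper.

```python
import random


def pass_rate(ious, threshold):
    """Share of episodes whose evidence IoU meets the given threshold."""
    return sum(i >= threshold for i in ious) / len(ious)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-episode IoU values; sweep the acceptance threshold.
ious = [0.82, 0.10, 0.55, 0.47, 0.91, 0.00, 0.63, 0.38]
sweep = {t: pass_rate(ious, t) for t in (0.3, 0.5, 0.7)}
print(sweep)  # {0.3: 0.75, 0.5: 0.5, 0.7: 0.25}

# Made-up evidence-supported accuracy from five repeated runs.
run_accuracies = [0.42, 0.38, 0.45, 0.40, 0.41]
lo, hi = bootstrap_ci(run_accuracies)
print(f"95% CI for mean accuracy: [{lo:.3f}, {hi:.3f}]")
```

If the drop from answer-only to evidence-supported accuracy stays large across the whole sweep, that supports the claim that it is not an artifact of one threshold choice.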

Circularity Check

0 steps flagged

Benchmark introduction paper with no derivations, fitted parameters, or self-referential predictions

This is a benchmark paper introducing MedFlowBench and MedOpenClaw for auditing medical agents on full-study workflows. It reports empirical performance drops when requiring evidence support versus answer-only scoring. No equations, derivations, parameter fitting, or predictions appear in the abstract or described content. Claims rest on the new evaluation setup and observed results rather than any chain that reduces to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The paper is self-contained against external benchmarks and scores 0 for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions in AI benchmarking about the value of realistic evaluation environments and the feasibility of automated evidence verification, with no free parameters, new axioms, or invented entities introduced.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
cs.LG · 2026-04 · unverdicted · novelty 6.0

GAZE framework with viewer tools and literature retrieval achieves 58.2 mAP@0.3 lesion localization and 34.9% top-1 diagnostic accuracy on 906 rare brain MRI cases in zero-shot setting, with larger gains on rarest pat...

Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
cs.AI · 2026-04 · unverdicted · novelty 5.0

Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
