MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3
The pith
Medical agents must navigate full imaging studies and submit auditable evidence, where performance drops sharply compared to answer-only evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that medical imaging agents need to produce auditable evidence from complete studies rather than just plausible answers from pre-selected images, and that MedFlowBench with MedOpenClaw reveals substantial performance gaps in evidence-supported accuracy.
What carries the argument
MedOpenClaw, a controlled runtime allowing agents to operate medical imaging viewers, combined with MedFlowBench episodes that require full study inspection and submission of verifiable evidence like key slices and regions of interest.
If this is right
- Answer-only scoring overestimates agent capabilities in real workflows.
- Agents require better mechanisms for choosing inputs and verifying intermediate outputs across steps.
- Tool integration alone does not resolve the challenges of complex multi-step procedures.
- Benchmarks must include evidence auditing to accurately assess readiness for clinical use.
Where Pith is reading between the lines
- This approach could extend to testing agents in other domains requiring navigation and documentation, such as legal or financial analysis.
- Developers might focus on building reliable state management for software interfaces.
- If scaled, such benchmarks could inform regulatory standards for medical AI tools.
Load-bearing premise
The assumption that automated checks against withheld masks and annotations fully capture what real clinical auditing would require, and that the controlled runtime accurately represents the complexity of actual medical imaging software.
What would settle it
Observing whether agents achieve high rates of correct evidence-supported answers when tested in live clinical software environments on unseen full studies, rather than the simulated runtime.
Original abstract
Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.
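The gap the abstract describes between answer-only scoring and evidence-supported scoring can be illustrated with a minimal sketch. It assumes that an episode counts as evidence-supported only when both the answer and the submitted evidence are correct; the abstract does not spell out the exact rule, and the `Episode` type, its field names, and the four sample episodes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode outcome; not the benchmark's actual schema.
    answer_correct: bool    # task answer matches the withheld label
    evidence_correct: bool  # submitted evidence passes the automated check


def answer_only_accuracy(episodes):
    """Fraction of episodes whose final answer is correct."""
    return sum(e.answer_correct for e in episodes) / len(episodes)


def evidence_supported_accuracy(episodes):
    """Fraction of episodes with a correct answer AND correct evidence.

    Assumption: the two conditions are combined by conjunction.
    """
    return sum(e.answer_correct and e.evidence_correct for e in episodes) / len(episodes)


# Illustrative data only, not results from the paper.
episodes = [
    Episode(answer_correct=True, evidence_correct=True),
    Episode(answer_correct=True, evidence_correct=False),
    Episode(answer_correct=True, evidence_correct=False),
    Episode(answer_correct=False, evidence_correct=False),
]

print(answer_only_accuracy(episodes))         # 0.75
print(evidence_supported_accuracy(episodes))  # 0.25
```

Under a conjunction rule the evidence-supported score can never exceed the answer-only score, so the size of the gap is the informative quantity.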
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedOpenClaw, a controlled runtime for VLM agents to operate full medical imaging viewers (e.g., 3D Slicer, QuPath), and MedFlowBench, a benchmark requiring agents to inspect complete radiology studies or whole-slide images, return answers, and submit structured evidence (slices, coordinates, ROIs) that is automatically verified against withheld masks and annotations. The central claim is that answer-only scoring overestimates performance, with substantial drops when evidence correctness is also required, and that tool augmentation alone does not resolve difficulties in multi-step viewer navigation and state management.
Significance. If the automated evidence verification reliably proxies clinical auditability, the benchmark would expose a key limitation in current medical agents and push evaluation toward workflow realism rather than isolated recognition tasks. The work provides a reproducible runtime and falsifiable setup for testing evidence-producing agents, which is a concrete strength.
major comments (3)
- §4 (Evidence Verification): The automated checker (mask/ROI overlap, coordinate matching) is presented as capturing clinical auditing needs, but no validation against human expert judgments is reported; technical correctness of a slice or bounding box does not guarantee clinical sufficiency, relevance, or completeness, directly undermining the performance-drop claim.
- §5.2 (Results on complex workflows): The reported 'substantial' drops under evidence-augmented scoring lack accompanying tables with per-task breakdowns, confidence intervals, or ablation on checker thresholds; without these, it is impossible to assess whether the drops are driven by the benchmark design or by genuine agent failures.
- §3.1 (MedOpenClaw runtime): The claim that the replayable environment faithfully represents real clinical software complexity rests on an untested assumption that controlled viewer state transitions match actual radiologist navigation patterns; no user-study or timing comparison is provided to support this.
minor comments (2)
- Figure 2 caption: the legend for tool-augmented vs. baseline agents is ambiguous about whether 'tools' include only image-analysis functions or also viewer controls.
- §2 (Related Work): missing citation to prior full-study VLM benchmarks (e.g., those using MIMIC-CXR full reports) that already attempt multi-slice navigation.
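The geometric checks named in the first major comment (ROI overlap and coordinate matching) can be sketched as follows. This is an illustration under stated assumptions, not the paper's checker: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the default thresholds are all hypothetical.

```python
import math


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_matches(submitted, reference, tolerance):
    """True when the submitted coordinate lies within `tolerance` of the reference."""
    return math.dist(submitted, reference) <= tolerance


def evidence_passes(sub_box, ref_box, sub_point, ref_point,
                    iou_threshold=0.5, tolerance=5.0):
    """Hypothetical pass rule: both the ROI and the coordinate must match."""
    return (box_iou(sub_box, ref_box) >= iou_threshold
            and point_matches(sub_point, ref_point, tolerance))


# Two 10x10 boxes offset by 5 units overlap in a 5x10 strip: IoU = 50 / 150.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A check of this kind is purely geometric, which is the referee's point: a box can clear the threshold without being the clinically relevant finding.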
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly where feasible to strengthen the presentation of our claims.
Point-by-point responses
- Referee: §4 (Evidence Verification): The automated checker (mask/ROI overlap, coordinate matching) is presented as capturing clinical auditing needs, but no validation against human expert judgments is reported; technical correctness of a slice or bounding box does not guarantee clinical sufficiency, relevance, or completeness, directly undermining the performance-drop claim.
  Authors: We agree that technical verification of evidence (e.g., mask overlap or coordinate matching) is a necessary but not sufficient proxy for clinical auditability. In the revised manuscript we have added a dedicated paragraph in §4 explicitly acknowledging this limitation, clarifying that MedFlowBench evaluates verifiable evidence production rather than full clinical sufficiency, and outlining plans for future expert radiologist validation studies. The reported performance drops remain meaningful because they demonstrate that current agents frequently fail to produce even technically correct evidence on full studies.
  Revision: yes
- Referee: §5.2 (Results on complex workflows): The reported 'substantial' drops under evidence-augmented scoring lack accompanying tables with per-task breakdowns, confidence intervals, or ablation on checker thresholds; without these, it is impossible to assess whether the drops are driven by the benchmark design or by genuine agent failures.
  Authors: We have revised §5.2 to include new tables with per-task breakdowns for both answer-only and evidence-augmented scoring, along with 95% confidence intervals computed across repeated runs. An ablation on checker thresholds (varying IoU and coordinate tolerance values) has been added to the supplementary material, confirming that the magnitude of the performance drops is robust across reasonable threshold choices and is driven by agent limitations in multi-step navigation and evidence submission rather than benchmark parameterization.
  Revision: yes
- Referee: §3.1 (MedOpenClaw runtime): The claim that the replayable environment faithfully represents real clinical software complexity rests on an untested assumption that controlled viewer state transitions match actual radiologist navigation patterns; no user-study or timing comparison is provided to support this.
  Authors: MedOpenClaw is implemented directly atop the public APIs and state machines of 3D Slicer and QuPath, so agents must execute the identical sequence of viewer commands required in clinical use. While a dedicated user study comparing navigation patterns would be valuable, it lies outside the scope of the current work, whose primary contribution is a reproducible, controlled runtime for benchmarking evidence-producing agents. The environment's fidelity is evidenced by the fact that successful episodes require the same multi-step state management that real clinical workflows demand.
  Revision: no
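The two analyses promised in the §5.2 response, a checker-threshold ablation and 95% confidence intervals across repeated runs, could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the sample IoU values, and the use of a normal approximation (1.96 standard errors) for the interval, which the rebuttal does not specify.

```python
import math
import statistics


def pass_rate(ious, threshold):
    """Fraction of submitted ROIs whose IoU with the reference meets the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def threshold_sweep(ious, thresholds):
    """Pass rate at each candidate IoU threshold (the ablation)."""
    return {t: pass_rate(ious, t) for t in thresholds}


def mean_ci95(run_scores):
    """Normal-approximation 95% interval for the mean score over repeated runs.

    Assumption: with only a handful of runs a t-based interval would be wider.
    """
    mean = statistics.mean(run_scores)
    half_width = 1.96 * statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean - half_width, mean + half_width


# Illustrative values only, not results from the paper.
ious = [0.10, 0.35, 0.55, 0.80]
sweep = threshold_sweep(ious, [0.3, 0.5, 0.7])
print(sweep)  # {0.3: 0.75, 0.5: 0.5, 0.7: 0.25}

low, high = mean_ci95([0.4, 0.5, 0.6])
print(low, high)
```

If the pass rate moves sharply between neighbouring thresholds, the reported drop depends on checker parameterization; a flat curve supports the authors' robustness claim.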
Circularity Check
Benchmark introduction paper with no derivations, fitted parameters, or self-referential predictions
This is a benchmark paper introducing MedFlowBench and MedOpenClaw for auditing medical agents on full-study workflows. It reports empirical performance drops when requiring evidence support versus answer-only scoring. No equations, derivations, parameter fitting, or predictions appear in the abstract or described content. Claims rest on the new evaluation setup and observed results rather than any chain that reduces to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The paper is self-contained against external benchmarks and scores 0 for circularity.
Forward citations
Cited by 2 Pith papers
- GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
  GAZE framework with viewer tools and literature retrieval achieves 58.2 mAP@0.3 lesion localization and 34.9% top-1 diagnostic accuracy on 906 rare brain MRI cases in zero-shot setting, with larger gains on rarest pat...
- Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
  Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.