pith. machine review for the scientific record.

arxiv: 2605.10286 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords: LLM agents · multimodal data · clinical prediction · multi-agent systems · benchmark · healthcare AI · risk prediction · calibration

The pith

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper systematically tests LLM-based agents on clinical tasks that combine electronic health records, medical images, radiology reports, and notes from real hospital data. It compares how a single agent performs versus groups of agents working together, in both single-modality and mixed-modality setups. The evaluation finds that single agents achieve stronger results overall, manage the combination of different data types more effectively, and produce probability estimates that are better aligned with actual outcomes. This finding matters for healthcare because patient data often stays fragmented across systems due to privacy rules, making collaborative agents an attractive idea for joint analysis without direct data sharing. If the gaps hold, it points to the need for more advanced designs that let multiple agents coordinate effectively on diverse medical inputs.

Core claim

The paper establishes that single-agent frameworks outperform naive multi-agent systems on large-scale real-world clinical prediction tasks, handling multimodal data more effectively and yielding better-calibrated outputs. This is shown through assessments in unimodal and multimodal settings using temporal EHR data, images, reports, and clinical notes. The results indicate limitations of current naive multi-agent collaboration on heterogeneous healthcare data.

What carries the argument

Benchmark evaluation comparing single-agent and multi-agent LLM frameworks on multimodal clinical risk prediction tasks using real-world hospital data.

Load-bearing premise

That the naive multi-agent systems tested accurately represent the potential of multi-agent LLM frameworks and that the selected metrics and data splits measure clinically meaningful differences.

What would settle it

Demonstration of a multi-agent LLM framework that achieves higher accuracy and better calibration than single agents on the same multimodal clinical datasets.

Figures

Figures reproduced from arXiv: 2605.10286 by Baraa Al Jorf and Farah E. Shamout.

Figure 1
Figure 1. Overview of the agentic evaluation frameworks considered within the AgentRx benchmark, spanning Single-Agent (Unimodal/Multimodal) and Multi-Agent settings. For the multimodal single-agent setting, the benchmark employs MedPatch (Jorf and Shamout, 2025), a SOTA fusion architecture that uses a confidence-guided patching mechanism to integrate heterogeneous modalities.
Original abstract

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a benchmark evaluation of LLM-based agents on multimodal clinical risk prediction tasks using large-scale real-world data (EHR, medical images, radiology reports, clinical notes). It compares single-agent frameworks against naive multi-agent systems in unimodal and multimodal settings, reporting that single agents outperform the multi-agent baselines, handle multimodal inputs more effectively, and exhibit better calibration. The work open-sources its code and evaluation framework as a new benchmark for agentic systems in healthcare.

Significance. If the central empirical findings hold after clarification of methods, this provides a useful open benchmark highlighting limitations of current naive multi-agent LLM setups for heterogeneous clinical data synthesis. The emphasis on real-world multimodal data and the call for improved collaboration mechanisms could guide future agent design in healthcare, where data fragmentation is common. Open-sourcing the code is a clear strength for reproducibility.

major comments (3)
  1. [Methods/Experimental Setup] The multi-agent systems are repeatedly labeled 'naive' (Abstract, §4), but the manuscript provides insufficient detail on the exact collaboration mechanism (e.g., independent agents with late fusion, voting, or absence of debate/tool-use/hierarchical routing). Without this, the performance gap cannot be confidently attributed to multi-agent frameworks in general rather than the specific implementation, directly undermining the claims that single agents are 'better at handling multimodal data' and 'better calibrated.'
  2. [Results, §5, Tables 2-4] No statistical tests, confidence intervals, or error bars are reported for the performance differences between single- and multi-agent systems. Combined with missing details on data exclusion criteria and exact train/test splits, this leaves the central claim only partially supported and difficult to interpret for clinical relevance.
  3. [§4.1 and §5.2] The calibration analysis and multimodal fusion strategy for the single-agent case are not described with sufficient precision (e.g., how probabilities are aggregated across modalities or which prompting strategy is used). This makes it impossible to assess whether the reported calibration advantage is robust or an artifact of unequal implementation effort between single- and multi-agent conditions.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the exact clinical prediction tasks (e.g., specific risk scores or outcomes) rather than referring generically to 'clinical prediction tasks.'
  2. [Figures/Tables] Figure captions and table footnotes should include the exact LLM backbone, temperature settings, and number of runs to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas where the manuscript can be strengthened for clarity and rigor. We address each major comment point-by-point below. Revisions will be incorporated in the next version to provide the requested details, statistical analyses, and precise descriptions while preserving the core empirical findings and benchmark contributions.

Point-by-point responses
  1. Referee: The multi-agent systems are repeatedly labeled 'naive' (Abstract, §4), but the manuscript provides insufficient detail on the exact collaboration mechanism (e.g., independent agents with late fusion, voting, or absence of debate/tool-use/hierarchical routing). Without this, the performance gap cannot be confidently attributed to multi-agent frameworks in general rather than the specific implementation, directly undermining the claims that single agents are 'better at handling multimodal data' and 'better calibrated.'

    Authors: We agree that the current description of the multi-agent setup lacks sufficient specificity. In the revised manuscript, we will expand Section 4 to explicitly define the 'naive' multi-agent system as one in which independent agents each process a single modality (EHR, images, reports, notes) in isolation, with no inter-agent debate, tool use, or hierarchical routing. Final predictions are obtained via late fusion by averaging the probability outputs across agents. This basic implementation was intentionally chosen to benchmark against more advanced collaboration strategies. We will update the abstract, Section 4, and discussion to frame our claims as applying to this naive setup, thereby underscoring the need for improved multi-agent mechanisms rather than generalizing against all multi-agent approaches. revision: yes
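
A minimal sketch of the late-fusion baseline the authors describe above, for illustration only; the `query_agent`-style callables and modality keys are hypothetical stand-ins, not the paper's actual interface:

```python
# Sketch of the 'naive' multi-agent baseline as described: one independent
# agent per modality, no debate, tool use, or routing; late fusion by
# averaging the per-agent probability outputs.
from typing import Callable, Dict

# Hypothetical per-modality agent: maps a raw modality payload -> P(outcome).
AgentFn = Callable[[str], float]

def naive_multi_agent_predict(patient: Dict[str, str],
                              agents: Dict[str, AgentFn]) -> float:
    """Each agent sees only its own modality; predictions are averaged."""
    probs = [agents[modality](payload) for modality, payload in patient.items()]
    return sum(probs) / len(probs)  # late fusion: unweighted mean

# Illustrative usage with dummy agents (a real system would call an LLM here):
agents = {m: (lambda text: 0.5) for m in ("ehr", "image", "report", "notes")}
patient = {"ehr": "...", "image": "...", "report": "...", "notes": "..."}
risk = naive_multi_agent_predict(patient, agents)
```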

  2. Referee: No statistical tests, confidence intervals, or error bars are reported for the performance differences between single- and multi-agent systems. Combined with missing details on data exclusion criteria and exact train/test splits, this leaves the central claim only partially supported and difficult to interpret for clinical relevance.

    Authors: We acknowledge the omission of statistical support and data transparency details. In the revision, we will add 95% bootstrap confidence intervals and error bars to all metrics in Tables 2-4, along with appropriate statistical tests (McNemar's test for binary outcomes and paired t-tests for calibration metrics) to assess significance of differences. We will also expand Section 3 with a clear description of exclusion criteria (e.g., patients lacking complete multimodal records or with insufficient follow-up time) and the exact stratified 70/30 train/test splits used across datasets, including summary statistics on cohort sizes. These additions will improve interpretability and support clinical relevance assessment. revision: yes
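
As a sketch of the proposed statistical additions, a 95% percentile-bootstrap confidence interval for the single- vs multi-agent gap might look like the following; the metric choice and bootstrap size here are illustrative assumptions, not the authors' specification:

```python
import numpy as np

def bootstrap_ci_of_gap(y_true, p_single, p_multi,
                        metric, n_boot=10_000, seed=0):
    """95% percentile-bootstrap CI for metric(single) - metric(multi),
    resampling patients with replacement (paired across conditions)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y_true)
    p_s, p_m = np.asarray(p_single), np.asarray(p_multi)
    n = len(y)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the same patients for both
        gaps[b] = metric(y[idx], p_s[idx]) - metric(y[idx], p_m[idx])
    return np.percentile(gaps, [2.5, 97.5])

# Example with Brier score as the metric (lower is better):
brier = lambda y, p: np.mean((p - y) ** 2)
# lo, hi = bootstrap_ci_of_gap(y_true, p_single, p_multi, brier)
```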

  3. Referee: The calibration analysis and multimodal fusion strategy for the single-agent case are not described with sufficient precision (e.g., how probabilities are aggregated across modalities or which prompting strategy is used). This makes it impossible to assess whether the reported calibration advantage is robust or an artifact of unequal implementation effort between single- and multi-agent conditions.

    Authors: We will revise Sections 4.1 and 5.2 to include precise implementation details. For the single-agent multimodal case, modalities are fused by concatenating modality-specific textual representations (EHR summaries, vision-model image captions, full reports and notes) into one structured prompt; the LLM is prompted to output class probabilities directly, with post-hoc temperature scaling applied for calibration. Calibration is quantified via Expected Calibration Error (10 bins) and Brier score. The multi-agent condition uses per-modality agents with probability averaging, and we will confirm equivalent prompting and post-processing effort across conditions. Pseudocode and example prompts will be added to the appendix to ensure full reproducibility and allow readers to evaluate robustness. revision: yes
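
For concreteness, a minimal sketch of the two calibration metrics named in the response above (Expected Calibration Error with 10 equal-width bins, and the Brier score) for the binary case; the binning convention is the common one and may differ in detail from the paper's implementation:

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE: bin predictions into equal-width probability bins, then take the
    bin-size-weighted gap between mean predicted probability and the
    empirical positive rate within each bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so p = 1.0 is counted.
        mask = (p_pred >= lo) & (p_pred < hi) if hi < 1.0 else (p_pred >= lo)
        if mask.any():
            ece += mask.mean() * abs(p_pred[mask].mean() - y_true[mask].mean())
    return ece

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and binary outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))
```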

Circularity Check

0 steps flagged

Empirical benchmark study with no circular derivations

Full rationale

This paper is a systematic empirical benchmark evaluating LLM agents on multimodal clinical prediction tasks with real-world data. All claims, including single-agent outperformance over naive multi-agent systems in handling multimodal inputs and calibration, rest on directly measured performance metrics from implemented experiments rather than any derivation chain, equations, fitted parameters presented as predictions, or self-referential reductions. No load-bearing steps invoke self-citations for uniqueness theorems, ansatzes, or renamings of known results; the work is grounded in external benchmarks, with open-sourced code and data splits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work relies on standard machine learning evaluation practices and assumptions about data representativeness in clinical settings.

axioms (1)
  • domain assumption: Real-world clinical data is representative for evaluating general agent performance
    Implicit in using large-scale hospital data as the basis for claims about multimodal prediction

pith-pipeline@v0.9.0 · 5486 in / 1069 out tokens · 32231 ms · 2026-05-12T05:05:56.186329+00:00 · methodology


Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · 15 internal anchors

  1. [1] Single-agent or Multi-agent Systems? Why Not Both? arXiv.org.
  2. [2] Why Do Multi-Agent LLM Systems Fail? arXiv.org.
  3. [3] Towards a… arXiv.org.
  4. [4] arXiv.org.
  5. [6] Tan, Mingtian, et al. Are language models actually useful for time series forecasting? Proceedings of the 38th…
  6. [7] Zhang, Jiayi, et al.
  7. [8] Zhuge, Mingchen, et al. Proceedings of the 41st…
  8. [9] Jorf, Baraa Al; Shamout, Farah E. MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data. Proceedings of the 10th Machine Learning for Healthcare Conference.
  9. [10] Hayat, Nasir; Geras, Krzysztof J.; Shamout, Farah E. MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. Proceedings of the 7th Machine Learning for Healthcare Conference, December 2022.
  10. [15] Large language models in real-world clinical workflows: a systematic review of applications and implementation. Frontiers in Digital Health, 2025. doi:10.3389/fdgth.2025.1659134.
  11. [17] Kim, Yubin, et al. Proceedings of the 38th…
  12. [18] Next-generation agentic… Informatics and Health, 2025. doi:10.1016/j.infoh.2025.03.001.
  13. [19] Trad, Fouad; Chehab, Ali. To… doi:10.1007/978-3-031-82150-9_13.
  14. [21] Language… arXiv.org.
  15. [22] Self-… arXiv.org.
  16. [23] Wei, Jason, et al. Chain-of-thought prompting elicits reasoning in large language models. Proceedings of the 36th…
  17. [24] Lewis, Patrick, et al. Retrieval-augmented generation for knowledge-intensive… Proceedings of the 34th…
  18. [25] Large language models in radiology reporting… Intelligence-Based Medicine, 2025. doi:10.1016/j.ibmed.2025.100287.
  19. [26] Clinician… Hospitals, 2026. doi:10.3390/hospitals3010001.
  20. [33] Chen, Justin; Saha, Swarnadeep; Bansal, Mohit. Proceedings of the 62nd…, 2024. doi:10.18653/v1/2024.acl-long.381.
  21. [34] Du, Yilun, et al. Improving factuality and reasoning in language models through multiagent debate. Proceedings of the 41st International Conference on Machine Learning, 2024.
  22. [35] Johnson, Alistair; Pollard, Tom; Horng, Steven; Celi, Leo Anthony; Mark, Roger. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet, January 2023. doi:10.13026/1N74-NE17.
  23. [38] Elsharief, Shaza, et al. MedMod: Multimodal Benchmark for Medical Prediction Tasks with Electronic Health Records and Chest X-Ray Scans. Proceedings of the Sixth Conference on Health, Inference, and Learning, July 2025.
  24. [39] Advances in Neural Information Processing Systems, 2023.
  25. [40] microsoft/llava-med-v1.5-mistral-7b.
  26. [41] Li, Chunyuan, et al. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, June 2023. doi:10.48550/arXiv.2306.00890.
  27. [42] Chen, Junying, et al. HuatuoGPT-Vision: Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. September 2024. doi:10.48550/arXiv.2406.19280.
  28. [43] Bai, Shuai, et al. Qwen2.5-VL Technical Report.
  29. [44] Chen, Zhe, et al. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.
  30. [45] Nori, Harsha, et al. Can…
  31. [46] Brown, Tom, et al. Language…
  32. [47] Madaan, Aman, et al. Self-… doi:10.48550/ar…
  33. [48] Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction, 2025. doi:10.48550/arXiv.2510.10454.
  34. [49] Hou, Yutai, et al. MetaPrompting: Learning to Learn Better Prompts. Proceedings of the 29th…, 2022.
  35. [51] Zhu, Yinghao, et al.
  36. [54] Huang, Kexin; Altosaar, Jaan; Ranganath, Rajesh. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. April 2019. doi:10.48550/arXiv.1904.05342.
  37. [55] Liu, Sizhe, et al.
  38. [56] Wu, Jinlin; Liang, Xusheng; Bai, Xuexue; Chen, Zhen. 2024. doi:10.1109/BigData62323.2024.10825748.
  39. [57] Moghani, Masoud, et al. 2024. doi:10.1109/IROS58592.2024.10802053.
  40. [58] Hoopes, Andrew; Dey, Neel; Butoi, Victor Ion; Guttag, John V.; Dalca, Adrian V. doi:10.48550/arXiv.2410.08397.
  41. [59] Chen, Xi, et al. Enhancing diagnostic capability with multi-agents conversational large language models. npj Digital Medicine.
  42. [60] Yue, Ling; Xing, Sixue; Chen, Jintai; Fu, Tianfan. Proceedings of the 15th… doi:10.1145/3698587.3701359.
  43. [61] Li, Rumeng, et al. npj Digital Medicine. doi:10.1038/s41746-025-01940-4.
  44. [62] Li, Binxu, et al. MMedAgent: Learning to use medical tools with multi-modal agent. Findings of the…, 2024. doi:10.18653/v1/2024.findings-emnlp.510.
  45. [64] OpenAI; Achiam, Josh; et al.
  46. [66] Lee, Kwanhyung, et al. Learning… Proceedings of the 8th…
  47. [67] Ovis2.5 Technical Report, 2025.
  48. [68] Sellergren, Andrew, et al. MedGemma Technical Report.
  49. [69] Kalpelbe, Beria Chingnabe; Adaambiik, Angel Gabriel; Peng, Wei. Vision… doi:10.48550/arXiv.2503.01863.
  50. [70] Chen, Emma, et al. Multimodal Clinical Benchmark for Emergency Care (MC-BEC): A Comprehensive Benchmark for Evaluating Foundation Models in Emergency Medicine. Advances in Neural Information Processing Systems, 2023.
  51. [73] Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. JMLR.
  52. [74] Guyon, I.; Aliferis, C.; Elisseeff, A.
  53. [75] Johnson, Alistair E. W., et al. doi:10.1038/sdata.2016.35.
  54. [76] Johnson, Alistair E. W.; Pollard, Tom J.; Mark, Roger G. doi:10.13026/C2XW26.
  55. [77] Acharya, Angeela, et al. Clinical risk prediction using language models: benefits and considerations. Journal of the American Medical Informatics Association (JAMIA), 31(9):1856–1864, September 2024. doi:10.1093/jamia/ocae030.
  56. [78] Afshar, Majid, et al.
  57. [79] Jorf, Baraa Al; Piechowski-Jozwiak, Bartlomiej; Shamout, Farah E. A data-centric perspective on designing AI foundation models for healthcare. Frontiers in Digital Health, 8, 2026. doi:10.3389/fdgth.2026.1738523. URL https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2026.1738523.
  58. [80] Bae, Seongsu, et al. EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images. Advances in Neural Information Processing Systems, 36:3867–3880, December 2023.
  59. [81] Bai, Shuai, et al. Qwen2.5-VL Technical Report.
  60. [82] Bicknell, Brenton T., et al. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Medical Education, 10…
  61. [83] Brown, Tom B., et al.
  62. [84] Catalina, Queralt Miró, et al. Knowledge and perception of primary care healthcare professionals on the use of artificial intelligence as a healthcare tool. DIGITAL HEALTH, 9:20552076231180511, January 2023. doi:10.1177/…
  63. [85] Cemri, Mert, et al. Why Do Multi-Agent LLM Systems Fail? March 2025. URL https://arxiv.org/abs/2503.13657v3.
  64. [86] Chen, Emma, et al. Multimodal Clinical Benchmark for Emergency Care (MC-BEC): A Comprehensive Benchmark for Evaluating Foundation Models in Emergency Medicine. Advances in Neural Information Processing Systems.
  65. [87] Chen, Junde, et al. Multi-modal learning for inpatient length of stay prediction. Computers in Biology and Medicine, 171:108121, March 2024. doi:10.1016/j.compbiomed.2024.108121.
  66. [88] Chen, Junying, et al. HuatuoGPT-Vision: Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. September 2024. arXiv:2406.19280.
  67. [89] Chen, Zhe, et al. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.
  68. [90] Du, Yilun; Li, Shuang; Torralba, Antonio; Tenenbaum, Joshua B.; Mordatch, Igor. Improving factuality and reasoning in language models through multiagent debate. Proceedings of the 41st International Conference on Machine Learning (ICML '24), pages 11733–11763, Vienna, Austria, 2024. JMLR.org.
  69. [91] Elsharief, Shaza, et al. MedMod: Multimodal Benchmark for Medical Prediction Tasks with Electronic Health Records and Chest X-Ray Scans. Proceedings of the Sixth Conference on Health, Inference, and Learning, pages 781–803. PMLR, July 2025.
  70. [92] Gao, Mingyan, et al. Single-agent or Multi-agent Systems? Why Not Both? May 2025. URL https://arxiv.org/abs/2505.18286v1.
  71. [93] Hayat, Nasir; Geras, Krzysztof J.; Shamout, Farah E. MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. Proceedings of the 7th Machine Learning for Healthcare Conference, pages 479–503. PMLR, December 2022.
  72. [94] Hong, Sirui, et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. August 2023. URL https://arxiv.org/abs/2308.00352v7.
  73. [95] Hou, Yutai; Dong, Hongyuan; Wang, Xinghao; Li, Bohan; Che, Wanxiang. MetaPrompting: Learning to Learn Better Prompts. Proceedings of the 29th…, 2022.
  74. [96] Huang, Kexin; Altosaar, Jaan; Ranganath, Rajesh. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. April 2019. URL https://ui.adsabs.harvard.edu/abs/2019arXiv190405342H.
  75. [97] Jin, Qiao, et al. AgentMD: Empowering language agents for risk prediction with large-scale clinical tool learning. Nature Communications, 16(1):9377, October 2025.
  76. [98] Johnson, Alistair, et al. MIMIC-IV-Note: Deidentified free-text clinical notes, 2023. URL https://physionet.org/content/mimic-iv-note/2.2/.
  77. [99] Johnson, Alistair E. W., et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, December 2019. doi:10.1038/s41597-019-0322-0.
  78. [100] Johnson, Alistair E. W., et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, January 2023. doi:10.1038/s41597-022-01…
  79. [101] Jorf, Baraa Al; Shamout, Farah E. MedPatch: Confidence-guided multi-stage fusion for multimodal clinical data. Proceedings of the 10th Machine Learning for Healthcare Conference, volume 298 of Proceedings of Machine Learning Research. PMLR.
  80. [102] Kaesberg, Lars Benedikt, et al. Voting or Consensus? Decision-Making in Multi-Agent Debate. Findings of the Association for Computational Linguistics: ACL 2025, pages 11640–11671, Vienna, Austria, July 2025.
Showing first 80 references.