Recognition: no theorem link
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Pith reviewed 2026-05-12 05:05 UTC · model grok-4.3
The pith
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that single-agent frameworks outperform naive multi-agent systems on large-scale, real-world clinical prediction tasks: they handle multimodal data more effectively and yield better-calibrated outputs. This is shown through evaluations in unimodal and multimodal settings using temporal EHR data, medical images, radiology reports, and clinical notes. The results point to limitations of current naive multi-agent collaboration over heterogeneous healthcare data.
What carries the argument
Benchmark evaluation comparing single-agent and multi-agent LLM frameworks on multimodal clinical risk prediction tasks using real-world hospital data.
Load-bearing premise
That the naive multi-agent systems tested accurately represent the potential of multi-agent LLM frameworks and that the selected metrics and data splits measure clinically meaningful differences.
What would settle it
Demonstration of a multi-agent LLM framework that achieves higher accuracy and better calibration than single agents on the same multimodal clinical datasets.
Original abstract
Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a benchmark evaluation of LLM-based agents on multimodal clinical risk prediction tasks using large-scale real-world data (EHR, medical images, radiology reports, clinical notes). It compares single-agent frameworks against naive multi-agent systems in unimodal and multimodal settings, reporting that single agents outperform the multi-agent baselines, handle multimodal inputs more effectively, and exhibit better calibration. The work open-sources its code and evaluation framework as a new benchmark for agentic systems in healthcare.
Significance. If the central empirical findings hold after clarification of methods, this provides a useful open benchmark highlighting limitations of current naive multi-agent LLM setups for heterogeneous clinical data synthesis. The emphasis on real-world multimodal data and the call for improved collaboration mechanisms could guide future agent design in healthcare, where data fragmentation is common. Open-sourcing the code is a clear strength for reproducibility.
major comments (3)
- [Methods/Experimental Setup] The multi-agent systems are repeatedly labeled 'naive' (Abstract, §4), but the manuscript provides insufficient detail on the exact collaboration mechanism (e.g., independent agents with late fusion, voting, or the absence of debate, tool use, or hierarchical routing). Without this, the performance gap cannot be confidently attributed to multi-agent frameworks in general rather than to this specific implementation, which directly undermines the claims that single agents are 'better at handling multimodal data' and 'better calibrated.'
- [Results] (§5, Tables 2-4): No statistical tests, confidence intervals, or error bars are reported for the performance differences between single- and multi-agent systems. Combined with missing details on data exclusion criteria and exact train/test splits, this leaves the central claim only partially supported and difficult to interpret for clinical relevance.
- [§4.1 and §5.2] The calibration analysis and multimodal fusion strategy for the single-agent case are not described with sufficient precision (e.g., how probabilities are aggregated across modalities or which prompting strategy is used). This makes it impossible to assess whether the reported calibration advantage is robust or an artifact of unequal implementation effort between the single- and multi-agent conditions.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the exact clinical prediction tasks (e.g., specific risk scores or outcomes) rather than referring generically to 'clinical prediction tasks.'
- [Figures/Tables] Figure captions and table footnotes should include the exact LLM backbone, temperature settings, and number of runs to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified important areas where the manuscript can be strengthened for clarity and rigor. We address each major comment point-by-point below. Revisions will be incorporated in the next version to provide the requested details, statistical analyses, and precise descriptions while preserving the core empirical findings and benchmark contributions.
Point-by-point responses
-
Referee: The multi-agent systems are repeatedly labeled 'naive' (Abstract, §4), but the manuscript provides insufficient detail on the exact collaboration mechanism (e.g., independent agents with late fusion, voting, or absence of debate/tool-use/hierarchical routing). Without this, the performance gap cannot be confidently attributed to multi-agent frameworks in general rather than the specific implementation, directly undermining the claims that single agents are 'better at handling multimodal data' and 'better calibrated.'
Authors: We agree that the current description of the multi-agent setup lacks sufficient specificity. In the revised manuscript, we will expand Section 4 to explicitly define the 'naive' multi-agent system as one in which independent agents each process a single modality (EHR, images, reports, notes) in isolation, with no inter-agent debate, tool use, or hierarchical routing. Final predictions are obtained via late fusion by averaging the probability outputs across agents. This basic implementation was intentionally chosen as a baseline against which more advanced collaboration strategies can be measured. We will update the abstract, Section 4, and the discussion to frame our claims as applying to this naive setup, underscoring the need for improved multi-agent mechanisms rather than generalizing to all multi-agent approaches. revision: yes
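The late-fusion rule described in this response (per-modality agents, probabilities averaged) can be sketched in a few lines. This is a minimal illustration under our own naming and toy numbers, not the authors' code:

```python
import numpy as np

def late_fusion_average(agent_probs):
    """Fuse independent per-modality agents by averaging their
    class-probability vectors (renormalized for numerical safety)."""
    fused = np.mean(np.stack(agent_probs, axis=0), axis=0)
    return fused / fused.sum()

# Toy example: three modality agents (EHR, image, notes) on a binary outcome.
ehr = np.array([0.7, 0.3])
img = np.array([0.4, 0.6])
note = np.array([0.6, 0.4])
fused = late_fusion_average([ehr, img, note])
```

Because averaging ignores per-agent reliability, a single poorly calibrated modality agent can drag the fused estimate, which is consistent with the paper's finding that this naive scheme underperforms a single multimodal agent.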
-
Referee: No statistical tests, confidence intervals, or error bars are reported for the performance differences between single- and multi-agent systems. Combined with missing details on data exclusion criteria and exact train/test splits, this leaves the central claim only partially supported and difficult to interpret for clinical relevance.
Authors: We acknowledge the omission of statistical support and data transparency details. In the revision, we will add 95% bootstrap confidence intervals and error bars to all metrics in Tables 2-4, along with appropriate statistical tests (McNemar's test for binary outcomes and paired t-tests for calibration metrics) to assess significance of differences. We will also expand Section 3 with a clear description of exclusion criteria (e.g., patients lacking complete multimodal records or with insufficient follow-up time) and the exact stratified 70/30 train/test splits used across datasets, including summary statistics on cohort sizes. These additions will improve interpretability and support clinical relevance assessment. revision: yes
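The promised percentile-bootstrap confidence interval for a single- vs. multi-agent performance difference can be sketched as follows. Function names and the synthetic labels are ours; the authors' exact resampling protocol may differ:

```python
import numpy as np

def bootstrap_ci(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the accuracy difference (model A - model B),
    resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # one bootstrap resample of the cohort
        acc_a = np.mean(pred_a[idx] == y_true[idx])
        acc_b = np.mean(pred_b[idx] == y_true[idx])
        diffs.append(acc_a - acc_b)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy cohort: model A is perfect, model B is wrong on half the patients.
y = np.zeros(200, dtype=int)
pred_single = np.zeros(200, dtype=int)
pred_multi = np.r_[np.zeros(100), np.ones(100)].astype(int)
lo, hi = bootstrap_ci(y, pred_single, pred_multi)
```

If the interval excludes zero, the accuracy gap is unlikely to be a resampling artifact; McNemar's test on the paired correct/incorrect table is the natural complement for binary outcomes.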
-
Referee: The calibration analysis and multimodal fusion strategy for the single-agent case are not described with sufficient precision (e.g., how probabilities are aggregated across modalities or which prompting strategy is used). This makes it impossible to assess whether the reported calibration advantage is robust or an artifact of unequal implementation effort between single- and multi-agent conditions.
Authors: We will revise Sections 4.1 and 5.2 to include precise implementation details. For the single-agent multimodal case, modalities are fused by concatenating modality-specific textual representations (EHR summaries, vision-model image captions, full reports and notes) into one structured prompt; the LLM is prompted to output class probabilities directly, with post-hoc temperature scaling applied for calibration. Calibration is quantified via Expected Calibration Error (10 bins) and Brier score. The multi-agent condition uses per-modality agents with probability averaging, and we will confirm equivalent prompting and post-processing effort across conditions. Pseudocode and example prompts will be added to the appendix to ensure full reproducibility and allow readers to evaluate robustness. revision: yes
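The calibration metrics named here (Expected Calibration Error with 10 bins, Brier score) have standard binary-classification forms, sketched below. This is our own minimal implementation for illustration, not the authors' evaluation code:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: occupancy-weighted mean |accuracy - confidence| over
    equal-width confidence bins (binary case; probs = P(y=1))."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    preds = (probs >= 0.5).astype(int)
    conf = np.where(probs >= 0.5, probs, 1 - probs)  # confidence of predicted class
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = np.mean(preds[mask] == labels[mask])
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

def brier_score(probs, labels):
    """Mean squared error between predicted P(y=1) and the binary outcome."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    return np.mean((probs - labels) ** 2)
```

Reporting both is useful because ECE measures miscalibration directly while the Brier score mixes calibration with discrimination; post-hoc temperature scaling, as the authors propose, lowers the former without changing the model's ranking of patients.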
Circularity Check
Empirical benchmark study with no circular derivations
full rationale
This paper is a systematic empirical benchmark evaluating LLM agents on multimodal clinical prediction tasks with real-world data. All claims, including the outperformance of single agents over naive multi-agent systems in multimodal handling and calibration, rest on directly measured performance metrics from implemented experiments rather than on a derivation chain, equations, fitted parameters presented as predictions, or self-referential reductions. No load-bearing steps invoke self-citations for uniqueness theorems, ansatzes, or renamings of known results; the work can be checked against external benchmarks via the open-sourced code and data splits.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real-world clinical data is representative for evaluating general agent performance.