RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records
Pith reviewed 2026-05-19 03:53 UTC · model grok-4.3
The pith
Agent tools let small quantized models extract rare diseases from noisy clinical notes without any task-specific training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RDMA shows that an agentic framework supplying abbreviation resolution, implicit phenotype reasoning, and ontology grounding tools allows a small quantized LLM to outperform fine-tuned and RAG baselines on rare-disease extraction from real-world clinical notes without task-specific training or large labeled data.
What carries the argument
The RDMA agentic framework, which equips the model with callable tools for abbreviation resolution, phenotype reasoning, and ontology grounding against Orphanet and HPO.
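A minimal sketch of what "callable tools" could look like, assuming a plain Python registry. Everything here is hypothetical: the function names, the tiny lookup tables, and the `DEMO:` identifiers are placeholders rather than real Orphanet or HPO codes, and the fixed call order stands in for the LLM's own tool selection. Implicit phenotype reasoning is left out because it needs a model call.

```python
import re
from typing import Callable, Dict, List

# Hypothetical toy resources. A real system would ground against Orphanet and HPO;
# the DEMO identifiers below are placeholders, not actual ontology codes.
ABBREVIATIONS: Dict[str, str] = {"CF": "cystic fibrosis"}
ONTOLOGY: Dict[str, str] = {"cystic fibrosis": "DEMO:0001"}


def resolve_abbreviations(text: str) -> str:
    """Expand known all-caps abbreviations; unknown tokens are left unchanged."""
    def expand(match):
        token = match.group(0)
        return ABBREVIATIONS.get(token, token)
    return re.sub(r"\b[A-Z]{2,}\b", expand, text)


def ground_to_ontology(text: str) -> List[str]:
    """Return placeholder codes for every ontology term found in the text."""
    lowered = text.lower()
    return [code for term, code in ONTOLOGY.items() if term in lowered]


# Registry an agent could expose to the model as callable tools.
TOOLS: Dict[str, Callable] = {
    "resolve_abbreviations": resolve_abbreviations,
    "ground_to_ontology": ground_to_ontology,
}


def run_pipeline(note: str) -> List[str]:
    """Fixed-order stand-in for the agent loop: expand first, then ground."""
    expanded = TOOLS["resolve_abbreviations"](note)
    return TOOLS["ground_to_ontology"](expanded)
```

Calling `run_pipeline("Pt with hx of CF, on enzymes.")` expands the abbreviation before matching, which is the step a plain string match on the raw note would miss.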
If this is right
- Rare-disease populations become visible in existing EHR data without new annotation campaigns.
- Expert review effort drops because uncertainty flags direct attention only to ambiguous cases.
- Deployment can move to local standard hardware, removing the need to send protected health information to external cloud services.
- The same agent pattern could scale to other sparsely coded conditions once the core tools are in place.
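The uncertainty-flagging consequence can be illustrated with a simple triage step. This is a sketch under stated assumptions: the page does not say how RDMA scores uncertainty, so the `confidence` field, the `Extraction` record, and the 0.8 threshold are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Extraction(NamedTuple):
    mention: str       # text span found in the note
    code: str          # placeholder ontology identifier (not a real code)
    confidence: float  # hypothetical score in [0, 1]


def triage(
    extractions: List[Extraction], threshold: float = 0.8
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into auto-accepted and flagged-for-expert-review."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    flagged = [e for e in extractions if e.confidence < threshold]
    return accepted, flagged
```

Only the `flagged` list would go to a specialist, which is how review effort could shrink without dropping ambiguous cases.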
Where Pith is reading between the lines
- Hospitals could run the system nightly on existing note archives to generate candidate rare-disease lists for specialist review.
- The uncertainty-flagging step might serve as a general template for reducing labeling cost in other clinical extraction tasks.
- If the tool set generalizes, similar agent designs could address other under-coded medical domains such as social determinants or adverse-event detection.
Load-bearing premise
The provided tools are enough for a small quantized model to handle the noise and abbreviations in actual clinical notes reliably enough to beat trained baselines.
What would settle it
A held-out collection of real clinical notes containing many rare-disease mentions and heavy abbreviation use where the small quantized RDMA model fails to exceed the accuracy of a fine-tuned baseline.
Original abstract
Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth, over 50% of Orphanet codes lack a direct ICD mapping and only 2.2% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RDMA, an agentic framework that equips smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO to extract rare disease information from noisy, abbreviation-dense clinical notes in EHRs. The central claim is that RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with varying data characteristics, without any task-specific training; a small quantized model achieves peak performance, yielding up to 10x inference cost reduction and 17x local hardware cost reduction while enabling private on-premise deployment, and an uncertainty-flagging mechanism reduces expert annotation burden.
Significance. If the empirical results hold under scrutiny, the work could meaningfully advance scalable rare-disease documentation by demonstrating that tool-augmented small models can generalize to real clinical notes without fine-tuning or large annotated datasets, while addressing privacy and cost barriers. The emphasis on cost reductions and uncertainty flagging is clinically relevant. The significance is tempered by the current lack of detailed quantitative support and component ablations needed to confirm that the claimed gains are attributable to the proposed framework rather than untested assumptions about tool sufficiency.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experimental Results): The abstract asserts that RDMA 'substantially outperforms fine-tuned and RAG-based baselines across benchmarks' yet supplies no numeric metrics, exact baseline implementations, statistical significance tests, or error bars. This absence directly undermines evaluation of the headline performance and cost-reduction claims (10x inference, 17x hardware), which are load-bearing for the paper's contribution.
- [§4.2–4.3] §4.2–4.3 (Ablations and Robustness): No ablation results are presented that remove or isolate individual tools (abbreviation resolution, implicit phenotype reasoning, ontology grounding) or that evaluate performance on progressively noisier held-out clinical notes. Without these controls it is impossible to determine whether the reported outperformance on real-world abbreviation-dense notes is driven by the agentic tool suite or by the base quantized LLM, which is the central assumption underlying the claim of training-free generalization and the associated cost savings.
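The leave-one-tool-out ablation requested in the second major comment could be organized as below. This is a sketch, not the authors' protocol: `evaluate` is a hypothetical callback that runs the agent with a given set of enabled tools and returns a score, and the tool names are illustrative labels taken from the prose.

```python
from typing import Callable, Dict, FrozenSet, Set

# Illustrative labels for the three tools named in the report.
TOOLS = (
    "abbreviation_resolution",
    "implicit_phenotype_reasoning",
    "ontology_grounding",
)


def f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 over predicted vs. gold codes for one benchmark."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def leave_one_out(evaluate: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Score drop caused by disabling each tool while keeping the others enabled."""
    all_tools = frozenset(TOOLS)
    full_score = evaluate(all_tools)
    return {tool: full_score - evaluate(all_tools - {tool}) for tool in TOOLS}
```

A large positive delta for a tool suggests the reported gains depend on it; deltas near zero across all tools would point to the base model doing most of the work.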
minor comments (2)
- [§3] §3 (Method): The description of how tool outputs are aggregated and passed back to the LLM could be clarified with a short pseudocode snippet or explicit state diagram to improve reproducibility.
- [Table 1] Table 1 or equivalent benchmark table: Ensure all baselines are described with the exact model sizes, quantization levels, and prompting strategies used so that the 'no task-specific training' comparison is fully transparent.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the suggested improvements where feasible.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The abstract asserts that RDMA 'substantially outperforms fine-tuned and RAG-based baselines across benchmarks' yet supplies no numeric metrics, exact baseline implementations, statistical significance tests, or error bars. This absence directly undermines evaluation of the headline performance and cost-reduction claims (10x inference, 17x hardware), which are load-bearing for the paper's contribution.
  Authors: We agree that the abstract would benefit from explicit quantitative support to make the claims more immediately verifiable. In the revised manuscript, we will update the abstract to report key metrics such as F1-score gains over baselines, exact inference cost reductions (e.g., 10x), and hardware cost savings (e.g., 17x), while directing readers to the corresponding tables in §4. We will also expand §4 to fully specify baseline implementations (including model sizes, quantization levels, and RAG configurations), include statistical significance testing (e.g., McNemar’s test or paired t-tests with p-values), and add error bars or standard deviations across multiple runs for all primary results.
  revision: yes
- Referee: [§4.2–4.3] §4.2–4.3 (Ablations and Robustness): No ablation results are presented that remove or isolate individual tools (abbreviation resolution, implicit phenotype reasoning, ontology grounding) or that evaluate performance on progressively noisier held-out clinical notes. Without these controls it is impossible to determine whether the reported outperformance on real-world abbreviation-dense notes is driven by the agentic tool suite or by the base quantized LLM, which is the central assumption underlying the claim of training-free generalization and the associated cost savings.
  Authors: We concur that targeted ablations are important for isolating the contribution of the tool suite. Although the existing comparisons to fine-tuned and RAG baselines provide indirect evidence of the framework’s value, we will add a dedicated ablation study in the revised §4. This will include variants that disable one tool at a time (abbreviation resolution, implicit phenotype reasoning, and ontology grounding) while keeping the rest of the agent intact, reporting performance deltas on the same benchmarks. We will also introduce robustness experiments on progressively noisier held-out clinical note subsets (e.g., by synthetically increasing abbreviation density and noise levels) to directly test generalization under realistic EHR conditions.
  revision: yes
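The McNemar's test mentioned in the first response compares two systems on the same notes using only the cases where they disagree. The sketch below implements the exact (binomial) two-sided form with the standard library; the argument names `b` and `c` are assumptions for illustration (notes where only system A is correct, and notes where only system B is correct), and nothing here reflects the authors' actual analysis code.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: cases where system A is correct and system B is wrong.
    c: cases where system B is correct and system A is wrong.
    Under the null hypothesis, each discordant case is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Probability of seeing k or fewer "heads" in n fair flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test ignores notes where both systems agree, a large benchmark with few disagreements can still yield a non-significant result, which is worth keeping in mind when reading headline F1 gaps.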
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
Full rationale
The paper presents an agentic framework (RDMA) that augments smaller quantized LLMs with tools for abbreviation resolution, phenotype reasoning, and ontology grounding, then reports empirical outperformance versus fine-tuned and RAG baselines on benchmarks with varying data characteristics. No equations, derivations, or load-bearing self-citations appear in the abstract or description that would reduce any claimed result to a fitted parameter or self-referential definition by construction. All central claims are tested against independent external baselines rather than internal fits, satisfying the self-contained-against-benchmarks criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Smaller quantized LLMs equipped with abbreviation resolution, implicit phenotype reasoning, and ontology grounding tools can generalize to real clinical notes without task-specific fine-tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Paper passage: RDMA connects scattered clinical observations... tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: RDMA substantially outperforms fine-tuned and RAG-based baselines... without any task-specific training.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.