
arxiv: 2507.15867 · v2 · pith:KZ5K6C5Hnew · submitted 2025-07-14 · 💻 cs.LG · cs.AI · cs.CL · cs.MA

RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records

Pith reviewed 2026-05-19 03:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.MA
keywords rare disease mining · electronic health records · clinical notes · agentic framework · quantized language models · ontology grounding · phenotype reasoning

The pith

Agent tools let small quantized models extract rare diseases from noisy clinical notes without any task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RDMA as an agentic system that gives smaller language models access to tools for resolving abbreviations, reasoning about implicit phenotypes, and grounding findings to the Orphanet and HPO ontologies. This setup targets a coding gap: more than half of Orphanet codes lack a direct ICD mapping, so patient cases stay invisible in structured records. By operating directly on long, abbreviation-heavy notes, the approach avoids the fine-tuning and large annotated datasets that clinical NLP tasks usually require. The reported result is higher accuracy than both fine-tuned models and retrieval-augmented baselines across varied benchmarks, achieved with a quantized model that also lowers inference and hardware costs.

Core claim

RDMA shows that an agentic framework supplying abbreviation resolution, implicit phenotype reasoning, and ontology grounding tools allows a small quantized LLM to outperform fine-tuned and RAG baselines on rare-disease extraction from real-world clinical notes without task-specific training or large labeled data.

What carries the argument

The RDMA agentic framework, which equips the model with callable tools for abbreviation resolution, phenotype reasoning, and ontology grounding against Orphanet and HPO.
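
The tool chain named above can be illustrated with a minimal sketch. It is not the authors' implementation: the lookup tables, the function names (`resolve_abbreviations`, `ground_to_ontology`, `mine_note`), and the sample note are all hypothetical, and the implicit phenotype reasoning step, which would require a language model call, is left out. The two identifiers are included for illustration only and should be checked against the current Orphanet release.

```python
import re

# Hypothetical toy resources; a real system would query the full ontologies.
ABBREVIATIONS = {
    "CF": "cystic fibrosis",
    "ALS": "amyotrophic lateral sclerosis",
}
ONTOLOGY = {
    "cystic fibrosis": "ORPHA:586",
    "amyotrophic lateral sclerosis": "ORPHA:803",
}


def resolve_abbreviations(text):
    """Expand known all-caps abbreviations; leave unknown ones untouched."""
    def expand(match):
        token = match.group(0)
        return ABBREVIATIONS.get(token, token)

    return re.sub(r"\b[A-Z]{2,}\b", expand, text)


def ground_to_ontology(text):
    """Return sorted (term, code) pairs for ontology terms found in the text."""
    lowered = text.lower()
    return sorted((term, code) for term, code in ONTOLOGY.items() if term in lowered)


def mine_note(note):
    """Run the two tools in sequence: expand abbreviations, then ground terms."""
    return ground_to_ontology(resolve_abbreviations(note))


findings = mine_note("Pt w/ hx of CF, admitted for SOB.")
print(findings)
```

In this sketch the abbreviation step runs before grounding, so a note that only ever writes "CF" can still be matched to an ontology entry; whether RDMA orders its tools this way is an assumption.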

If this is right

  • Rare-disease populations become visible in existing EHR data without new annotation campaigns.
  • Expert review effort drops because uncertainty flags direct attention only to ambiguous cases.
  • Deployment can move to local standard hardware, removing the need to send protected health information to external cloud services.
  • The same agent pattern could scale to other sparsely coded conditions once the core tools are in place.
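
One way to picture the uncertainty-flagging idea in the second bullet is a simple routing step. The sketch below assumes each extracted finding carries a numeric confidence score and that a fixed threshold separates auto-accepted findings from those sent to an expert; the source does not describe how RDMA computes or uses uncertainty, so the `Finding` class, the `route_findings` function, the threshold, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    term: str
    code: str
    confidence: float  # hypothetical score in [0, 1]


def route_findings(findings, threshold=0.8):
    """Split findings into auto-accepted ones and ones flagged for expert review."""
    accepted = [f for f in findings if f.confidence >= threshold]
    flagged = [f for f in findings if f.confidence < threshold]
    return accepted, flagged


# Illustrative values only; not results from the paper.
findings = [
    Finding("cystic fibrosis", "ORPHA:586", 0.95),
    Finding("seizure", "HP:0001250", 0.55),
]
accepted, flagged = route_findings(findings)
print([f.term for f in accepted], [f.term for f in flagged])
```

Under this design, reviewer time scales with the number of flagged findings rather than with the total number of notes.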

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could run the system nightly on existing note archives to generate candidate rare-disease lists for specialist review.
  • The uncertainty-flagging step might serve as a general template for reducing labeling cost in other clinical extraction tasks.
  • If the tool set generalizes, similar agent designs could address other under-coded medical domains such as social determinants or adverse-event detection.

Load-bearing premise

The provided tools are enough for a small quantized model to handle the noise and abbreviations in actual clinical notes reliably enough to beat trained baselines.

What would settle it

A held-out collection of real clinical notes containing many rare-disease mentions and heavy abbreviation use where the small quantized RDMA model fails to exceed the accuracy of a fine-tuned baseline.
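
Such a comparison needs a shared metric. The sketch below shows one common choice, micro-averaged F1 over the set of ontology codes predicted per note; the paper's actual evaluation protocol is not given in this page, so the metric choice, the `micro_f1` function, and the toy data are assumptions.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-note sets of ontology codes."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # codes found and correct
        fp += len(pred - ref)   # codes found but not in the reference
        fn += len(ref - pred)   # reference codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations for three notes; not data from the paper.
gold = [{"ORPHA:586"}, {"ORPHA:803"}, set()]
system_a = [{"ORPHA:586"}, {"ORPHA:803"}, set()]
system_b = [{"ORPHA:586"}, set(), {"ORPHA:803"}]

score_a = micro_f1(system_a, gold)
score_b = micro_f1(system_b, gold)
print(score_a, score_b)
```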

Original abstract

Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth: over 50% of Orphanet codes lack a direct ICD mapping and only 2.2% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RDMA, an agentic framework that equips smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO to extract rare disease information from noisy, abbreviation-dense clinical notes in EHRs. The central claim is that RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with varying data characteristics, without any task-specific training; a small quantized model achieves peak performance, yielding up to 10x inference cost reduction and 17x local hardware cost reduction while enabling private on-premise deployment, and an uncertainty-flagging mechanism reduces expert annotation burden.

Significance. If the empirical results hold under scrutiny, the work could meaningfully advance scalable rare-disease documentation by demonstrating that tool-augmented small models can generalize to real clinical notes without fine-tuning or large annotated datasets, while addressing privacy and cost barriers. The emphasis on cost reductions and uncertainty flagging is clinically relevant. The significance is tempered by the current lack of detailed quantitative support and component ablations needed to confirm that the claimed gains are attributable to the proposed framework rather than untested assumptions about tool sufficiency.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): The abstract asserts that RDMA 'substantially outperforms fine-tuned and RAG-based baselines across benchmarks' yet supplies no numeric metrics, exact baseline implementations, statistical significance tests, or error bars. This absence directly undermines evaluation of the headline performance and cost-reduction claims (10x inference, 17x hardware), which are load-bearing for the paper's contribution.
  2. [§4.2–4.3] §4.2–4.3 (Ablations and Robustness): No ablation results are presented that remove or isolate individual tools (abbreviation resolution, implicit phenotype reasoning, ontology grounding) or that evaluate performance on progressively noisier held-out clinical notes. Without these controls it is impossible to determine whether the reported outperformance on real-world abbreviation-dense notes is driven by the agentic tool suite or by the base quantized LLM, which is the central assumption underlying the claim of training-free generalization and the associated cost savings.
minor comments (2)
  1. [§3] §3 (Method): The description of how tool outputs are aggregated and passed back to the LLM could be clarified with a short pseudocode snippet or explicit state diagram to improve reproducibility.
  2. [Table 1] Table 1 or equivalent benchmark table: Ensure all baselines are described with the exact model sizes, quantization levels, and prompting strategies used so that the 'no task-specific training' comparison is fully transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The abstract asserts that RDMA 'substantially outperforms fine-tuned and RAG-based baselines across benchmarks' yet supplies no numeric metrics, exact baseline implementations, statistical significance tests, or error bars. This absence directly undermines evaluation of the headline performance and cost-reduction claims (10x inference, 17x hardware), which are load-bearing for the paper's contribution.

    Authors: We agree that the abstract would benefit from explicit quantitative support to make the claims more immediately verifiable. In the revised manuscript, we will update the abstract to report key metrics such as F1-score gains over baselines, exact inference cost reductions (e.g., 10x), and hardware cost savings (e.g., 17x), while directing readers to the corresponding tables in §4. We will also expand §4 to fully specify baseline implementations (including model sizes, quantization levels, and RAG configurations), include statistical significance testing (e.g., McNemar’s test or paired t-tests with p-values), and add error bars or standard deviations across multiple runs for all primary results. revision: yes

  2. Referee: [§4.2–4.3] §4.2–4.3 (Ablations and Robustness): No ablation results are presented that remove or isolate individual tools (abbreviation resolution, implicit phenotype reasoning, ontology grounding) or that evaluate performance on progressively noisier held-out clinical notes. Without these controls it is impossible to determine whether the reported outperformance on real-world abbreviation-dense notes is driven by the agentic tool suite or by the base quantized LLM, which is the central assumption underlying the claim of training-free generalization and the associated cost savings.

    Authors: We concur that targeted ablations are important for isolating the contribution of the tool suite. Although the existing comparisons to fine-tuned and RAG baselines provide indirect evidence of the framework’s value, we will add a dedicated ablation study in the revised §4. This will include variants that disable one tool at a time (abbreviation resolution, implicit phenotype reasoning, and ontology grounding) while keeping the rest of the agent intact, reporting performance deltas on the same benchmarks. We will also introduce robustness experiments on progressively noisier held-out clinical note subsets (e.g., by synthetically increasing abbreviation density and noise levels) to directly test generalization under realistic EHR conditions. revision: yes
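
The significance testing proposed in the first response can be illustrated with an exact McNemar test, which compares two systems evaluated on the same items using only the cases where they disagree. This is a minimal standard-library sketch; the function name `mcnemar_exact` and the discordant counts are illustrative assumptions and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items system A got right and system B got wrong.
    c: items system B got right and system A got wrong.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: A alone correct on 15 items, B alone correct on 3.
p_value = mcnemar_exact(b=15, c=3)
print(round(p_value, 4))
```

Items on which both systems agree do not enter the statistic, so the test stays informative even when both systems are correct on most of the benchmark.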

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper presents an agentic framework (RDMA) that augments smaller quantized LLMs with tools for abbreviation resolution, phenotype reasoning, and ontology grounding, then reports empirical outperformance versus fine-tuned and RAG baselines on benchmarks with varying data characteristics. No equations, derivations, or load-bearing self-citations appear in the abstract or description that would reduce any claimed result to a fitted parameter or self-referential definition by construction. All central claims are tested against independent external baselines rather than internal fits, satisfying the self-contained-against-benchmarks criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that tool-augmented smaller LLMs can perform implicit phenotype reasoning and ontology grounding on noisy clinical text without task-specific training data.

axioms (1)
  • domain assumption: Smaller quantized LLMs equipped with abbreviation resolution, implicit phenotype reasoning, and ontology grounding tools can generalize to real clinical notes without task-specific fine-tuning.
    This premise is required for the claim that no annotations are needed and that performance exceeds fine-tuned baselines.

pith-pipeline@v0.9.0 · 5756 in / 1269 out tokens · 40663 ms · 2026-05-19T03:53:12.168797+00:00 · methodology


