pith. machine review for the scientific record. sign in

arxiv: 2605.06226 · v2 · submitted 2026-05-07 · 💻 cs.AI · q-bio.GN

Recognition: 2 theorem links

· Lean Theorem

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3

classification 💻 cs.AI q-bio.GN
keywords rare disease diagnosisAI agentmulti-modal data integrationgene prioritizationclinical decision supporthallucination mitigationprecision medicine
0
0 comments X

The pith

Hygieia, a multi-modal AI agent, diagnoses rare diseases more accurately than physicians by routing across genetic, phenotypic, and clinical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hygieia as an AI agent system built to address slow and inaccurate diagnosis of rare diseases by combining phenotypic features, genetic profiles, and clinical records in one workflow. It uses a router-based and knowledge-enhanced framework to select appropriate strategies for different disease types while limiting hallucinations in generated outputs. The authors report that this produces state-of-the-art results on diagnostic benchmarks and yields 12 to 60 percent higher accuracy than physicians in tests with clinical experts, along with risk-gene prioritization and confidence scores. A sympathetic reader would care because rare diseases often involve long delays before correct identification, and a tool that improves both speed and precision could shorten that process and lower the workload on clinicians.

Core claim

Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks for rare diseases and shows superior diagnostic accuracy compared to physicians, with gains ranging from 12 to 60 percent, while also prioritizing risk-related genomic factors and supplying confidence scores to support clinical decisions.

What carries the argument

A router-based and knowledge-enhanced framework that selects tailored diagnostic strategies for different disease categories and mitigates hallucination while integrating multi-modal inputs.

If this is right

  • Clinicians receive prioritized risk genes and confidence scores that can narrow testing options and guide treatment choices for rare diseases.
  • The system reduces the time clinicians spend reviewing and interpreting complex medical records in real-world cases.
  • Diagnostic accuracy improves across benchmarks, which would shorten the average time to correct identification of rare conditions.
  • The framework supplies interpretable outputs that can be reviewed alongside physician judgment rather than replacing it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the routing logic generalizes, the same architecture could be tested on other diagnostic domains that mix genetic and phenotypic data.
  • Deployment would require ongoing monitoring of outputs against new patient cohorts to confirm that performance does not degrade with shifts in data distribution.
  • The reported reduction in clinician workload suggests the agent could serve as a first-pass filter in settings with limited access to rare-disease specialists.

Load-bearing premise

The router-based and knowledge-enhanced framework reliably reduces hallucination and delivers unbiased gene prioritization when run on real-world heterogeneous clinical data without adding new errors from its training sources or data selection.

What would settle it

A prospective trial on a fresh cohort of undiagnosed rare-disease cases drawn from multiple hospitals where Hygieia's final diagnostic suggestions and gene rankings are compared against independent expert consensus and confirmed molecular results, checking whether the reported accuracy gain disappears or hallucinations appear in the reasoning traces.

Figures

Figures reproduced from arXiv: 2605.06226 by Benny Kai Guo Loo, Botao Yu, Hongyu Zhao, Hua Xu, Hui Zhang, James Zou, Jeffries Lauran, Jianlei Gu, Kexin Huang, Nan Liu, Rui Yang, Tianyu Liu, Wangjie Zheng, Weihao Xuan, Yonghui Jiang.

Figure 1
Figure 1. Figure 1: Overall pipeline of Hygieia. (a) Here we showcase how Hygieia can help physicians and view at source ↗
Figure 2
Figure 2. Figure 2: Benchmarking results for Hygieia in rare disease diagnosis. (a) Number of cases in our view at source ↗
Figure 3
Figure 3. Figure 3: Case study of Hygieia and other baselines in disease diagnosis. We mask some phenotypes view at source ↗
Figure 4
Figure 4. Figure 4: Benchmarking results for Hygieia in risk gene prioritization. (a) Statistics in our testing view at source ↗
Figure 5
Figure 5. Figure 5: Case study for Hygieia and other baselines in risk gene prioritization. We mask some view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of Human-AI collaboration for disease diagnosis and decision correction based view at source ↗
read the original abstract

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Hygieia, a multi-modal AI agent for rare disease diagnosis and risk gene prioritization. It integrates phenotypic features, genetic profiles, and clinical records via a router-based, knowledge-enhanced framework designed to mitigate hallucinations, tailor strategies to disease categories, prioritize risk genes, and output confidence scores. The central claims are state-of-the-art performance across multiple diagnostic benchmarks plus, in a collaboration with clinicians from Yale School of Medicine and Duke-NUS, 12-60% superior diagnostic performance over physicians and reduced workload on real-world cases.

Significance. If the evaluation details and controls support the claims, this would represent a meaningful advance in clinical decision support for rare diseases, where diagnostic delays are common. The router-based multi-modal design and explicit gene prioritization with confidence scores address practical needs for interpretability and hallucination control; reproducible benchmarks plus clinician validation would strengthen its utility as a workload-reducing tool.

major comments (1)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the manuscript asserts SOTA benchmark results and 12-60% physician improvement but supplies no details on the specific diagnostic benchmarks, dataset sizes/composition, baseline methods, statistical tests, or controls for selection bias in the clinician study. These omissions are load-bearing for the central performance claims, as they prevent verification of whether the data actually support the stated superiority.
minor comments (2)
  1. [Methods] The description of the router mechanism and knowledge-enhancement components would benefit from a high-level diagram or pseudocode to clarify how routing decisions are made across modalities.
  2. [Clinical Validation] Clarify the exact number of clinicians, cases, and blinding procedures in the Yale/Duke-NUS collaboration study to allow readers to assess the practical utility results.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our performance claims. We agree that the abstract and evaluation sections require substantial expansion to allow verification of the reported results, and we will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the manuscript asserts SOTA benchmark results and 12-60% physician improvement but supplies no details on the specific diagnostic benchmarks, dataset sizes/composition, baseline methods, statistical tests, or controls for selection bias in the clinician study. These omissions are load-bearing for the central performance claims, as they prevent verification of whether the data actually support the stated superiority.

    Authors: We agree that the submitted manuscript does not include sufficient methodological details to substantiate the SOTA benchmark results or the 12-60% diagnostic improvement over physicians. In the revised version we will expand the Evaluation section with: (1) explicit names and descriptions of all diagnostic benchmarks together with their dataset sizes, class distributions, and sources; (2) a complete list of baseline methods and their implementations; (3) the statistical tests used (including exact p-values and confidence intervals); and (4) a dedicated subsection on the clinician study that describes case selection procedures, randomization, blinding, and any other controls for selection bias. We will also add supplementary tables containing the raw performance numbers and workload metrics. These additions will directly address the load-bearing omissions noted by the referee. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents Hygieia as an empirical AI system whose central claims (SOTA benchmark performance and 12-60% diagnostic improvement over physicians) rest on external validation against held-out test sets, ground-truth labels, and independent clinician assessments from Yale and Duke-NUS. No load-bearing step reduces a claimed result to a quantity defined by the model's own fitted parameters, self-citations, or ansatz. The router-based framework and knowledge enhancement are described as architectural choices evaluated on separate data; performance numbers are not constructed by re-using the same fitted quantities as both input and output. Self-citations, if present, are not invoked to justify uniqueness or forbid alternatives in a way that collapses the argument. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the contribution is the overall system description rather than explicit mathematical components.

pith-pipeline@v0.9.0 · 5562 in / 1125 out tokens · 64275 ms · 2026-05-12T03:32:35.371517+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

  1. [1]

    Public health and rare diseases: oxymoron no more.Preventing chronic disease, 13:E05, 2016

    Rodolfo Valdez, Lijing Ouyang, and Julie Bolen. Public health and rare diseases: oxymoron no more.Preventing chronic disease, 13:E05, 2016

  2. [2]

    Estimatingcumulative pointprevalenceofrarediseases: analysisoftheorphanetdatabase.Europeanjournalofhuman genetics, 28(2):165–173, 2020

    Stéphanie Nguengang Wakap, Deborah M Lambert, Annie Olry, Charlotte Rodwell, Charlotte Gueydan, ValérieLanneau, DanielMurphy, YannLeCam, andAnaRath. Estimatingcumulative pointprevalenceofrarediseases: analysisoftheorphanetdatabase.Europeanjournalofhuman genetics, 28(2):165–173, 2020

  3. [3]

    Hope for a rare disease: eculizumab in neuromyelitis optica.The Lancet Neurology, 12(6):529–531, 2013

    Friedemann Paul. Hope for a rare disease: eculizumab in neuromyelitis optica.The Lancet Neurology, 12(6):529–531, 2013

  4. [4]

    Access in the rare diseases landscape.The Lancet Global Health, 12(10):e1587, 2024

    Anneliene H Jonker, Maria Cavaller-Bellaubi, Yukiko Nishimura, and David A Pearce. Access in the rare diseases landscape.The Lancet Global Health, 12(10):e1587, 2024

  5. [5]

    An agentic system for rare disease diagnosis with traceable reasoning.Nature, 2026

    Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Ya Zhang, Yongguo Yu, et al. An agentic system for rare disease diagnosis with traceable reasoning.Nature, 2026

  6. [6]

    Artificial intelligence in rare disease diagnostics: Shortening the path to early detection

    Tapan Ghosh. Artificial intelligence in rare disease diagnostics: Shortening the path to early detection. 2025

  7. [7]

    DongDong,RogerYat-NorkChung,RufinaHWChan,ShiweiGong,andRichardHuanXu. Why is misdiagnosis more likely among some people with rare diseases than others? insights from a population-based cross-sectional study in china.Orphanet journal of rare diseases, 15(1):307, 2020

  8. [8]

    Deeplearningforraredisease: Ascopingreview.Journal of biomedical informatics, 135:104227, 2022

    Junghwan Lee, Cong Liu, Junyoung Kim, Zhehuan Chen, Yingcheng Sun, James R Rogers, WendyKChung, andChunhuaWeng. Deeplearningforraredisease: Ascopingreview.Journal of biomedical informatics, 135:104227, 2022

  9. [9]

    Matteo Rossi and Aisha El-Sayed. Meta-learning driven few-shot diagnostics: Addressing rare disease classification in medical ai.International Journal of Advanced Artificial Intelligence Re- search, 2(05):7–14, 2025

  10. [10]

    Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023

  11. [11]

    Visual–language foundation models in medicine

    Chunyu Liu, Yixiao Jin, Zhouyu Guan, Tingyao Li, Yiming Qin, Bo Qian, Zehua Jiang, Yilan Wu, Xiangning Wang, Ying Feng Zheng, et al. Visual–language foundation models in medicine. The Visual Computer, 41(4):2953–2972, 2025

  12. [12]

    Natural language processing for digital health in the era of large language models.Yearbook of Medical Informatics, 33(01):229–240, 2024

    Abeed Sarker, Rui Zhang, Yanshan Wang, Yunyu Xiao, Sudeshna Das, Dalton Schutte, David Oniani, Qianqian Xie, and Hua Xu. Natural language processing for digital health in the era of large language models.Yearbook of Medical Informatics, 33(01):229–240, 2024

  13. [13]

    Geneverse: A col- lection of open-source multimodal large language models for genomic and proteomic research

    Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, Wenjin Zheng, and Hongyu Zhao. Geneverse: A col- lection of open-source multimodal large language models for genomic and proteomic research. InFindings of the association for computational linguistics: EMNLP 2024, pages 4819–4836, 2024. 18 A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

  14. [14]

    Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322, 2025

  15. [15]

    Accelerating scientific discovery with autonomous goal- evolving agents.arXiv preprint arXiv:2512.21782, 2025

    Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, et al. Accelerating scientific discovery with autonomous goal- evolving agents.arXiv preprint arXiv:2512.21782, 2025

  16. [16]

    Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data.Nature Biomedical Engineering, pages 1–18, 2026

    Tianyu Liu, Tinglin Huang, Tong Ding, Hao Wu, Peter Humphrey, Sudhir Perincheri, Kurt Schalper, Rex Ying, Hua Xu, James Zou, et al. Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data.Nature Biomedical Engineering, pages 1–18, 2026

  17. [17]

    TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

    Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Heli Qi, Rui Yang, SimengHan,TinglinHuang,FangWu,etal. Teampath: Buildingmultimodalpathologyexperts with reasoning ai copilots.arXiv preprint arXiv:2511.17652, 2025

  18. [18]

    Rarebench: can llms serve as rare diseases specialists? InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 4850–4861, 2024

    Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, and Ting Chen. Rarebench: can llms serve as rare diseases specialists? InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 4850–4861, 2024

  19. [19]

    Rarearena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis.The Lancet Digital Health, 2026

    Haichao Chen, Zhengyun Zhao, Songchi Zhou, Shikai Hu, Jinyuan Wang, Ye Jin, Xianghong Jin, Yih Chung Tham, Xiaofei Wang, Weizhi Ma, et al. Rarearena: a comprehensive benchmark dataset unveiling the potential of large language models in rare disease diagnosis.The Lancet Digital Health, 2026

  20. [20]

    Visual- rag: Knowledge-guided retrieval augmentation for image-text matching.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Hengchang Wang, Li Liu, Huaxiang Zhang, Lei Zhu, Xiaojun Chang, and Hao Du. Visual- rag: Knowledge-guided retrieval augmentation for image-text matching.IEEE Transactions on Circuits and Systems for Video Technology, 2025

  21. [21]

    Multi-agent sys- tem based medical diagnosis using particle swarm optimization in healthcare

    Jagjit Singh Dhatterwal, Mahaveer Singh Naruka, and Kuldeep Singh Kaswan. Multi-agent sys- tem based medical diagnosis using particle swarm optimization in healthcare. In2023 Inter- national Conference on Artificial Intelligence and Smart Communication (AISC), pages 889–893. IEEE, 2023

  22. [22]

    Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

    Xi Chen, Huahui Yi, Mingke You, WeiZhi Liu, Li Wang, Hairui Li, Xue Zhang, Yingman Guo, Lei Fan, Gang Chen, et al. Enhancing diagnostic capability with multi-agents conversational large language models.NPJ digital medicine, 8(1):159, 2025

  23. [23]

    Vc-rdagent: Anefficientrarediseasediagnosisagentviavirtual case construction informed by hybrid statistical-metric and hyperbolic-semantic prioritization

    Yang Liu, Honglei Li, Peng Jiang, Lizhen Wu, Zhi Xie, Chao Ning, Xiangya Kong, Yayun Wang, XinleiZhang, andZechiHuang. Vc-rdagent: Anefficientrarediseasediagnosisagentviavirtual case construction informed by hybrid statistical-metric and hyperbolic-semantic prioritization. bioRxiv, pages 2026–02, 2026

  24. [24]

    Variant-level matching for diagnosis and discovery: Challenges and opportunities.Human mutation, 43(6):782–790, 2022

    ElietedaSRodrigues, SeanGriffith, RenanMartin, CorinaAntonescu, JenniferEPosey, Zeynep Coban-Akdemir, Shalini N Jhangiani, Kimberly F Doheny, James R Lupski, David Valle, et al. Variant-level matching for diagnosis and discovery: Challenges and opportunities.Human mutation, 43(6):782–790, 2022

  25. [25]

    Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29), 2018

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29), 2018. 19 A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

  26. [26]

    The un- diagnosed diseases network: accelerating discovery about health and disease.The American Journal of Human Genetics, 100(2):185–192, 2017

    Rachel B Ramoni, John J Mulvihill, David R Adams, Patrick Allard, Euan A Ashley, Jonathan A Bernstein, William A Gahl, Rizwan Hamid, Joseph Loscalzo, Alexa T McCray, et al. The un- diagnosed diseases network: accelerating discovery about health and disease.The American Journal of Human Genetics, 100(2):185–192, 2017

  27. [27]

    Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases.npj Digital Medicine, 8(1):380, 2025

    Emily Alsentzer, Michelle M Li, Shilpa N Kobren, Ayush Noori, Undiagnosed Diseases Network, Isaac S Kohane, and Marinka Zitnik. Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases.npj Digital Medicine, 8(1):380, 2025

  28. [28]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  29. [29]

    Clinical research for rare disease: opportunities,challenges,andsolutions.Moleculargeneticsandmetabolism,96(1):20– 26, 2009

    Robert C Griggs, Mark Batshaw, Mary Dunkle, Rashmi Gopal-Srivastava, Edward Kaye, Jef- frey Krischer, Tan Nguyen, Kathleen Paulus, Peter A Merkel, et al. Clinical research for rare disease: opportunities,challenges,andsolutions.Moleculargeneticsandmetabolism,96(1):20– 26, 2009

  30. [30]

    Unmaskingkabukisyndrome.Clinicalgenetics, 83(3):201–211, 2013

    NBögershausenandBWollnik. Unmaskingkabukisyndrome.Clinicalgenetics, 83(3):201–211, 2013

  31. [31]

    Causal machine learning for healthcare and precision medicine.Royal Society Open Science, 9(8):220638, 2022

    Pedro Sanchez, Jeremy P Voisey, Tian Xia, Hannah I Watson, Alison Q O’Neil, and Sotirios A Tsaftaris. Causal machine learning for healthcare and precision medicine.Royal Society Open Science, 9(8):220638, 2022

  32. [32]

    Rare-disease genetics in the era of next-generation sequencing: discovery to translation.Nature Reviews Genetics, 14(10):681–691, 2013

    Kym M Boycott, Megan R Vanstone, Dennis E Bulman, and Alex E MacKenzie. Rare-disease genetics in the era of next-generation sequencing: discovery to translation.Nature Reviews Genetics, 14(10):681–691, 2013

  33. [33]

    Gpt-5systemcard

    OpenAI. Gpt-5systemcard. 2025. Availableat: https://openai.com/index/introducing-gpt-5/

  34. [34]

    Claude sonnet 4.5.https://www.anthropic.com/claude/sonnet, 2025

    Anthropic. Claude sonnet 4.5.https://www.anthropic.com/claude/sonnet, 2025. Large language model

  35. [35]

    Biomni: A general-purpose biomedical ai agent

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. biorxiv, 2025

  36. [36]

    The human phenotype ontology.Clinical genetics, 77(6):525–534, 2010

    Peter N Robinson and Stefan Mundlos. The human phenotype ontology.Clinical genetics, 77(6):525–534, 2010

  37. [37]

    Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

  38. [38]

    Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020

    Jacob White. Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020

  39. [39]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJOstrow, AkilaWelihinda, AlanHayes, AlecRadford, etal. Gpt-4osystemcard.arXivpreprint arXiv:2410.21276, 2024

  40. [40]

    React: Synergizing reasoningand actinginlanguage models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoningand actinginlanguage models. InTheeleventh international conference on learning representations, 2022. 20 A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

  41. [41]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. InICLR, 2024

  42. [42]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

  43. [43]

    Think twice before assure: Confidence estimation for large language models through reflection on multiple answers.CoRR, 2024

    Moxin Li, Wenjie Wang, Fuli Feng, Fengbin Zhu, Qifan Wang, and Tat-Seng Chua. Think twice before assure: Confidence estimation for large language models through reflection on multiple answers.CoRR, 2024

  44. [44]

    Openai o3 and o4-mini system card.https://openai.com/index/ o3-o4-mini-system-card/, April 2025

    OpenAI. Openai o3 and o4-mini system card.https://openai.com/index/ o3-o4-mini-system-card/, April 2025. System card. Accessed 2025-11-07

  45. [45]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  46. [46]

    Lla- mafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, YeYanhan YeYanhan, and Zheyan Luo. Lla- mafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstra- tions), pages 400–410, 2024

  47. [47]

    Mdagents: An adaptive collabo- ration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collabo- ration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  48. [48]

    Rdguru: An intelligent agent for rare diseases

    Jian Yang, Liqi Shu, Huilong Duan, and Haomin Li. Rdguru: An intelligent agent for rare diseases. InAMIA Annual Symposium Proceedings, volume 2024, page 1275, 2025

  49. [49]

    Makediagnosisforthispatient. Knownphenotypesinclude: {phenotype_list}. Multiple local hospital evaluations failed to establish a definitive diagnosis

    Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, and Ting Chen. Rareagents: Autonomous multi-disciplinary team for rare disease diagnosis and treatment. arXiv preprint arXiv:2412.12475, 2024. 21 A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization A. Comparison between Hygieia and other AI agents System Distinguish Hum...

  50. [50]

    Distal arthrogryposis type 10

    If the diagnosis is correct, explicitly state that it is correct and explain why. 3. If the diagnosis is incorrect or incomplete, clearly state that it is incorrect and: - Provide the most likely corrected diag- nosis - Briefly justify the correction using key phenotype–disease matches 4. Do not provide multiple diagnoses—return one best diagnosis only. 5...