arxiv: 2605.06226 · v2 · submitted 2026-05-07 · 💻 cs.AI · q-bio.GN

A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu, Weihao Xuan, Kexin Huang, Nan Liu, James Zou, Yonghui Jiang, Hua Xu, Hongyu Zhao

Pith reviewed 2026-05-12 03:32 UTC · model grok-4.3

keywords: rare disease diagnosis · AI agent · multi-modal data integration · gene prioritization · clinical decision support · hallucination mitigation · precision medicine

The pith

Hygieia, a multi-modal AI agent, diagnoses rare diseases more accurately than physicians by routing across genetic, phenotypic, and clinical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hygieia as an AI agent system built to address slow and inaccurate diagnosis of rare diseases by combining phenotypic features, genetic profiles, and clinical records in one workflow. It uses a router-based and knowledge-enhanced framework to select appropriate strategies for different disease types while limiting hallucinations in generated outputs. The authors report that this produces state-of-the-art results on diagnostic benchmarks and yields 12 to 60 percent higher accuracy than physicians in tests with clinical experts, along with risk-gene prioritization and confidence scores. A sympathetic reader would care because rare diseases often involve long delays before correct identification, and a tool that improves both speed and precision could shorten that process and lower the workload on clinicians.

Core claim

Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks for rare diseases and shows superior diagnostic accuracy compared to physicians, with gains ranging from 12 to 60 percent, while also prioritizing risk-related genomic factors and supplying confidence scores to support clinical decisions.

What carries the argument

A router-based and knowledge-enhanced framework that selects tailored diagnostic strategies for different disease categories and mitigates hallucination while integrating multi-modal inputs.
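
To make the routing idea concrete, here is a minimal sketch of a dispatcher that sends a case to a category-specific strategy. It is not the authors' implementation. The passages quoted later on this page mention a classifier-based router, but its features and labels are not given here, so the category names, the rule-based `classify_case` stand-in, and the three strategy functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    """A patient case holding the three modalities named in the paper."""
    phenotypes: List[str] = field(default_factory=list)  # e.g. HPO term IDs
    variants: List[str] = field(default_factory=list)    # e.g. candidate genes
    notes: str = ""                                       # free-text clinical record


def classify_case(case: Case) -> str:
    """Hypothetical stand-in for a trained router.

    It picks a category from which modalities are present; a real router
    would be a learned classifier over richer features.
    """
    if case.variants:
        return "genetic"
    if case.phenotypes:
        return "phenotypic"
    return "record_only"


# Hypothetical strategies; each returns a description of the plan it would run.
def genetic_strategy(case: Case) -> str:
    return f"rank genes {sorted(case.variants)} against phenotypes"


def phenotypic_strategy(case: Case) -> str:
    return f"match {len(case.phenotypes)} phenotypes to a knowledge base"


def record_strategy(case: Case) -> str:
    return "summarize clinical record before diagnosis"


STRATEGIES: Dict[str, Callable[[Case], str]] = {
    "genetic": genetic_strategy,
    "phenotypic": phenotypic_strategy,
    "record_only": record_strategy,
}


def route(case: Case) -> str:
    """Dispatch the case to the strategy chosen by the router."""
    return STRATEGIES[classify_case(case)](case)
```

Keeping the router and the strategies in separate functions means either side can be swapped (for example, replacing the rule-based stand-in with a trained classifier) without touching the other.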

If this is right

Clinicians receive prioritized risk genes and confidence scores that can narrow testing options and guide treatment choices for rare diseases.
The system reduces the time clinicians spend reviewing and interpreting complex medical records in real-world cases.
Diagnostic accuracy improves across benchmarks, which would shorten the average time to correct identification of rare conditions.
The framework supplies interpretable outputs that can be reviewed alongside physician judgment rather than replacing it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the routing logic generalizes, the same architecture could be tested on other diagnostic domains that mix genetic and phenotypic data.
Deployment would require ongoing monitoring of outputs against new patient cohorts to confirm that performance does not degrade with shifts in data distribution.
The reported reduction in clinician workload suggests the agent could serve as a first-pass filter in settings with limited access to rare-disease specialists.

Load-bearing premise

The router-based and knowledge-enhanced framework reliably reduces hallucination and delivers unbiased gene prioritization when run on real-world heterogeneous clinical data without adding new errors from its training sources or data selection.

What would settle it

A prospective trial on a fresh cohort of undiagnosed rare-disease cases drawn from multiple hospitals where Hygieia's final diagnostic suggestions and gene rankings are compared against independent expert consensus and confirmed molecular results, checking whether the reported accuracy gain disappears or hallucinations appear in the reasoning traces.
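
One way such a comparison could be scored is sketched below, assuming each case yields a correct/incorrect flag for the agent and for the physicians against the same reference diagnosis. The function names and the choice of a paired bootstrap are illustrative assumptions, not the paper's protocol.

```python
import random
from typing import List, Tuple


def accuracy(correct: List[bool]) -> float:
    """Fraction of cases marked correct."""
    return sum(correct) / len(correct)


def paired_bootstrap_diff(
    agent: List[bool],
    physician: List[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, lower bound, upper bound).

    The difference is accuracy(agent) - accuracy(physician). The bounds are
    the 2.5th and 97.5th percentiles of that difference over bootstrap
    resamples of the cases, drawn with replacement. Resampling whole cases
    keeps the agent and physician results for the same case paired.
    """
    if len(agent) != len(physician) or not agent:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(agent)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(agent[i] for i in idx) / n - sum(physician[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return accuracy(agent) - accuracy(physician), lower, upper
```

If the interval for a fresh multi-hospital cohort includes zero, the reported gain would not be supported on that cohort; an interval well above zero would point the other way.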

Figures

Figures reproduced from arXiv: 2605.06226 by Benny Kai Guo Loo, Botao Yu, Hongyu Zhao, Hua Xu, Hui Zhang, James Zou, Jeffries Lauran, Jianlei Gu, Kexin Huang, Nan Liu, Rui Yang, Tianyu Liu, Wangjie Zheng, Weihao Xuan, Yonghui Jiang.

**Figure 1.** Overall pipeline of Hygieia. (a) Here we showcase how Hygieia can help physicians and …

**Figure 2.** Benchmarking results for Hygieia in rare disease diagnosis. (a) Number of cases in our …

**Figure 3.** Case study of Hygieia and other baselines in disease diagnosis. We mask some phenotypes …

**Figure 4.** Benchmarking results for Hygieia in risk gene prioritization. (a) Statistics in our testing …

**Figure 5.** Case study for Hygieia and other baselines in risk gene prioritization. We mask some …

**Figure 6.** Illustration of Human-AI collaboration for disease diagnosis and decision correction based …

Original abstract

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hygieia applies a router-based multi-modal agent to rare-disease diagnosis and gene prioritization, with clinician validation, but the strong SOTA and physician-superiority claims rest on evaluation details that need verification.

The core contribution is a router that picks a diagnostic strategy according to disease category, then folds in phenotypic, genetic, and clinical-record data while adding risk-gene ranking and confidence scores. That combination is the concrete new piece beyond generic diagnostic models or plain agents. The authors also ran a study with Yale and Duke-NUS clinicians on real cases, which is the kind of external check that most benchmark-only papers skip. Those two elements, tailored routing plus practical clinician feedback, give the work a usable shape for people who actually need to handle heterogeneous rare-disease data. The architecture description itself is clear enough that someone could re-implement the router and knowledge layer without too much guesswork.

The claims of SOTA benchmark results and 12-60% better performance than physicians are the parts that matter most for impact. If the full results section shows proper baselines, held-out test sets, and controls for case selection, the numbers would be worth citing in medical-AI work. The stress-test note says the construction does not contain obvious internal contradictions, which matches what the abstract and described validation approach suggest.

The main soft spot is still the level of detail on evaluation. The abstract gives no dataset sizes, no list of competing methods, and no description of how the physician comparison was blinded or sampled. Until those are checked, the size of the reported gains remains hard to judge. Bias in the gene-prioritization step is another standing concern if the training sources over-represent well-studied conditions.

This paper is for readers who build or evaluate clinical decision-support agents and for medical-genetics groups that want to test multi-modal tools on their own cohorts. It is solid enough on the architecture side and timely enough on the application side that a serious editor should send it to referees rather than desk-reject it. The evaluation sections will probably need expansion, but the underlying idea is worth the review cycle.

Referee Report

1 major / 2 minor

Summary. The paper introduces Hygieia, a multi-modal AI agent for rare disease diagnosis and risk gene prioritization. It integrates phenotypic features, genetic profiles, and clinical records via a router-based, knowledge-enhanced framework designed to mitigate hallucinations, tailor strategies to disease categories, prioritize risk genes, and output confidence scores. The central claims are state-of-the-art performance across multiple diagnostic benchmarks plus, in a collaboration with clinicians from Yale School of Medicine and Duke-NUS, 12-60% superior diagnostic performance over physicians and reduced workload on real-world cases.

Significance. If the evaluation details and controls support the claims, this would represent a meaningful advance in clinical decision support for rare diseases, where diagnostic delays are common. The router-based multi-modal design and explicit gene prioritization with confidence scores address practical needs for interpretability and hallucination control; reproducible benchmarks plus clinician validation would strengthen its utility as a workload-reducing tool.

major comments (1)

[Abstract and Evaluation] The manuscript asserts SOTA benchmark results and 12-60% physician improvement but supplies no details on the specific diagnostic benchmarks, dataset sizes/composition, baseline methods, statistical tests, or controls for selection bias in the clinician study. These omissions are load-bearing for the central performance claims, as they prevent verification of whether the data actually support the stated superiority.

minor comments (2)

[Methods] The description of the router mechanism and knowledge-enhancement components would benefit from a high-level diagram or pseudocode to clarify how routing decisions are made across modalities.
[Clinical Validation] Clarify the exact number of clinicians, cases, and blinding procedures in the Yale/Duke-NUS collaboration study to allow readers to assess the practical utility results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our performance claims. We agree that the abstract and evaluation sections require substantial expansion to allow verification of the reported results, and we will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract and Evaluation] The manuscript asserts SOTA benchmark results and 12-60% physician improvement but supplies no details on the specific diagnostic benchmarks, dataset sizes/composition, baseline methods, statistical tests, or controls for selection bias in the clinician study. These omissions are load-bearing for the central performance claims, as they prevent verification of whether the data actually support the stated superiority.

Authors: We agree that the submitted manuscript does not include sufficient methodological details to substantiate the SOTA benchmark results or the 12-60% diagnostic improvement over physicians. In the revised version we will expand the Evaluation section with: (1) explicit names and descriptions of all diagnostic benchmarks together with their dataset sizes, class distributions, and sources; (2) a complete list of baseline methods and their implementations; (3) the statistical tests used (including exact p-values and confidence intervals); and (4) a dedicated subsection on the clinician study that describes case selection procedures, randomization, blinding, and any other controls for selection bias. We will also add supplementary tables containing the raw performance numbers and workload metrics. These additions will directly address the load-bearing omissions noted by the referee.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents Hygieia as an empirical AI system whose central claims (SOTA benchmark performance and 12-60% diagnostic improvement over physicians) rest on external validation against held-out test sets, ground-truth labels, and independent clinician assessments from Yale and Duke-NUS. No load-bearing step reduces a claimed result to a quantity defined by the model's own fitted parameters, self-citations, or ansatz. The router-based framework and knowledge enhancement are described as architectural choices evaluated on separate data; performance numbers are not constructed by re-using the same fitted quantities as both input and output. Self-citations, if present, are not invoked to justify uniqueness or forbid alternatives in a way that collapses the argument. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the contribution is the overall system description rather than explicit mathematical components.

pith-pipeline@v0.9.0 · 5562 in / 1125 out tokens · 64275 ms · 2026-05-12T03:32:35.371517+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Passage: "Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination... self-verification mechanism... confidence estimation through majority voting"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Passage: "We train a classifier-based router... verification agent... J-cost or golden-ratio structures never appear"
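
The first quoted passage mentions confidence estimation through majority voting. A minimal sketch of one common form of that idea follows: sample several candidate diagnoses, return the most frequent one, and report its vote share as the confidence. Whether Hygieia computes its scores this way is not stated on this page; the function name and the normalization step (lowercasing and trimming whitespace) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of votes it received.

    Answers are compared after lowercasing and stripping whitespace, which is
    an illustrative normalization; real disease names would more likely be
    mapped to ontology identifiers before voting.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    top_answer, votes = counts.most_common(1)[0]
    return top_answer, votes / len(normalized)
```

In this sketch, a tie goes to the answer that appeared first, because `Counter.most_common` preserves first-seen order among equal counts. A vote share is also not a calibrated probability, so it would need checking against held-out outcomes before being shown to clinicians as a confidence score.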

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
