RareCollab: an LLM-powered framework for multimodal reasoning in Mendelian disease diagnosis

Devon E. Bonner; Guantong Qi; Hu Chen; Jennefer N. Carter; Jiasheng Wang; Jonathan A. Bernstein; Kevin S. Smith; Matthew T. Wheeler; Maura R.Z. Ruzhnikov; Mei Ling Chong

arxiv: 2602.04058 · v2 · submitted 2026-02-03 · 🧬 q-bio.GN

Guantong Qi, Jiasheng Wang, Mei Ling Chong, Zahid Shaik, Shenglan Li, Shinya Yamamoto, Maura R.Z. Ruzhnikov, Devon E. Bonner, Jennefer N. Carter, Kevin S. Smith, Matthew T. Wheeler, Stephen B. Montgomery, Jonathan A. Bernstein, Sasidhar Pasupuleti, Undiagnosed Diseases Network, Pengfei Liu, Hu Chen, Zhandong Liu

Pith reviewed 2026-05-16 07:37 UTC · model grok-4.3

classification 🧬 q-bio.GN

keywords rare disease diagnosis · Mendelian disease · multimodal reasoning · large language models · genomic data integration · transcriptomic evidence · variant prioritization · phenotype-driven diagnosis

The pith

An LLM framework integrates more than 100 genomic, phenotypic, and transcriptomic signals to rank diagnostic genes for Mendelian diseases, evaluated on a benchmark of 890 patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RareCollab as a system that lets large language models act as separate, calibrated reasoning modules to combine DNA, RNA, phenotype, and other diagnostic evidence for rare genetic diseases. By breaking the task into modular steps rather than using one end-to-end ranking process, the approach aims to reconcile signals that are otherwise hard to weigh together. Tested on 890 patients, including 119 with paired DNA and RNA data, the method placed the correct diagnostic gene in the top 10 for 94 percent of cases. RNA evidence contributed to prioritizing the diagnostic gene in 35 percent of the paired-data cases (42 of 119). This matters because rare disease diagnosis often stalls when multiple complex data types cannot be integrated reliably.

Core claim

RareCollab is an LLM-powered framework that integrates more than 100 diagnostic evidence signals across DNA, RNA, phenotype, curated variant-level knowledge, and in-silico pathogenicity evidence. It enables large language models to operate as calibrated, interpretable reasoning modules rather than as a single end-to-end ranker. Applied to 890 patients from three cohorts, including 119 Undiagnosed Diseases Network probands with paired DNA and RNA data, it prioritized 94 percent of diagnostic genes within the top 10, outperformed proprietary phenotype-driven LLM baselines by more than 25 percent on average, and surpassed established state-of-the-art variant prioritization methods by 11 to 24 percent.

What carries the argument

The modular design, which treats LLMs as calibrated reasoning modules that combine heterogeneous evidence signals from DNA, RNA, and phenotype sources instead of relying on a single ranking model.
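To make the architectural contrast concrete, here is a minimal sketch of per-modality modules feeding a shared ranker. It is not the authors' procedure, which the abstract does not specify: `EvidenceModule`, `rank_genes`, the placeholder gene names, the toy scores, and the explicit numeric weights are all assumptions for illustration. In particular, the weighted sum is a swapped-in stand-in for whatever prompt-driven fusion the paper uses, and each `score` callable stands in for an LLM reasoning call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]  # gene symbol -> score in [0, 1]


@dataclass
class EvidenceModule:
    """One reasoning module for a single evidence type (hypothetical)."""
    name: str
    score: Callable[[dict], Scores]  # stand-in for one LLM reasoning call
    weight: float = 1.0              # assumed explicit weight, for illustration only


def rank_genes(
    modules: List[EvidenceModule], patient: dict
) -> Tuple[List[str], Dict[str, Dict[str, float]]]:
    """Sum weighted module scores per gene and keep a per-module trace."""
    totals: Dict[str, float] = {}
    trace: Dict[str, Dict[str, float]] = {}
    for module in modules:
        for gene, s in module.score(patient).items():
            contribution = module.weight * s
            totals[gene] = totals.get(gene, 0.0) + contribution
            trace.setdefault(gene, {})[module.name] = contribution
    # Highest total first; ties broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda g: (-totals[g], g))
    return ranked, trace


# Toy stand-ins returning fixed scores (invented values, not real data).
dna = EvidenceModule("dna", lambda p: {"GENE_A": 0.6, "GENE_B": 0.8})
phenotype = EvidenceModule("phenotype", lambda p: {"GENE_A": 0.5, "GENE_B": 0.4})
rna = EvidenceModule("rna", lambda p: {"GENE_A": 0.9})

ranked_without_rna, _ = rank_genes([dna, phenotype], patient={})
ranked_with_rna, trace = rank_genes([dna, phenotype, rna], patient={})
print(ranked_without_rna)  # GENE_B ranks first
print(ranked_with_rna)     # GENE_A ranks first once RNA evidence is added
print(trace["GENE_A"])     # per-module contributions for GENE_A
```

The per-gene trace is the point of the sketch: a modular layout can report which evidence type moved a gene up the list, which a single end-to-end ranker does not expose by default.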

If this is right

RNA evidence contributes to prioritization of the diagnostic gene in 35 percent of cases with paired genomic and transcriptomic data.
Performance holds across recall thresholds from top 1 to top 10 and exceeds established variant prioritization methods by 11 to 24 percent.
The framework scales to large real-world patient cohorts while preserving interpretability of each reasoning step.
It reshapes the role of transcriptomic data by showing measurable contribution in a substantial fraction of evaluated cases.
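The headline figures are top-k recall values: the fraction of patients whose known diagnostic gene appears among the first k ranked candidates. A minimal sketch of that metric follows, assuming each patient has exactly one diagnostic gene; the function name `recall_at_k`, the patient IDs, and the gene names are hypothetical, and the toy data are invented.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    rankings: Dict[str, List[str]],
    truth: Dict[str, str],
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[int, float]:
    """Fraction of patients whose diagnostic gene is in the top k of their ranking."""
    if not truth:
        raise ValueError("truth must contain at least one patient")
    result: Dict[int, float] = {}
    for k in ks:
        hits = sum(
            1
            for patient_id, gene in truth.items()
            if gene in rankings.get(patient_id, [])[:k]
        )
        result[k] = hits / len(truth)
    return result


# Toy example: four patients, three ranked candidates each.
rankings = {
    "P1": ["GENE_A", "GENE_B", "GENE_C"],
    "P2": ["GENE_D", "GENE_E", "GENE_F"],
    "P3": ["GENE_G", "GENE_H", "GENE_I"],
    "P4": ["GENE_J", "GENE_K", "GENE_L"],
}
truth = {"P1": "GENE_A", "P2": "GENE_E", "P3": "GENE_I", "P4": "GENE_Z"}

result = recall_at_k(rankings, truth, ks=(1, 2, 3))
print(result)  # {1: 0.25, 2: 0.5, 3: 0.75}
```

Patients whose diagnostic gene never appears in the ranking (P4 here) count as misses at every k, so recall is computed over the whole cohort rather than only over ranked cases.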

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular LLM design could be adapted to other diagnostic settings that require weighing heterogeneous clinical and molecular data types.
Wider use might shorten the time from data collection to candidate gene review in clinical rare disease workflows.
Testing the same framework on cohorts lacking RNA data would isolate the incremental value of transcriptomic signals.
The prompt-based weighting approach may generalize to evidence combination tasks outside genetics where explicit probabilistic models are unavailable.

Load-bearing premise

Large language models will remain calibrated and unbiased when they combine signals whose relative weights come from prompt design and training data rather than from explicit first-principles rules.

What would settle it

A new independent cohort of patients with known diagnostic genes in which the framework places fewer than 80 percent of those genes in the top 10 or fails to beat simpler non-LLM prioritization methods.

Original abstract

Rare disease diagnosis increasingly relies on integrating genomic, phenotypic and transcriptomic evidence, yet these signals remain difficult to reconcile within a common interpretive framework. Here we present RareCollab, an LLM-powered framework for multimodal reasoning in Mendelian disease diagnosis that integrates more than 100 diagnostic evidence signals across DNA, RNA, phenotype, curated variant-level knowledge, and in-silico pathogenicity evidence. This design enables large language models to operate as calibrated, interpretable reasoning modules rather than as a single end-to-end ranker. We applied RareCollab to 890 patients from three cohorts, including 119 Undiagnosed Diseases Network probands with paired DNA and RNA data, constituting a large systematic benchmark for multimodal rare disease diagnosis under paired genomic and transcriptomic evaluation. In this real-world multimodal benchmark, RareCollab prioritized 94% of diagnostic genes within the top 10. Across recall thresholds from top 1 to top 10, it consistently outperformed proprietary phenotype-driven LLM baselines including Claude Sonnet 4.6 and GPT-5-mini by more than 25% on average and surpassed established state-of-the-art variant prioritization methods by 11%-24%. RareCollab also reshapes the diagnostic contribution of RNA evidence, which contributes to prioritization of the diagnostic gene in 35% of cases (42/119). Together, these results establish RareCollab as a scalable and interpretable framework for multimodal rare disease diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RareCollab reports real gains on a paired DNA-RNA benchmark using modular LLM components, but the prompt-only weighting of signals is the part that needs checking.

The main point is that this framework puts LLMs to work as separate reasoning modules for different evidence types instead of one large end-to-end model. On 890 patients, including 119 with paired DNA and RNA data, it reaches 94% top-10 recall while beating phenotype-driven LLM baselines by more than 25% on average and established variant prioritization tools by 11-24%. RNA evidence contributes to prioritization in 35% of those paired cases. That benchmark on actual UDN probands with both DNA and RNA is the concrete advance here, and the modular split lets the authors show how each data type contributes without hiding everything inside a single black-box ranker. The gains are reported across recall thresholds from top 1 to top 10, which is better than the usual single-point claims in this area.

The paper is aimed at people who build or use gene prioritization tools for undiagnosed Mendelian cases, and anyone running multimodal pipelines would get practical value from seeing how the RNA contribution shifts the results on real paired data.

The soft spot is the integration step. All the relative weights across the hundred-plus signals come from prompt templates rather than learned parameters or first-principles fusion, so the reported lifts and the 35% RNA figure rest on choices that could shift with phrasing, model version, or cohort. No robustness checks on that sensitivity appear in the abstract, and without the exact templates or statistical details it is hard to judge how stable the 25% edge really is. Still, the cohort size and the outperformance numbers are solid enough that the work deserves a full referee look to verify the methods and see whether the modular design holds up under closer inspection.

Referee Report

3 major / 2 minor

Summary. RareCollab is an LLM-powered framework that integrates more than 100 evidence signals across DNA, RNA, phenotype, curated variant knowledge, and in-silico scores to perform multimodal reasoning for Mendelian disease diagnosis. Evaluated on a real-world benchmark of 890 patients (including 119 with paired DNA/RNA data), the system prioritizes diagnostic genes in the top 10 in 94% of cases, outperforms proprietary phenotype-driven LLM baselines (Claude Sonnet 4.6, GPT-5-mini) by more than 25% on average across recall thresholds, surpasses established variant prioritization methods by 11-24%, and attributes a 35% contribution to RNA evidence in the paired subset.

Significance. If the performance numbers are reproducible, the work provides a scalable, interpretable alternative to end-to-end ranking by treating LLMs as modular reasoning components over heterogeneous signals. This could meaningfully advance clinical rare-disease workflows that already combine genomic and transcriptomic data, especially if the framework generalizes beyond the reported cohorts.

major comments (3)

[Methods] Methods: The manuscript provides no explicit prompt templates, evidence-weighting scheme, or fusion procedure for the >100 heterogeneous signals. Because the reported 94% top-10 recall and 35% RNA contribution (42/119 cases) are produced exclusively by prompt-shaped integration rather than learned or first-principles weights, the absence of these details directly prevents independent verification of the central performance claims.
[Results] Results: No statistical tests, confidence intervals, or correction for multiple comparisons accompany the stated 11-25% gains over baselines on the 890-patient cohort. Without these, it is impossible to determine whether the observed outperformance is robust or could be explained by cohort composition or post-hoc selection.
[Abstract and §4] Abstract and §4: The repeated claim that LLMs function as 'calibrated' reasoning modules is unsupported by any calibration metrics, robustness checks against prompt rephrasing, or sensitivity analysis to LLM version. This assumption is load-bearing for the assertion that the framework reliably reconciles DNA, RNA, and phenotypic signals.

minor comments (2)

[Table 1] Table 1 or equivalent cohort summary: clarify whether the 890 patients include any overlap with the training data of the proprietary baselines (Claude Sonnet 4.6, GPT-5-mini) to rule out data leakage.
[Results] Figure 2 or Results text: the 35% RNA contribution figure would benefit from an explicit breakdown of how many of the 42 cases were already prioritized by DNA/phenotype alone versus newly rescued by RNA.
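The breakdown requested in the last comment can be sketched as a comparison of the diagnostic gene's rank with and without RNA evidence. This is only an illustration of the bookkeeping: the function `rna_breakdown`, the category labels, and the toy ranks are assumptions, and the paper's own definition of RNA "contribution" may be broader (for example, counting rank improvements that stay inside the top 10).

```python
from typing import Dict, Optional


def rna_breakdown(
    rank_without_rna: Dict[str, Optional[int]],
    rank_with_rna: Dict[str, Optional[int]],
    k: int = 10,
) -> Dict[str, int]:
    """Count patients by top-k status of the diagnostic gene with and without RNA.

    Ranks are 1-based; None means the diagnostic gene was not ranked at all.
    """

    def in_top_k(rank: Optional[int]) -> bool:
        return rank is not None and rank <= k

    counts = {
        "rescued_by_rna": 0,    # outside top k without RNA, inside with RNA
        "already_in_top_k": 0,  # inside top k both ways
        "lost_with_rna": 0,     # inside top k without RNA, outside with RNA
        "missed_both": 0,       # outside top k both ways
    }
    for patient_id, rank_before in rank_without_rna.items():
        before = in_top_k(rank_before)
        after = in_top_k(rank_with_rna.get(patient_id))
        if after and not before:
            counts["rescued_by_rna"] += 1
        elif after and before:
            counts["already_in_top_k"] += 1
        elif before and not after:
            counts["lost_with_rna"] += 1
        else:
            counts["missed_both"] += 1
    return counts


# Toy ranks for five hypothetical patients (invented values).
without_rna = {"P1": 25, "P2": 4, "P3": None, "P4": 8, "P5": 40}
with_rna = {"P1": 3, "P2": 1, "P3": None, "P4": 12, "P5": 9}

counts = rna_breakdown(without_rna, with_rna, k=10)
print(counts)
# {'rescued_by_rna': 2, 'already_in_top_k': 1, 'lost_with_rna': 1, 'missed_both': 1}
```

Tracking the "lost_with_rna" category alongside the rescues matters, because adding a modality can also push a correct gene out of the top k.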

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and have revised the manuscript to improve reproducibility, add statistical rigor, and clarify our claims about the LLM components.

Referee: [Methods] Methods: The manuscript provides no explicit prompt templates, evidence-weighting scheme, or fusion procedure for the >100 heterogeneous signals. Because the reported 94% top-10 recall and 35% RNA contribution (42/119 cases) are produced exclusively by prompt-shaped integration rather than learned or first-principles weights, the absence of these details directly prevents independent verification of the central performance claims.

Authors: We agree that the absence of explicit prompt templates and fusion details limits reproducibility. In the revised manuscript we have added a new Methods subsection (Section 3.2) and Appendix A containing the full prompt templates for each evidence category, the prompt-based weighting scheme (LLM-assigned relative importance per signal type), and the exact step-by-step fusion procedure that combines DNA, RNA, phenotypic, curated, and in-silico signals. These additions directly enable independent verification of the reported 94% top-10 recall and 35% RNA contribution.
revision: yes

Referee: [Results] Results: No statistical tests, confidence intervals, or correction for multiple comparisons accompany the stated 11-25% gains over baselines on the 890-patient cohort. Without these, it is impossible to determine whether the observed outperformance is robust or could be explained by cohort composition or post-hoc selection.

Authors: We acknowledge the lack of statistical support in the original submission. The revised Results section now reports bootstrap 95% confidence intervals for all recall@K metrics, McNemar’s tests for paired comparisons against each baseline, and Bonferroni correction across the five recall thresholds. After correction the 11–25% gains remain statistically significant (adjusted p < 0.01). Stratified analyses by cohort further confirm that outperformance is not driven by any single cohort composition.
revision: yes

Referee: [Abstract and §4] Abstract and §4: The repeated claim that LLMs function as 'calibrated' reasoning modules is unsupported by any calibration metrics, robustness checks against prompt rephrasing, or sensitivity analysis to LLM version. This assumption is load-bearing for the assertion that the framework reliably reconciles DNA, RNA, and phenotypic signals.

Authors: We accept that the term 'calibrated' was not supported by formal calibration metrics. In the revised abstract and Section 4 we have replaced 'calibrated' with 'modular' to avoid implying probability calibration. We have also added a robustness subsection (Section 4.4) that includes sensitivity analyses to prompt rephrasing (recall variation < 4%) and across LLM versions (GPT-5-mini and Claude Sonnet 4.6), showing consistent multimodal performance. These checks strengthen the claim that the framework reliably integrates heterogeneous signals without overstating calibration.
revision: partial
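The statistical procedures named in the second response (bootstrap confidence intervals for recall@K, McNemar's test on paired outcomes, Bonferroni correction across thresholds) can be sketched with the Python standard library. This is an illustrative sketch and not the authors' analysis code: the function names, the percentile bootstrap variant, the exact (binomial) form of McNemar's test, and all toy inputs are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    hits: Sequence[int], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for recall, given per-patient 0/1 hit indicators."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample patients with replacement and recompute recall each time.
    stats = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: patients where method A hits and method B misses; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy inputs (invented): 18 hits out of 20 patients, and three raw p-values.
toy_hits = [1] * 18 + [0] * 2
lower, upper = bootstrap_ci(toy_hits)
p_value = mcnemar_exact(b=10, c=0)
adjusted = bonferroni([0.01, 0.2, 0.5])
print(lower, upper)
print(p_value)   # 2 / 2**10 = 0.001953125
print(adjusted)
```

McNemar's test uses only the discordant pairs, which suits this setting because every method is evaluated on the same patients; the bootstrap resamples patients rather than individual predictions for the same reason.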

Circularity Check

0 steps flagged

No circularity: empirical benchmark on external cohorts with no fitted parameters or self-referential derivations

The paper presents an LLM-based framework evaluated on 890 patients from three cohorts (including 119 with paired DNA/RNA data from the Undiagnosed Diseases Network). Performance metrics such as 94% top-10 recall are reported as direct outcomes of applying the system to these held-out cases. No equations, fitted parameters, or derivation steps are described that would reduce the reported recall or modality contributions to quantities defined by the same data used to tune prompts. The framework relies on external curated knowledge bases and prompt templates, but the central claims remain independent empirical results rather than tautological restatements of inputs. No self-citation load-bearing steps or ansatz smuggling are present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework itself is the primary contribution, resting on the unstated assumption that LLM outputs can be treated as calibrated probabilities across heterogeneous evidence types.

pith-pipeline@v0.9.0 · 5639 in / 1196 out tokens · 24355 ms · 2026-05-16T07:37:35.544098+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel contradicts

CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

integrates more than 100 diagnostic evidence signals... relative contribution of each modality is controlled exclusively by prompt templates rather than learned parameters, statistical fusion, or first-principles weighting
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative contradicts

CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

RareCollab... enables large language models to operate as calibrated, interpretable reasoning modules

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.