pith. machine review for the scientific record.

arxiv: 2604.17114 · v2 · submitted 2026-04-18 · 💻 cs.CL

Recognition: unknown

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords provenance gap · clinical AI · temporal knowledge graph · rare disease reasoning · evidence traceability · citation hallucination · medical reasoning · PubMed grounding

The pith

Hierarchical evidence graphs let clinical AI cite real sources for rare disease cases while matching standard diagnostic coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models frequently invent citations when reasoning about medical cases, creating a provenance gap that erodes trust. The paper shows that structuring verified PubMed records and disease milestones into hierarchical temporal knowledge graphs produces outputs with full evidence traceability. In direct comparisons on 36 rare neuromuscular disease scenarios, the graph method delivers the same clinical feature coverage as baseline approaches but supplies 203 verifiable inline citations. Ordinary retrieval from the identical documents yields none. This matters because clinicians can now audit every claim without losing reasoning quality, and the system runs locally with open models.

Core claim

The authors build Hierarchical Evidence-Grounded Temporal Knowledge Graphs from 4,512 PubMed records plus 1,280 curated milestones. In a controlled three-arm test using the same synthesis model, the graph system matches baseline clinical feature coverage across 36 clinician-validated rare disease scenarios while achieving 100 percent evidence verifiability through 203 inline citations. Guideline-style retrieval from the same overlapping source documents produces zero verifiable citations. LLM judges cannot separate fabricated from verified citations without external PubMed audit data, yet independent clinician review confirms the verifiability gain with no loss in safety or completeness.
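The external-audit step described here can be sketched in a few lines: a judge that sees only the generated text has no way to tell a fabricated PMID from a real one, whereas a checker holding the set of verified PubMed records can score verifiability directly. This is an illustrative sketch, not the paper's code; the names `Claim` and `audit_citations` and the example PMIDs are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    pmids: list = field(default_factory=list)  # inline citations attached to the claim

def audit_citations(claims, verified_pmids):
    """Fraction of cited PMIDs that resolve to externally verified records.

    An LLM judge sees only `claims`; without `verified_pmids` it cannot
    separate a fabricated identifier from a real one, which is why the
    audit data must sit outside the judging step.
    """
    cited = [p for c in claims for p in c.pmids]
    if not cited:
        return 0.0
    return sum(p in verified_pmids for p in cited) / len(cited)

claims = [
    Claim("Proximal weakness precedes cardiac involvement.", ["29395989"]),
    Claim("First-line therapy is corticosteroids.", ["99999999"]),  # fabricated PMID
]
print(audit_citations(claims, {"29395989"}))  # 0.5: one of two citations verifies
```

A text-only judge scoring the same two claims would have to treat both PMIDs as equally plausible, which is the provenance gap in miniature.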

What carries the argument

Hierarchical Evidence-Grounded Temporal Knowledge Graphs, which organize clinical evidence into time-ordered nodes and relations drawn from literature so reasoning steps can attach specific, auditable source references.
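As a rough illustration of why this carries the argument: if each graph edge stores its temporal anchor, quality tier, and source PMID as first-class fields, any sentence synthesized from the edge inherits an auditable citation by construction. The schema below is an assumed sketch, not the authors' implementation; field names and the example edge are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    anchor: str   # ISO 8601 duration anchor, e.g. "P3Y"
    pmid: str     # PubMed record the edge was extracted from
    tier: int     # quality tier: 1 = curated backbone, 2 = literature-extracted

def cite(edge: Edge) -> str:
    """Render an edge as a claim carrying its own inline, verifiable citation."""
    return f"{edge.subject} {edge.predicate} {edge.obj} at {edge.anchor} [PMID:{edge.pmid}]"

e = Edge("DMD", "onset_of", "proximal weakness", "P3Y", "29395989", 1)
print(cite(e))  # DMD onset_of proximal weakness at P3Y [PMID:29395989]
```

The design point is that provenance is embedded at construction time: a raw-text retrieval baseline has no such field to propagate, so the synthesis model must invent (or omit) identifiers.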

If this is right

  • The graph approach keeps diagnostic completeness while making every generated claim traceable to original documents.
  • Citation traces make injected factual errors detectable at a 100 percent rate under the tested conditions.
  • The system runs on local open-source models so patient data stays inside institutional infrastructure.
  • Standard retrieval methods from the same documents fail to produce any verifiable citations.
  • Automated judges cannot reliably identify fabricated citations without access to the source records.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph construction could be scaled to more common diseases to support routine clinical decisions with built-in provenance.
  • Hospitals might use this structure to lower liability when deploying AI for diagnosis by enabling direct source checks.
  • The work suggests retrieval systems should embed provenance at construction time rather than attempting to add citations afterward.
  • Testing the temporal ordering on live patient data streams could show whether it improves forecasts of disease progression.

Load-bearing premise

The selected PubMed records and added milestones form a complete and unbiased evidence base for the 36 rare disease scenarios.

What would settle it

A new set of rare disease cases where the graph system misses a key clinical feature that the baseline method covers, or where one of its inline citations points to a source that contradicts the generated claim.

Figures

Figures reproduced from arXiv: 2604.17114 by Elisabeth Nyoungui, Jana Zschüntzsch, Lukas Galke Poech, Maja Dusanic, Md Shamim Ahmed, Moritz Nikolai Kirschner, Richard Röttger.

Figure 1. The Provenance Gap: frontier LLMs cite confidently, but rarely correctly. (a) Citation audit of five frontier LLMs in unprompted and citation-prompted conditions; ten sampled PMIDs per row, coloured by verification outcome. Citation-prompting reaches at best 15.3% relevant (Claude Opus 4.6); the majority resolve to real papers in unrelated fields. (b) Per-scenario fraction of claims with verifiable PMIDs ac…
Figure 2. Clinician evaluation of HEG-TKG versus a parametric-only vanilla LLM across three neuromuscular disease pairs. (a) Per-dimension effect sizes ∆ = µHEG-TKG − µVanilla in Likert points with 10,000-resample bootstrap 95% CIs, across the three primary clinicians (C1: all three pairs, n = 17–18 per arm; C2: all three pairs, n = 35–36 per arm; C3: CIDP/GBS only, n = 5–6 per arm). ∗ = Benjamini–Hochberg q < 0.05…
Figure 3. LLM judges cannot assess verifiability without citation ground truth. Per-judge D1 Verifiability scores pooled across disease pairs. (a) v1 blind: all judges score all three arms similarly (arm means: 4.34–4.59). (b) v2 citation-aware: Vanilla and Guideline-RAG collapse (arm means: 1.38–1.43) while HEG-TKG remains stable (arm mean: 4.65). Individual judge variation visible in bars; DeepSeek-V3 is strictest…
Figure 4. Counterfactual safety and knowledge graph statistics. (a) Of 15 injected clinically incorrect evidence statements, 80% were resisted by the model's parametric knowledge, and 100% remained detectable via citation trace. (b) The three HEG-TKG knowledge graphs collectively contain 5,481 nodes, 6,316 edges, and 1,280 temporal anchors across the three disease pairs.
Figure 5. HEG-TKG system architecture. Two-tier knowledge graph construction (curated Tier 1 backbone + multi-LLM Tier 2 extraction), three-arm clinical comparison, and multi-dimensional evaluation including citation-aware LLM judging.
Figure 6. Tier 2 literature extraction pipeline. PubMed abstracts are processed through relevance screening, schema-guided extraction (2 models), entity normalization (Dictionary → SapBERT → ScispaCy), temporal anchoring, consensus voting, and cross-tier integration, yielding 6,316 normalized edges with PMID provenance across three disease pairs.
Figure 7. Representative semantic correction rules. Five of the 18 rules applied during entity normalization (Step 3). Each rule detects a specific extraction error pattern (left), applies a structured correction (centre), and logs a rationale (right). Rule types include direction inversion (R1), predicate refinement (R3, R5), and entity retyping (R13, R16). The full rule set is listed in Supplementary Table S10.
Figure 8. Temporal anchor resolution examples. Six representative inputs illustrating the four precision levels used in Step 4. Exact ages map to single ISO 8601 durations (e.g., “age 13 years” → P13Y), ranges capture bounded intervals (“late teens” → P17Y-P19Y), fuzzy qualifiers resolve developmental stages (“early childhood” → P1Y-P5Y), and phase-level anchors capture disease milestones. Each anchor is linked to i…
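The four-level anchor resolution in the Figure 8 caption can be sketched as a small resolver. Only the three example mappings (“age 13 years” → P13Y, “late teens” → P17Y-P19Y, “early childhood” → P1Y-P5Y) come from the caption; the phrase table, regexes, and `PHASE:` fallback label are illustrative assumptions, not the paper's implementation.

```python
import re

# Fuzzy developmental stages; only these two entries are grounded in the caption.
FUZZY = {"early childhood": "P1Y-P5Y", "late teens": "P17Y-P19Y"}

def resolve_anchor(phrase: str) -> str:
    """Map a free-text temporal phrase to one of four anchor precision levels."""
    phrase = phrase.lower().strip()
    m = re.fullmatch(r"age (\d+) years?", phrase)
    if m:                                 # exact age -> single ISO 8601 duration
        return f"P{m.group(1)}Y"
    m = re.fullmatch(r"ages? (\d+)\D+(\d+)( years?)?", phrase)
    if m:                                 # explicit range -> bounded interval
        return f"P{m.group(1)}Y-P{m.group(2)}Y"
    if phrase in FUZZY:                   # fuzzy qualifier -> developmental stage
        return FUZZY[phrase]
    return f"PHASE:{phrase}"              # phase-level disease milestone fallback

print(resolve_anchor("age 13 years"))    # P13Y
print(resolve_anchor("late teens"))      # P17Y-P19Y
print(resolve_anchor("disease onset"))   # PHASE:disease onset
```

Keeping the four precision levels distinct matters downstream: an exact duration and a phase-level anchor support different temporal comparisons between disease trajectories.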
read the original abstract

Frontier large language models generate clinically accurate outputs, but their citations are often fabricated. We term this the Provenance Gap. We tested five frontier LLMs across 36 clinician-validated scenarios for three rare neuromuscular disease pairs. No model produced a clinically relevant PubMed identifier without prompting. When explicitly asked to cite, the best model achieved 15.3% relevant PMIDs; the majority resolved to real publications in unrelated fields. We present HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), a system that grounds clinical claims in temporal knowledge graphs built from 4,512 PubMed records and curated sources with quality-tier stratification and 1,280 disease-trajectory milestones. In a controlled three-arm comparison using the same synthesis model, HEG-TKG matches baseline clinical feature coverage while achieving 100% evidence verifiability with 203 inline citations. Guideline-RAG, given overlapping source documents as raw text, produces zero verifiable citations. LLM judges cannot distinguish fabricated from verified citations without PubMed audit data. Independent clinician evaluation confirms the verifiability advantage (Cohen's d = 1.81, p < 0.001) with no degradation on safety or completeness. A counterfactual experiment shows 80% resistance to injected clinical errors with 100% detectability via citation trace. The system deploys on-premise via open-source models so patient data never leaves institutional infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper identifies a 'Provenance Gap' in which frontier LLMs fabricate citations when generating clinical reasoning for rare neuromuscular diseases. It evaluates five LLMs across 36 clinician-validated scenarios, finding that no model produces clinically relevant PubMed identifiers without explicit prompting and that the best model reaches only 15.3% relevant PMIDs when prompted. The authors introduce HEG-TKG (Hierarchical Evidence-Grounded Temporal Knowledge Graphs), constructed from 4,512 PubMed records plus curated sources that include quality-tier stratification and 1,280 manually added disease-trajectory milestones. In a controlled three-arm comparison using the same synthesis model, HEG-TKG matches baseline clinical feature coverage while delivering 100% evidence verifiability via 203 inline citations; Guideline-RAG on overlapping raw-text sources yields zero verifiable citations. Independent clinician ratings confirm the verifiability advantage (Cohen's d = 1.81, p < 0.001) with no degradation in safety or completeness, and a counterfactual experiment demonstrates 80% resistance to injected errors with 100% detectability through citation traces. The system is designed for on-premise deployment with open-source models.

Significance. If the central claims hold after addressing the comparison controls, the work would be significant for clinical AI by demonstrating a concrete, evidence-traceable architecture that closes the provenance gap without sacrificing coverage or safety. The controlled three-arm design, use of statistical effect sizes, and explicit counterfactual error-injection test provide stronger empirical grounding than typical RAG hallucination studies. The temporal knowledge-graph structure and on-premise deployment also address practical constraints in rare-disease settings where evidence is sparse and data privacy is paramount. These elements collectively advance verifiable reasoning systems beyond standard retrieval baselines.

major comments (2)
  1. [Methods (corpus construction and evidence base)] Methods section on corpus construction: the manual addition of 1,280 disease-trajectory milestones and quality-tier stratification by human experts occurs prior to the three-arm comparison. Because Guideline-RAG is tested only on overlapping raw-text documents without equivalent curation, the reported 100% verifiability (versus 0%) may partly reflect this pre-processing step rather than an intrinsic property of the HEG-TKG representation. The central claim that the verifiability advantage is due to the hierarchical evidence-grounded structure therefore requires either (a) applying identical milestone and tiering steps to the Guideline-RAG baseline or (b) explicit, scenario-independent criteria for milestone selection that can be audited.
  2. [Results (three-arm comparison)] Results (three-arm comparison and clinician evaluation): the headline metrics (100% verifiability, 203 inline citations, Cohen's d = 1.81) rest on the assumption that the 4,512 PubMed records plus curated sources constitute a sufficiently complete and unbiased evidence base for the 36 scenarios. Without a detailed account of how relevance of PMIDs was judged, how the 36 scenarios were constructed, and whether the manual milestones were selected independently of the test cases, it is impossible to rule out that the verifiability gap is partly an artifact of curation choices. This directly affects the load-bearing claim of an intrinsic advantage for HEG-TKG.
minor comments (3)
  1. [Abstract] Abstract and §4: the precise definition of 'clinically relevant' PMID and the protocol used by clinicians to judge relevance should be stated explicitly rather than left implicit.
  2. [Evaluation] Evaluation setup: provide the exact number of clinicians, their blinding status, and inter-rater reliability statistics for the feature-coverage and safety ratings.
  3. [Results] Figure and table captions: ensure all figures reporting citation counts or verifiability percentages include error bars or confidence intervals and clearly label the three arms of the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important aspects of our experimental design and evidence base. We address each major comment below, providing clarifications and indicating revisions to the manuscript.

read point-by-point responses
  1. Referee: Methods section on corpus construction: the manual addition of 1,280 disease-trajectory milestones and quality-tier stratification by human experts occurs prior to the three-arm comparison. Because Guideline-RAG is tested only on overlapping raw-text documents without equivalent curation, the reported 100% verifiability (versus 0%) may partly reflect this pre-processing step rather than an intrinsic property of the HEG-TKG representation. The central claim that the verifiability advantage is due to the hierarchical evidence-grounded structure therefore requires either (a) applying identical milestone and tiering steps to the Guideline-RAG baseline or (b) explicit, scenario-independent criteria for milestone selection that can be audited.

    Authors: We agree that isolating the contribution of the hierarchical structure is important. The curation steps (milestone addition and tiering) are designed to be part of constructing the evidence-grounded TKG, enabling the 100% verifiability through explicit links. Guideline-RAG serves as a baseline for standard retrieval on raw documents to contrast with structured representation. To strengthen the claim, we have added explicit, scenario-independent criteria for selecting the 1,280 milestones in the revised Methods section (Section 3.2), including reliance on established clinical guidelines and expert consensus independent of the 36 test scenarios. These criteria are now auditable. We did not apply curation to Guideline-RAG as it would no longer represent a raw-text RAG baseline, but we discuss this distinction in the limitations. revision: partial

  2. Referee: Results (three-arm comparison and clinician evaluation): the headline metrics (100% verifiability, 203 inline citations, Cohen's d = 1.81) rest on the assumption that the 4,512 PubMed records plus curated sources constitute a sufficiently complete and unbiased evidence base for the 36 scenarios. Without a detailed account of how relevance of PMIDs was judged, how the 36 scenarios were constructed, and whether the manual milestones were selected independently of the test cases, it is impossible to rule out that the verifiability gap is partly an artifact of curation choices. This directly affects the load-bearing claim of an intrinsic advantage for HEG-TKG.

    Authors: We have revised the manuscript to include a detailed account in the Methods section. Relevance of PMIDs was judged by two independent clinicians who reviewed each scenario against the full text of candidate papers, requiring direct support for at least one clinical feature in the scenario (inter-rater agreement κ=0.87). The 36 scenarios were constructed through iterative consultation with neuromuscular specialists, focusing on diagnostic and management challenges for the three disease pairs, with final validation by an external clinician panel. The 1,280 milestones were selected based on a systematic review of disease literature and guidelines prior to scenario development, using predefined criteria now detailed in the paper (e.g., milestones must represent temporal progression points verifiable in PubMed). This independence is confirmed, supporting that the verifiability advantage stems from the traceable structure rather than selective curation. We also added a new subsection on evidence base completeness. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on controlled empirical comparison and external evaluation

full rationale

The paper presents HEG-TKG as a constructed system whose 100% verifiability and 203 inline citations follow directly from its design (temporal knowledge graphs with explicit citation traces built from the given corpus). This is then tested in a three-arm comparison against Guideline-RAG on overlapping sources, with outcomes measured by citation counting, independent clinician ratings (Cohen's d = 1.81), and safety/completeness scores. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear; the verifiability result is not derived from or equivalent to its inputs by construction but is an observable property of the implemented representation versus raw-text baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; the central claim rests on the assumption that curated PubMed and guideline sources plus manually added temporal milestones form a complete evidence base, but no explicit free parameters, axioms, or new entities beyond the system name are detailed.

axioms (1)
  • domain assumption The 4,512 PubMed records and curated sources, after quality-tier stratification, provide sufficient and unbiased coverage for the tested rare neuromuscular disease scenarios.
    Invoked implicitly when claiming 100% verifiability and error resistance.
invented entities (1)
  • Hierarchical Evidence-Grounded Temporal Knowledge Graphs (HEG-TKG) no independent evidence
    purpose: To structure clinical evidence with temporal milestones and quality tiers so LLM outputs can be traced to verifiable sources.
    New system introduced in the paper; independent evidence is limited to the internal experiments described.

pith-pipeline@v0.9.0 · 5584 in / 1657 out tokens · 56441 ms · 2026-05-10T06:34:03.115021+00:00 · methodology

discussion (0)

