arxiv: 2605.22734 · v1 · pith:53LYGPKLnew · submitted 2026-05-21 · 💻 cs.CL

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords biomedical knowledge graph · temporal reasoning · clinical reasoning · knowledge extraction · large language models · rare diseases · retrieval augmented generation · benchmark dataset

The pith

ChronoMedKG adds explicit timing to disease associations so retrieval systems can handle questions that depend on when symptoms appear during a disease course.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing biomedical knowledge graphs treat associations as fixed facts and therefore cannot support clinical reasoning that turns on timing, such as whether a finding at age three points to one diagnosis while the same finding at age thirteen points to another. The authors build ChronoMedKG by running multiple frontier language models over PubMed and PMC literature in a disease-by-disease pipeline, keeping only triples that survive consensus, credibility scoring, and ontology alignment. The resulting graph supplies temporal labels such as onset windows and progression stages for more than thirteen thousand diseases and supplies the first such coverage for thousands of conditions that lacked it. When the graph is used for retrieval, it recovers a much larger share of large-language-model errors on time-sensitive questions than static resources do.

Core claim

ChronoMedKG contains 460,497 evidence-linked triples, each tied to temporal components such as onset window or progression stage and backed by PMID-traceable evidence plus a multi-signal credibility score. It is produced by a disease-autonomous multi-agent pipeline in which several frontier LLMs independently extract relations from the literature; only relations supported by multi-model consensus, credibility filtering, and ontology alignment are retained. The graph adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases, and retrieval using it rescues 47-65 percent of long-tail failures on temporal questions, far more than the 17-29 percent rescued by HPOA-RAG.
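As a rough illustration of what one such triple might look like in code, the sketch below models a single evidence-linked association with an onset window. The class name, field names, PMID placeholder, and credibility value are assumptions made for illustration and are not the paper's schema; only the relation label and the 10 to 18 year cardiomyopathy window are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TemporalTriple:
    """Hypothetical container for one temporally grounded association."""
    subject: str
    relation: str
    obj: str
    onset_window_years: Optional[Tuple[float, float]] = None  # (earliest, latest)
    progression_stage: Optional[str] = None
    pmids: Tuple[str, ...] = ()   # evidence identifiers
    credibility: float = 0.0      # assumed to lie in [0, 1]

    def onset_includes(self, age_years: float) -> bool:
        """True if the age falls inside the recorded onset window (inclusive)."""
        if self.onset_window_years is None:
            return False
        lo, hi = self.onset_window_years
        return lo <= age_years <= hi


triple = TemporalTriple(
    subject="Duchenne muscular dystrophy",
    relation="disease_phenotype_positive",
    obj="cardiomyopathy",
    onset_window_years=(10.0, 18.0),  # window quoted in the Figure 2 caption
    pmids=("PMID_PLACEHOLDER",),      # placeholder, not a real identifier
    credibility=0.9,                  # illustrative value only
)

print(triple.onset_includes(12))  # True
print(triple.onset_includes(3))   # False
```

Treating the window as inclusive at both ends is a design choice of this sketch; the paper may define the boundaries differently.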

What carries the argument

The disease-autonomous multi-agent LLM pipeline with multi-model consensus, credibility filtering, and ontology alignment that extracts and validates temporal clinical associations from PubMed and PMC literature.
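A minimal sketch of the consensus-plus-credibility step, assuming each model returns a set of (subject, relation, object) tuples and that a separate credibility score is already available per triple. The function name, the `min_models` and `min_credibility` parameters, and their default values are hypothetical; the paper's actual thresholds are not stated on this page, and ontology alignment is left out.

```python
from collections import defaultdict


def consensus_filter(extractions, credibility, min_models=2, min_credibility=0.5):
    """Keep triples proposed by at least `min_models` distinct models
    whose credibility score reaches `min_credibility`.

    extractions: dict mapping model name -> iterable of triples
    credibility: dict mapping triple -> float score
    """
    votes = defaultdict(set)
    for model, triples in extractions.items():
        for triple in triples:
            votes[triple].add(model)  # a set, so each model counts once
    return sorted(
        triple
        for triple, models in votes.items()
        if len(models) >= min_models
        and credibility.get(triple, 0.0) >= min_credibility
    )


# Toy inputs: three hypothetical models and two candidate triples.
t1 = ("DMD", "disease_phenotype_positive", "cardiomyopathy")
t2 = ("DMD", "disease_phenotype_positive", "hearing loss")
extractions = {
    "model_a": [t1, t2],
    "model_b": [t1],
    "model_c": [t1],
}
credibility = {t1: 0.8, t2: 0.9}

kept = consensus_filter(extractions, credibility)
print(kept)  # only t1 survives: t2 has a single vote
```

Counting distinct models rather than raw mentions keeps one model from outvoting the others by repeating itself.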

If this is right

  • Large language models drop roughly 30 points when moving from static to temporal clinical questions.
  • Retrieval with ChronoMedKG recovers between 47 and 65 percent of long-tail failures on temporal tasks.
  • The graph supplies temporal coverage for 6,250 diseases that lacked it in HPOA, Orphadata, or Phenopackets.
  • ChronoTQA provides eight task types that isolate temporal clinical reasoning from static controls.
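One plausible reading of the rescue figure is the share of questions a model answers incorrectly without retrieval that it answers correctly with retrieval. The sketch below computes that quantity under this assumption; the paper's exact definition is not given on this page, and the per-question outcomes are made up.

```python
def rescue_rate(baseline_correct, rag_correct):
    """Fraction of baseline failures that become correct once retrieval is added.

    Both arguments are equal-length sequences of booleans, one per question.
    """
    if len(baseline_correct) != len(rag_correct):
        raise ValueError("inputs must cover the same questions")
    failures = [i for i, ok in enumerate(baseline_correct) if not ok]
    if not failures:
        raise ValueError("no baseline failures to rescue")
    rescued = sum(1 for i in failures if rag_correct[i])
    return rescued / len(failures)


# Toy outcomes for six questions: four baseline failures, two of them rescued.
baseline = [True, False, False, False, True, False]
with_rag = [True, True, False, True, True, False]
print(rescue_rate(baseline, with_rag))  # 0.5
```

Note that this metric ignores questions that retrieval turns from correct to incorrect, so it is best read alongside overall accuracy.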

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be rerun periodically to keep the temporal labels current as new literature appears.
  • Electronic health record systems could query the graph to flag when a patient's current presentation deviates from expected disease timelines.
  • The approach of multi-model consensus extraction might be tested on non-biomedical domains that also require time-stamped relations, such as legal or historical event graphs.

Load-bearing premise

The assumption that multiple large language models can reliably converge on accurate temporal medical facts from the scientific literature and that the subsequent filtering steps remove most errors.

What would settle it

A blinded expert review of several hundred randomly sampled triples in which clinicians compare the stated onset window or progression stage against the cited source papers and find agreement below 80 percent.
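To make the 80 percent criterion concrete, an audit of this kind could report a confidence interval on the observed agreement rather than a point estimate alone. The sketch below uses the Wilson score interval; the sample size of 300 and the agreement counts are hypothetical, and the three-way verdict is an illustrative convention rather than anything the paper or this page specifies.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def audit_verdict(agree, n, threshold=0.80):
    """Compare the interval on clinician agreement with the threshold."""
    lo, hi = wilson_interval(agree, n)
    if hi < threshold:
        return "below threshold"
    if lo >= threshold:
        return "at or above threshold"
    return "inconclusive"


for agree in (210, 240, 270):  # hypothetical counts out of 300 sampled triples
    lo, hi = wilson_interval(agree, 300)
    print(agree, round(lo, 3), round(hi, 3), audit_verdict(agree, 300))
```

With a few hundred triples the interval is several points wide, so an observed agreement near 80 percent would not settle the question either way.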

Figures

Figures reproduced from arXiv: 2605.22734 by Farzaneh Firoozbakht, Jan Baumbach, Lukas Galke Poech, Md Shamim Ahmed, Richard Röttger.

Figure 1: ChronoMedKG: from what is associated to when it emerges. (A) Across 17,080 PrimeKG diseases, zero edges carry temporal annotations. (B) 4-agent pipeline ingests ∼2.5M public documents and produces literature-grounded temporal triples with PMID provenance, multi-LLM consensus, and six-signal credibility scoring. (C) Downstream outputs unique to ChronoMedKG. (D) Top-line numbers. Constructed de novo from pub…

Figure 2: Coverage and granularity of disease onset data. (A) ChronoMedKG covers 8,935 of 17,080 PrimeKG diseases (52.3%); 6,250 have onset data absent from every curated gold standard (+18.4 pp over Orphadata’s 33.9%). (B) DMD granularity contrast: Orphadata gives a single 1–5y range; ChronoMedKG records five per-phenotype onsets (walking delay 2–5y, Gowers sign 5–8y, loss of ambulation 8–12y, cardiomyopathy 10–18y…

Figure 3: Example temporal phenotype profile for Duchenne muscular dystrophy. Stages (presymptomatic, ambulatory, non-ambulatory, respiratory) partition the age axis; phenotype bars show the onset window for each clinical feature, colour-coded by system (motor, cardiac, respiratory). The static-KG statement “DMD is associated with cardiomyopathy” is replaced in ChronoMedKG by a stage-aware statement: “cardiomyopathy…

Figure 4: Diagnostic odysseys in four representative PMC cases. Pink bars show the misdiagnosis period (from initial…

Figure 5: Classification of 2,563 ChronoMedKG–Orphadata onset-range disagreements. Only 7.3% are factually incorrect temporal ranges (no overlap, >10y gap from gold). 50.1% are strictly contained within the gold range, and a further 41.8% reflect boundary effects or granularity mismatches (e.g., ChronoMedKG’s “3–8y” vs. Orphadata’s “childhood”), not extraction errors. Percentages sum to 99.2% due to rounding.

Figure 6: Disease trajectory archetypes discovered by unsupervised clustering of 8,935 diseases with temporal features. Four…

Figure 7: Evidence publication year distribution across 455K ChronoMedKG triples with PMID-traceable dates. Median year:…

Figure 8: Mean reciprocal rank for TransE link prediction on disease–phenotype edges (3 random seeds; error bars are std).
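The per-phenotype onsets quoted in the Figure 2 caption lend themselves to a small worked lookup. The sketch below encodes the four windows that survive in the truncated caption (the fifth is cut off and is omitted here) and asks which onset windows are open at a given age. The dictionary and function names are illustrative, and treating the windows as inclusive is an assumption.

```python
# Onset windows in years for Duchenne muscular dystrophy, as quoted in the
# Figure 2 caption. The caption lists five; only four are legible on this page.
DMD_ONSET_YEARS = {
    "walking delay": (2, 5),
    "Gowers sign": (5, 8),
    "loss of ambulation": (8, 12),
    "cardiomyopathy": (10, 18),
}


def onsets_open_at(age_years, onset_windows):
    """Return phenotypes whose onset window contains the given age (inclusive)."""
    return sorted(
        name
        for name, (lo, hi) in onset_windows.items()
        if lo <= age_years <= hi
    )


print(onsets_open_at(3, DMD_ONSET_YEARS))   # ['walking delay']
print(onsets_open_at(11, DMD_ONSET_YEARS))  # ['cardiomyopathy', 'loss of ambulation']
```

A single 1–5y disease-level range, as in the Orphadata side of the contrast, could not separate these two ages in the same way.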
Original abstract

Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents ChronoMedKG, a temporally-grounded biomedical knowledge graph with 460,497 evidence-linked triples covering 13,431 diseases, each annotated with temporal components (onset window, progression stage) extracted via a disease-autonomous multi-agent LLM pipeline from PubMed/PMC literature using multi-model consensus, credibility filtering, and ontology alignment. It reports 92.7% agreement with Orphadata, temporal grounding for 6,250 diseases absent from prior resources, and introduces the ChronoTQA benchmark (3,341 questions) showing frontier LLMs drop ~30 points on temporal vs. static tasks while ChronoMedKG RAG rescues 47-65% of long-tail failures (vs. 17-29% for HPOA-RAG).

Significance. If the temporal extractions prove reliable, ChronoMedKG would address a clear gap in existing KGs (PrimeKG, Hetionet) by enabling longitudinal clinical reasoning and improving RAG performance on time-sensitive queries; the benchmark and rescue percentages provide a concrete, falsifiable test of utility. The multi-signal credibility scoring and PMID traceability are positive design choices that support reproducibility.

major comments (3)
  1. [Abstract] Abstract: The 92.7% agreement is measured against Orphadata, which supplies static phenotype-disease links but no temporal annotations; this provides no direct validation of the novel temporal components (onset windows, progression stages) that underpin the headline claims of rescuing 47-65% of long-tail failures and grounding 6,250 new diseases.
  2. [Methods] Methods / Construction pipeline: No error analysis, inter-annotator agreement, or human validation protocol is described for the temporal qualifiers extracted by the multi-model LLM consensus; systematic biases in interpreting phrases such as “early childhood” or “progressive after age 10” could propagate undetected into the 460k triples.
  3. [Evaluation] Evaluation: The ChronoTQA benchmark results and RAG rescue percentages presuppose accurate temporal labels; without a temporal-specific ground-truth subset or ablation on extraction accuracy, it is unclear whether the reported gains are attributable to correct temporal grounding or to other factors.
minor comments (2)
  1. [Methods] Clarify the exact definition and weighting of the multi-signal credibility score and the consensus threshold used for filtering the 13M raw extractions down to 460k triples.
  2. [Results] Provide the distribution of temporal component types across the 13,431 diseases to allow readers to assess coverage of onset vs. progression vs. other qualifiers.
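A validation of the kind these comments ask for could start from a simple comparison of each extracted onset range with a curated one. The sketch below sorts a pair of ranges into containment, partial overlap, or disjoint with a small or large gap, loosely mirroring the categories in the Figure 5 caption. The category names, the function name, and the numeric range used for “childhood” are assumptions; only the 10-year gap cut-off comes from the caption.

```python
def classify_onset_range(extracted, gold, max_gap_years=10.0):
    """Classify an extracted (lo, hi) onset range against a gold (lo, hi) range.

    Identical ranges are reported as "contained" here, which is a simplification.
    """
    e_lo, e_hi = extracted
    g_lo, g_hi = gold
    if e_lo > e_hi or g_lo > g_hi:
        raise ValueError("ranges must satisfy lo <= hi")
    if g_lo <= e_lo and e_hi <= g_hi:
        return "contained"
    if e_lo <= g_hi and g_lo <= e_hi:
        return "partial overlap"
    # Disjoint: measure the distance between the nearest endpoints.
    gap = e_lo - g_hi if e_lo > g_hi else g_lo - e_hi
    return "disjoint, large gap" if gap > max_gap_years else "disjoint, small gap"


CHILDHOOD = (1, 11)  # hypothetical numeric reading of a coarse label

print(classify_onset_range((3, 8), CHILDHOOD))  # contained
print(classify_onset_range((4, 7), (1, 5)))     # partial overlap
print(classify_onset_range((8, 10), (1, 5)))    # disjoint, small gap
print(classify_onset_range((30, 40), (1, 5)))   # disjoint, large gap
```

How coarse labels such as “early childhood” are mapped to numbers drives the outcome, which is the bias the second major comment points at.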

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify key areas for strengthening the validation of temporal components in ChronoMedKG. We respond to each major comment below and indicate the revisions we will make to the manuscript.

  1. Referee: [Abstract] Abstract: The 92.7% agreement is measured against Orphadata, which supplies static phenotype-disease links but no temporal annotations; this provides no direct validation of the novel temporal components (onset windows, progression stages) that underpin the headline claims of rescuing 47-65% of long-tail failures and grounding 6,250 new diseases.

    Authors: We agree that the 92.7% agreement evaluates the static disease-phenotype associations against Orphadata and does not constitute direct validation of the temporal annotations. The temporal components are generated via the multi-agent LLM consensus pipeline with credibility filtering, and their practical value is shown by the differential performance on temporal versus static tasks in ChronoTQA together with the RAG rescue rates. In revision we will clarify this distinction in the abstract and add explicit language noting that temporal validation is supported indirectly through benchmark utility and the addition of temporal data for 6,250 previously ungrounded diseases. revision: yes

  2. Referee: [Methods] Methods / Construction pipeline: No error analysis, inter-annotator agreement, or human validation protocol is described for the temporal qualifiers extracted by the multi-model LLM consensus; systematic biases in interpreting phrases such as “early childhood” or “progressive after age 10” could propagate undetected into the 460k triples.

    Authors: The referee correctly notes the lack of reported error analysis or human validation specifically for temporal qualifiers. While multi-model consensus and multi-signal credibility scoring are intended to mitigate individual biases, the submitted manuscript does not include inter-annotator agreement metrics or a human review protocol for temporal phrase interpretation. We will add a dedicated error-analysis subsection to the Methods that reports results from a sampled human validation of temporal extractions, including agreement rates and discussion of potential systematic biases in onset and progression phrasing. revision: yes

  3. Referee: [Evaluation] Evaluation: The ChronoTQA benchmark results and RAG rescue percentages presuppose accurate temporal labels; without a temporal-specific ground-truth subset or ablation on extraction accuracy, it is unclear whether the reported gains are attributable to correct temporal grounding or to other factors.

    Authors: We acknowledge that the reported gains rest on the assumption of accurate temporal labels and that a dedicated temporal ground-truth subset or ablation would allow stronger causal claims. The benchmark design already contrasts temporal and static tasks, and the larger rescue effect with ChronoMedKG versus HPOA-RAG provides supporting evidence. In the revised manuscript we will include an ablation comparing RAG performance with and without temporal qualifiers and will report results on a manually verified temporal ground-truth subset to isolate the contribution of the temporal annotations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via external extraction and validation

The paper constructs ChronoMedKG via a multi-agent LLM pipeline that extracts from external PubMed/PMC literature, retains only multi-model consensus relations after credibility filtering and ontology alignment to Orphadata, then reports empirical agreement (92.7%) and retrieval rescue rates (47-65%) on the ChronoTQA benchmark. These outcomes are measured results of the pipeline applied to independent literature and benchmarks rather than any redefinition of inputs, fitted parameters renamed as predictions, or self-citation chains that reduce claims to tautologies. No equations, ansatzes, or uniqueness theorems are invoked that collapse the temporal grounding or performance numbers back to the construction process itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach depends on assumptions about LLM reliability for temporal extraction and filtering criteria that are not independently verified beyond one external agreement score.

free parameters (2)
  • credibility score threshold
    Multi-signal credibility score used to filter relations; exact threshold or weighting not stated.
  • consensus requirement
    Number or agreement level across frontier LLMs required to retain a relation.
axioms (2)
  • [domain assumption] Frontier LLMs can reliably extract temporal clinical knowledge from literature when using consensus
    Core assumption enabling the multi-agent pipeline.
  • [domain assumption] Ontology alignment with resources like Orphadata preserves temporal meaning
    Invoked to integrate extracted relations.
invented entities (1)
  • temporal components (onset window, progression stage) [no independent evidence]
    purpose: To encode when each association becomes clinically relevant
    New attributes added to each triple in the graph.

pith-pipeline@v0.9.0 · 5895 in / 1528 out tokens · 59432 ms · 2026-05-22T05:38:39.791826+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

  1. [1]

    The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

    Md Shamim Ahmed, Maja Dusanic, Moritz Nikolai Kirschner, Elisabeth Nyoungui, Jana Zschüntzsch, Lukas Galke Poech, and Richard Röttger. The provenance gap in clinical AI: Evidence-traceable temporal knowledge graphs for rare disease reasoning. arXiv preprint arXiv:2604.17114, Apr 2026. doi: 10.48550/arXiv.2604.17114. URL https://arxiv.org/abs/2604.17114

  2. [2]

    PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings

    Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings. Journal of Machine Learning Research, 22(82):1–6, 2021. arXiv:2007.14175

  3. [3]

    Translating embeddings for modeling multi-relational data

    Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, volume 26, 2013

  4. [4]

    Building a knowledge graph to enable precision medicine

    Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific Data, 10(1):67, 2023. doi: 10.1038/s41597-023-01960-3

  5. [5]

    RareBench: Can LLMs serve as rare diseases specialists?

    Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, and Ting Chen. RareBench: Can LLMs serve as rare diseases specialists? In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). ACM, 2024. doi: 10.1145/3637528.3671576. URL https://doi.org/10.1145/3637528.3671576

  6. [6]

    HyTE: Hyperplane-based temporally aware knowledge graph embedding

    Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. HyTE: Hyperplane-based temporally aware knowledge graph embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2001–2011, 2018. doi: 10.18653/v1/D18-1225

  7. [7]

    Datasheets for datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://cacm.acm.org/research/datasheets-for-datasets/

  8. [8]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.03300

  9. [9]

    Systematic integration of biomedical knowledge prioritizes drugs for repurposing

    Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6:e26726, 2017. doi: 10.7554/eLife.26726

  10. [10]

    The GA4GH Phenopacket schema defines a computable representation of clinical data

    Julius O B Jacobsen, Michael Baudis, Gareth S Baynam, et al. The GA4GH Phenopacket schema defines a computable representation of clinical data. Nature Biotechnology, 40(6):817–820, 2022. doi: 10.1038/s41587-022-01357-4

  11. [11]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

  12. [12]

    PubMedQA: A Dataset for Biomedical Research Question Answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2567–2577, 2019. doi: 10.18653/v1/D19-1259

  13. [13]

    The Human Phenotype Ontology in 2024: phenotypes around the world

    Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, et al. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Research, 52(D1):D1333–D1346, 2024. doi: 10.1093/nar/gkad1005

  14. [14]

    Self-alignment pretraining for biomedical entity representations

    Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4228–4238, 2021. doi: 10.18653/v1/2021.naacl-main.334

  15. [15]

    KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment

    Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, and Jinzhuo Wang. KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2025. NeurIPS 2025 Spotlight Poster. arXiv:2502.06472

  16. [16]

    T-Phenotype: Discovering phenotypes of predictive temporal patterns in disease progression

    Yuchao Qin, Mihaela van der Schaar, and Changhee Lee. T-Phenotype: Discovering phenotypes of predictive temporal patterns in disease progression. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pages 3466–3492, 2023

  17. [17]

    Database resources of the national center for biotechnology information

    Eric W Sayers, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, Kathryn Funk, Chris Kelly, Sunghwan Kim, et al. Database resources of the national center for biotechnology information. Nucleic Acids Research, 50(D1):D20–D26, 2022. doi: 10.1093/nar/gkab1112

  18. [18]

    KGARevion: An AI agent for knowledge-intensive biomedical QA

    Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, and Marinka Zitnik. KGARevion: An AI agent for knowledge-intensive biomedical QA. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=tnB94WQGrn

  19. [19]

    RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space

    Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations (ICLR), 2019. arXiv:1902.10197

  20. [20]

    Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs

    Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3462–3471, 2017

  21. [21]

    An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

    George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138, 2015. doi: 10.1186/s12859-015-0564-6

  22. [22]

    Med-KGent: A large language model agent framework for constructing temporally evolving medical knowledge graph

    Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, and Le Song. Med-KGent: A large language model agent framework for constructing temporally evolving medical knowledge graph. arXiv preprint arXiv:2508.12393, 2025. doi: 10.48550/arXiv.2508.12393. URL http...

  23. [23]

    A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research

    Yuan Zhang, Xin Sui, Feng Pan, Kaixian Yu, Keqiao Li, Shubo Tian, Arslan Erdengasileng, Qing Han, Wanjing Wang, Jianan Wang, Jian Wang, Donghu Sun, Henry Chung, Jun Zhou, Eric Zhou, Ben Lee, Peili Zhang, Xing Qiu, Tingting Zhao, and Jinfeng Zhang. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nature...

  24. [24]

    Automating biomedical knowledge graph construction for context-aware scientific inference

    Yikai Zheng, Wanquan Liu, Bi Zeng, Yichun Feng, Xiawei Du, Lu Zhou, and Yixue Li. Automating biomedical knowledge graph construction for context-aware scientific inference. bioRxiv,

  25. [25]

    proximal weakness

    doi: 10.64898/2026.01.14.699420. URL https://www.biorxiv.org/content/10.64898/2026.01.14.699420v1. Preprint, posted 14 January 2026. Data, Code, and AI Usage: ChronoMedKG (CC BY 4.0) and ChronoTQA are archived on Zenodo at https://doi.org/10.5281/zenodo.19697542; pipeline and experiment code (MIT) are on GitLab at https://gitlab.sdu.dk/screen4care/chron...

  26. [26]

    TEMPORAL FACTS: onset ages, disease milestones, progression timelines, treatment timing, discovery dates

  27. [27]

    EVIDENCE-DATED FACTS: relationships anchored by publication year

  28. [28]

    CONDITIONAL FACTS: relationships that depend on age, stage, genetic subtype

  29. [29]

    triples": [ {

    STATIC FACTS: general relationships without temporal context Output format (JSON): { "triples": [ { "subject": "entity name", "subject_type": "disease|gene/protein|drug|phenotype|anatomy|...", "relation": "disease_protein|indication|disease_phenotype_positive|...", "object": "entity name", "object_type": "same vocabulary as subject_type", "confidence": "h...

  30. [30]

    norm ops not supported for complex y

    PrimeKG, by contrast, has zero evidence dates at the edge level, leaving no way to assess recency, supersession, or evidence evolution. I Extended Ablations I.1 Link Prediction with Temporal Features: Setup and Full Result To test whether ChronoMedKG’s temporal annotations provide useful signal for standard KG evaluation, we trained TransE [3] via PyKEEN...