ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

Farzaneh Firoozbakht; Jan Baumbach; Lukas Galke Poech; Md Shamim Ahmed; Richard Röttger

arxiv: 2605.22734 · v1 · pith:53LYGPKLnew · submitted 2026-05-21 · 💻 cs.CL

Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords biomedical knowledge graph · temporal reasoning · clinical reasoning · knowledge extraction · large language models · rare diseases · retrieval augmented generation · benchmark dataset

The pith

ChronoMedKG adds explicit timing to disease associations so retrieval systems can handle questions that depend on when symptoms appear during a disease course.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing biomedical knowledge graphs treat associations as fixed facts and therefore cannot support clinical reasoning that turns on timing, such as whether a finding at age three points to one diagnosis while the same finding at age thirteen points to another. The authors build ChronoMedKG by running multiple frontier language models over PubMed and PMC literature in a disease-by-disease pipeline, keeping only triples that survive consensus, credibility scoring, and ontology alignment. The resulting graph supplies temporal labels such as onset windows and progression stages for more than thirteen thousand diseases and supplies the first such coverage for thousands of conditions that lacked it. When the graph is used for retrieval, it recovers a much larger share of large-language-model errors on time-sensitive questions than static resources do.
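To make the age-three versus age-thirteen point concrete, the following is a minimal Python sketch of an age-conditioned lookup. The `OnsetFact` record, the disease names, and the age windows are all invented placeholders for illustration; none of them are entries or schema elements taken from ChronoMedKG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnsetFact:
    disease: str
    phenotype: str
    onset_min_years: float
    onset_max_years: float


# Hypothetical facts: disease names and age windows are placeholders.
FACTS = [
    OnsetFact("disease_A", "gait disturbance", 2.0, 5.0),
    OnsetFact("disease_B", "gait disturbance", 10.0, 15.0),
]


def static_candidates(phenotype, facts=FACTS):
    """Static-KG view: every disease associated with the phenotype."""
    return sorted({f.disease for f in facts if f.phenotype == phenotype})


def temporal_candidates(phenotype, age_years, facts=FACTS):
    """Temporal view: keep only diseases whose onset window contains the age."""
    return sorted({
        f.disease
        for f in facts
        if f.phenotype == phenotype
        and f.onset_min_years <= age_years <= f.onset_max_years
    })


print(static_candidates("gait disturbance"))        # ['disease_A', 'disease_B']
print(temporal_candidates("gait disturbance", 3))   # ['disease_A']
print(temporal_candidates("gait disturbance", 13))  # ['disease_B']
```

The static lookup cannot separate the two candidates, while the age-conditioned one returns a different disease at each age.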

Core claim

ChronoMedKG contains 460,497 evidence-linked triples, each tied to temporal components such as onset window or progression stage and backed by PMID-traceable evidence plus a multi-signal credibility score. It is produced by a disease-autonomous multi-agent pipeline in which several frontier LLMs independently extract relations from the literature; only relations supported by multi-model consensus, credibility filtering, and ontology alignment are retained. The graph adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases, and retrieval using it rescues 47-65 percent of long-tail failures on temporal questions, compared with 17-29 percent for HPOA-RAG.
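The retention rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `consensus_filter`, the tuple layout, the `min_models` and `min_credibility` defaults, and the use of a mean score are all hypothetical, and the ontology-alignment step is omitted.

```python
from collections import defaultdict


def consensus_filter(extractions, min_models=2, min_credibility=0.5):
    """Keep triples proposed by enough distinct models with a high enough mean score.

    extractions: iterable of (model_name, triple, credibility), where triple is
    a hashable tuple. Both thresholds are assumed values for illustration.
    """
    supporters = defaultdict(set)
    scores = defaultdict(list)
    for model, triple, credibility in extractions:
        supporters[triple].add(model)
        scores[triple].append(credibility)

    kept = []
    for triple, models in supporters.items():
        mean_score = sum(scores[triple]) / len(scores[triple])
        if len(models) >= min_models and mean_score >= min_credibility:
            kept.append((triple, mean_score))
    return kept


# Placeholder triples: (subject, relation, object, temporal qualifier).
t1 = ("disease_A", "has_phenotype", "phenotype_X", "onset 2-5y")
t2 = ("disease_A", "has_phenotype", "phenotype_Y", "onset 8-12y")
t3 = ("disease_B", "has_phenotype", "phenotype_X", "onset 10-15y")
raw = [
    ("model_1", t1, 0.875), ("model_2", t1, 0.625),  # two models, mean 0.75
    ("model_1", t2, 0.875),                          # only one model
    ("model_2", t3, 0.25), ("model_3", t3, 0.375),   # two models, low score
]
print(consensus_filter(raw))  # only t1 survives, with mean score 0.75
```

In this toy input, `t2` fails the consensus gate and `t3` fails the credibility gate, so a single triple remains.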

What carries the argument

The disease-autonomous multi-agent LLM pipeline with multi-model consensus, credibility filtering, and ontology alignment that extracts and validates temporal clinical associations from PubMed and PMC literature.

If this is right

Large language models drop roughly 30 points when moving from static to temporal clinical questions.
Retrieval with ChronoMedKG recovers between 47 and 65 percent of long-tail failures on temporal tasks.
The graph supplies temporal coverage for 6,250 diseases that lacked it in HPOA, Orphadata, or Phenopackets.
ChronoTQA provides eight task types that isolate temporal clinical reasoning from static controls.
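One plausible reading of the "rescue" figures is the share of questions a model answers wrongly without retrieval that it answers correctly with retrieval. The sketch below assumes that definition, which this page does not spell out; the function name and the toy correctness vectors are made up.

```python
def rescue_rate(baseline_correct, rag_correct):
    """Fraction of baseline failures that become correct once retrieval is added."""
    if len(baseline_correct) != len(rag_correct):
        raise ValueError("both runs must cover the same questions")
    failures = [i for i, ok in enumerate(baseline_correct) if not ok]
    if not failures:
        raise ValueError("no baseline failures to rescue")
    rescued = sum(1 for i in failures if rag_correct[i])
    return rescued / len(failures)


# Toy per-question outcomes (True = answered correctly).
baseline = [True, False, False, True, False, False]
with_rag = [True, True, False, True, True, False]
print(rescue_rate(baseline, with_rag))  # 0.5
```

Under this definition the denominator is the failure set only, so a rescue rate says nothing about questions that retrieval turns from correct to incorrect.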

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be rerun periodically to keep the temporal labels current as new literature appears.
Electronic health record systems could query the graph to flag when a patient's current presentation deviates from expected disease timelines.
The approach of multi-model consensus extraction might be tested on non-biomedical domains that also require time-stamped relations, such as legal or historical event graphs.

Load-bearing premise

The assumption that multiple large language models can reliably converge on accurate temporal medical facts from the scientific literature and that the subsequent filtering steps remove most errors.

What would settle it

A blinded expert review of several hundred randomly sampled triples in which clinicians compare the stated onset window or progression stage against the cited source papers and find agreement below 80 percent.
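A review of this kind would yield an agreement proportion with sampling uncertainty, so the 80 percent line is better judged against an interval than a point estimate. The sketch below uses the standard Wilson score interval; the sample size of 300 and the count of 225 agreements are hypothetical numbers chosen only to illustrate the check.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical outcome: clinicians agree with 225 of 300 sampled triples.
lower, upper = wilson_interval(225, 300)
print(upper < 0.80)  # True: even the upper bound falls below the 80% line
```

With these made-up counts the whole interval sits under 80 percent, which is the situation the test above describes as settling the question against the graph.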

Figures

Figures reproduced from arXiv: 2605.22734 by Farzaneh Firoozbakht, Jan Baumbach, Lukas Galke Poech, Md Shamim Ahmed, Richard Röttger.

**Figure 1.** ChronoMedKG: from what is associated to when it emerges. (A) Across 17,080 PrimeKG diseases, zero edges carry temporal annotations. (B) 4-agent pipeline ingests ∼2.5M public documents and produces literature-grounded temporal triples with PMID provenance, multi-LLM consensus, and six-signal credibility scoring. (C) Downstream outputs unique to ChronoMedKG. (D) Top-line numbers. Constructed de novo from pub…

**Figure 2.** Coverage and granularity of disease onset data. (A) ChronoMedKG covers 8,935 of 17,080 PrimeKG diseases (52.3%); 6,250 have onset data absent from every curated gold standard (+18.4 pp over Orphadata’s 33.9%). (B) DMD granularity contrast: Orphadata gives a single 1–5y range; ChronoMedKG records five per-phenotype onsets (walking delay 2–5y, Gowers sign 5–8y, loss of ambulation 8–12y, cardiomyopathy 10–18y…

**Figure 3.** Example temporal phenotype profile for Duchenne muscular dystrophy. Stages (presymptomatic, ambulatory, non-ambulatory, respiratory) partition the age axis; phenotype bars show the onset window for each clinical feature, colour-coded by system (motor, cardiac, respiratory). The static-KG statement “DMD is associated with cardiomyopathy” is replaced in ChronoMedKG by a stage-aware statement: “cardiomyopathy…

**Figure 4.** Diagnostic odysseys in four representative PMC cases. Pink bars show the misdiagnosis period (from initial…

**Figure 5.** Classification of 2,563 ChronoMedKG–Orphadata onset-range disagreements. Only 7.3% are factually incorrect temporal ranges (no overlap, >10y gap from gold). 50.1% are strictly contained within the gold range, and a further 41.8% reflect boundary effects or granularity mismatches (e.g., ChronoMedKG’s “3–8y” vs. Orphadata’s “childhood”), not extraction errors. Percentages sum to 99.2% due to rounding.

**Figure 6.** Disease trajectory archetypes discovered by unsupervised clustering of 8,935 diseases with temporal features. Four…

**Figure 7.** Evidence publication year distribution across 455K ChronoMedKG triples with PMID-traceable dates. Median year:…

**Figure 8.** Mean reciprocal rank for TransE link prediction on disease–phenotype edges (3 random seeds; error bars are std).
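The Figure 5 caption sorts onset-range disagreements into three groups. A rough sketch of such a classifier is given below; the category names and the function are hypothetical, the 10-year gap comes from the caption, and the choice to treat non-overlapping ranges with a gap of 10 years or less as boundary cases is an assumption the caption does not confirm.

```python
def classify_disagreement(pred, gold, max_gap_years=10.0):
    """Classify a predicted onset range against a gold range, both (low, high) in years."""
    p_lo, p_hi = pred
    g_lo, g_hi = gold
    if g_lo <= p_lo and p_hi <= g_hi:
        return "contained"
    overlap = min(p_hi, g_hi) - max(p_lo, g_lo)
    if overlap >= 0:
        return "boundary_or_granularity"
    # No overlap: the gap is the distance between the two ranges.
    if -overlap > max_gap_years:
        return "incorrect"
    return "boundary_or_granularity"  # assumed handling of small gaps


print(classify_disagreement((3, 8), (0, 12)))    # contained
print(classify_disagreement((4, 8), (1, 5)))     # boundary_or_granularity
print(classify_disagreement((30, 40), (1, 5)))   # incorrect
```

The example ranges are invented; mapping a label such as "childhood" to numeric bounds would need its own convention, which is not shown here.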

Original abstract

Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ChronoMedKG adds temporal onset and progression labels to a large biomedical KG via LLM consensus extraction and shows retrieval gains on a new temporal benchmark, but the temporal claims rest on indirect validation.

The main thing to know is that this paper builds a biomedical knowledge graph with explicit temporal components like onset windows and progression stages, extracted from literature through a multi-LLM pipeline, and tests it on ChronoTQA, a benchmark that mixes temporal and static clinical questions. The results indicate that frontier models perform worse on the temporal tasks than on the static ones and that retrieval from this graph recovers more long-tail failures than static alternatives do.

Referee Report

3 major / 2 minor

Summary. The paper presents ChronoMedKG, a temporally-grounded biomedical knowledge graph with 460,497 evidence-linked triples covering 13,431 diseases, each annotated with temporal components (onset window, progression stage) extracted via a disease-autonomous multi-agent LLM pipeline from PubMed/PMC literature using multi-model consensus, credibility filtering, and ontology alignment. It reports 92.7% agreement with Orphadata, temporal grounding for 6,250 diseases absent from prior resources, and introduces the ChronoTQA benchmark (3,341 questions) showing frontier LLMs drop ~30 points on temporal vs. static tasks while ChronoMedKG RAG rescues 47-65% of long-tail failures (vs. 17-29% for HPOA-RAG).

Significance. If the temporal extractions prove reliable, ChronoMedKG would address a clear gap in existing KGs (PrimeKG, Hetionet) by enabling longitudinal clinical reasoning and improving RAG performance on time-sensitive queries; the benchmark and rescue percentages provide a concrete, falsifiable test of utility. The multi-signal credibility scoring and PMID traceability are positive design choices that support reproducibility.

major comments (3)

[Abstract] Abstract: The 92.7% agreement is measured against Orphadata, which supplies static phenotype-disease links but no temporal annotations; this provides no direct validation of the novel temporal components (onset windows, progression stages) that underpin the headline claims of rescuing 47-65% of long-tail failures and grounding 6,250 new diseases.
[Methods] Methods / Construction pipeline: No error analysis, inter-annotator agreement, or human validation protocol is described for the temporal qualifiers extracted by the multi-model LLM consensus; systematic biases in interpreting phrases such as “early childhood” or “progressive after age 10” could propagate undetected into the 460k triples.
[Evaluation] Evaluation: The ChronoTQA benchmark results and RAG rescue percentages presuppose accurate temporal labels; without a temporal-specific ground-truth subset or ablation on extraction accuracy, it is unclear whether the reported gains are attributable to correct temporal grounding or to other factors.

minor comments (2)

[Methods] Clarify the exact definition and weighting of the multi-signal credibility score and the consensus threshold used for filtering the 13M raw extractions down to 460k triples.
[Results] Provide the distribution of temporal component types across the 13,431 diseases to allow readers to assess coverage of onset vs. progression vs. other qualifiers.
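The distribution requested in the second minor comment could be tabulated along these lines. This is a sketch only: the `(disease, temporal_type)` row layout and the type labels are assumed, not the released schema.

```python
from collections import Counter, defaultdict


def temporal_type_distribution(rows):
    """Return {temporal_type: (number of triples, number of distinct diseases)}."""
    triple_counts = Counter()
    diseases_by_type = defaultdict(set)
    for disease, temporal_type in rows:
        triple_counts[temporal_type] += 1
        diseases_by_type[temporal_type].add(disease)
    return {t: (triple_counts[t], len(diseases_by_type[t])) for t in triple_counts}


# Placeholder rows: (disease, temporal component type).
rows = [
    ("disease_A", "onset_window"),
    ("disease_A", "onset_window"),
    ("disease_A", "progression_stage"),
    ("disease_B", "onset_window"),
]
print(temporal_type_distribution(rows))
# {'onset_window': (3, 2), 'progression_stage': (1, 1)}
```

Reporting both counts separates components that are frequent overall from components that are spread across many diseases.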

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify key areas for strengthening the validation of temporal components in ChronoMedKG. We respond to each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Abstract] Abstract: The 92.7% agreement is measured against Orphadata, which supplies static phenotype-disease links but no temporal annotations; this provides no direct validation of the novel temporal components (onset windows, progression stages) that underpin the headline claims of rescuing 47-65% of long-tail failures and grounding 6,250 new diseases.

Authors: We agree that the 92.7% agreement evaluates the static disease-phenotype associations against Orphadata and does not constitute direct validation of the temporal annotations. The temporal components are generated via the multi-agent LLM consensus pipeline with credibility filtering, and their practical value is shown by the differential performance on temporal versus static tasks in ChronoTQA together with the RAG rescue rates. In revision we will clarify this distinction in the abstract and add explicit language noting that temporal validation is supported indirectly through benchmark utility and the addition of temporal data for 6,250 previously ungrounded diseases.
revision: yes

Referee: [Methods] Methods / Construction pipeline: No error analysis, inter-annotator agreement, or human validation protocol is described for the temporal qualifiers extracted by the multi-model LLM consensus; systematic biases in interpreting phrases such as “early childhood” or “progressive after age 10” could propagate undetected into the 460k triples.

Authors: The referee correctly notes the lack of reported error analysis or human validation specifically for temporal qualifiers. While multi-model consensus and multi-signal credibility scoring are intended to mitigate individual biases, the submitted manuscript does not include inter-annotator agreement metrics or a human review protocol for temporal phrase interpretation. We will add a dedicated error-analysis subsection to the Methods that reports results from a sampled human validation of temporal extractions, including agreement rates and discussion of potential systematic biases in onset and progression phrasing.
revision: yes

Referee: [Evaluation] Evaluation: The ChronoTQA benchmark results and RAG rescue percentages presuppose accurate temporal labels; without a temporal-specific ground-truth subset or ablation on extraction accuracy, it is unclear whether the reported gains are attributable to correct temporal grounding or to other factors.

Authors: We acknowledge that the reported gains rest on the assumption of accurate temporal labels and that a dedicated temporal ground-truth subset or ablation would allow stronger causal claims. The benchmark design already contrasts temporal and static tasks, and the larger rescue effect with ChronoMedKG versus HPOA-RAG provides supporting evidence. In the revised manuscript we will include an ablation comparing RAG performance with and without temporal qualifiers and will report results on a manually verified temporal ground-truth subset to isolate the contribution of the temporal annotations.
revision: yes
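The ablation promised in this response could take roughly the following shape: answer each question once with the full retrieved facts and once with the temporal fields removed, then compare accuracy. Everything named here is hypothetical, including the field names `onset_window` and `progression_stage`, the question layout, and `toy_answer_fn`, which stands in for a real model call.

```python
def strip_temporal(fact, temporal_keys=("onset_window", "progression_stage")):
    """Return a copy of a retrieved fact with its temporal qualifiers removed."""
    return {k: v for k, v in fact.items() if k not in temporal_keys}


def run_ablation(questions, answer_fn):
    """Return (accuracy with temporal qualifiers, accuracy without them)."""
    full = [answer_fn(q, q["facts"]) for q in questions]
    ablated = [answer_fn(q, [strip_temporal(f) for f in q["facts"]]) for q in questions]
    n = len(questions)
    return sum(full) / n, sum(ablated) / n


def toy_answer_fn(question, facts):
    """Stand-in for an LLM: correct only if a fact's onset window covers the age."""
    age = question["age_years"]
    return any(
        "onset_window" in f and f["onset_window"][0] <= age <= f["onset_window"][1]
        for f in facts
    )


questions = [
    {"age_years": 6, "facts": [{"phenotype": "phenotype_X", "onset_window": (5, 8)}]},
    {"age_years": 20, "facts": [{"phenotype": "phenotype_X", "onset_window": (5, 8)}]},
]
print(run_ablation(questions, toy_answer_fn))  # (0.5, 0.0)
```

Because both conditions retrieve the same facts and differ only in the temporal fields, any accuracy gap in a real run would be attributable to those fields rather than to retrieval coverage.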

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via external extraction and validation

The paper constructs ChronoMedKG via a multi-agent LLM pipeline that extracts from external PubMed/PMC literature, retains only multi-model consensus relations after credibility filtering and ontology alignment to Orphadata, then reports empirical agreement (92.7%) and retrieval rescue rates (47-65%) on the ChronoTQA benchmark. These outcomes are measured results of the pipeline applied to independent literature and benchmarks rather than any redefinition of inputs, fitted parameters renamed as predictions, or self-citation chains that reduce claims to tautologies. No equations, ansatzes, or uniqueness theorems are invoked that collapse the temporal grounding or performance numbers back to the construction process itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach depends on assumptions about LLM reliability for temporal extraction and filtering criteria that are not independently verified beyond one external agreement score.

free parameters (2)

credibility score threshold
Multi-signal credibility score used to filter relations; exact threshold or weighting not stated.
consensus requirement
Number or agreement level across frontier LLMs required to retain a relation.
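Since neither parameter value is stated, one way to see how much they matter is a simple sweep over candidate thresholds, counting how many triples would survive at each setting. The scores and thresholds below are invented for illustration and do not come from the paper.

```python
def retained_counts(scores, thresholds):
    """Map each candidate threshold to the number of scores at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Hypothetical credibility scores for five extracted triples.
scores = [0.25, 0.5, 0.5, 0.75, 1.0]
print(retained_counts(scores, (0.25, 0.5, 0.75)))
# {0.25: 5, 0.5: 4, 0.75: 2}
```

The same sweep applies to the consensus requirement by replacing scores with per-triple supporter counts and thresholds with candidate model counts.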

axioms (2)

domain assumption: Frontier LLMs can reliably extract temporal clinical knowledge from literature when using consensus
Core assumption enabling the multi-agent pipeline.
domain assumption: Ontology alignment with resources like Orphadata preserves temporal meaning
Invoked to integrate extracted relations.

invented entities (1)

temporal components (onset window, progression stage) · no independent evidence
purpose: To encode when each association becomes clinically relevant
New attributes added to each triple in the graph.

pith-pipeline@v0.9.0 · 5895 in / 1528 out tokens · 59432 ms · 2026-05-22T05:38:39.791826+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

ChronoMedKG is built using a multi-stage pipeline ... (1) Disease Profiler ... (2) Evidence Harvester ... (3) Knowledge Extractor ... (4) Quality Controller ... Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Each association is tied to temporal components like onset window or progression stage ... six-signal credibility score.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

[1]

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

Md Shamim Ahmed, Maja Dusanic, Moritz Nikolai Kirschner, Elisabeth Nyoungui, Jana Zschüntzsch, Lukas Galke Poech, and Richard Röttger. The provenance gap in clinical AI: Evidence-traceable temporal knowledge graphs for rare disease reasoning. arXiv preprint arXiv:2604.17114, Apr 2026. doi: 10.48550/arXiv.2604.17114. URL https://arxiv.org/abs/2604.17114

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.17114 2026
[2]

PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings

Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings. Journal of Machine Learning Research, 22(82):1–6, 2021. arXiv:2007.14175

work page arXiv 2021
[3]

Translating embeddings for modeling multi-relational data

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, volume 26, 2013

work page 2013
[4]

Building a knowledge graph to enable precision medicine

Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific Data, 10(1):67, 2023. doi: 10.1038/s41597-023-01960-3

work page doi:10.1038/s41597-023-01960-3 2023
[5]

RareBench: Can LLMs serve as rare diseases specialists?

Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, and Ting Chen. RareBench: Can LLMs serve as rare diseases specialists? In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). ACM, 2024. doi: 10.1145/3637528.3671576. URL https://doi.org/10.1145/3637528.3671576

work page doi:10.1145/3637528.3671576 2024
[6]

HyTE: Hyperplane-based temporally aware knowledge graph embedding

Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. HyTE: Hyperplane-based temporally aware knowledge graph embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2001–2011, 2018. doi: 10.18653/v1/D18-1225

work page doi:10.18653/v1/d18-1225 2018
[7]

Datasheets for datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://cacm.acm.org/research/datasheets-for-datasets/

work page doi:10.1145/3458723 2021
[8]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Systematic integration of biomedical knowledge prioritizes drugs for repurposing

Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6:e26726, 2017. doi: 10.7554/eLife.26726

work page doi:10.7554/elife.26726 2017
[10]

The GA4GH Phenopacket schema defines a computable representation of clinical data

Julius O B Jacobsen, Michael Baudis, Gareth S Baynam, et al. The GA4GH Phenopacket schema defines a computable representation of clinical data. Nature Biotechnology, 40(6):817–820, 2022. doi: 10.1038/s41587-022-01357-4

work page doi:10.1038/s41587-022-01357-4 2022
[11]

What disease does this patient have? A large-scale open domain question answering dataset from medical exams

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

work page doi:10.3390/app11146421 2021
[12]

PubMedQA: A Dataset for Biomedical Research Question Answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2567–2577, 2019. doi: 10.18653/v1/D19-1259

work page doi:10.18653/v1/d19-1259 2019
[13]

The Human Phenotype Ontology in 2024: phenotypes around the world

Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, et al. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Research, 52(D1):D1333–D1346, 2024. doi: 10.1093/nar/gkad1005

work page doi:10.1093/nar/gkad1005 2024
[14]

Self-alignment pretraining for biomedical entity representations

Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4228–4238, 2021. doi: 10.18653/v1/2021.naacl-main.334

work page doi:10.18653/v1/2021.naacl-main.334 2021
[15]

KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment

Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, and Jinzhuo Wang. KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2025. NeurIPS 2025 Spotlight Poster. arXiv:2502.06472

work page arXiv 2025
[16]

T-Phenotype: Discovering phenotypes of predictive temporal patterns in disease progression

Yuchao Qin, Mihaela van der Schaar, and Changhee Lee. T-Phenotype: Discovering phenotypes of predictive temporal patterns in disease progression. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pages 3466–3492, 2023

work page 2023
[17]

Database resources of the national center for biotechnology information

Eric W Sayers, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, Kathryn Funk, Chris Kelly, Sunghwan Kim, et al. Database resources of the national center for biotechnology information. Nucleic Acids Research, 50(D1):D20–D26, 2022. doi: 10.1093/nar/gkab1112

work page doi:10.1093/nar/gkab1112 2022
[18]

KGARevion: An AI agent for knowledge-intensive biomedical QA

Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, and Marinka Zitnik. KGARevion: An AI agent for knowledge-intensive biomedical QA. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=tnB94WQGrn

work page 2025
[19]

RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations (ICLR), 2019. arXiv:1902.10197

work page internal anchor Pith review Pith/arXiv arXiv 2019
[20]

Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs

Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3462–3471, 2017

work page 2017
[21]

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138, 2015. doi: 10.1186/s12859-015-0564-6

work page doi:10.1186/s12859-015-0564-6 2015
[22]

Med-KGent: A large language model agent framework for constructing temporally evolving medical knowledge graph

Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, and Le Song. Med-KGent: A large language model agent framework for constructing temporally evolving medical knowledge graph. arXiv preprint arXiv:2508.12393, 2025. doi: 10.48550/arXiv.2508.12393. URL http...

work page doi:10.48550/arxiv.2508.12393 2025
[23]

A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research

Yuan Zhang, Xin Sui, Feng Pan, Kaixian Yu, Keqiao Li, Shubo Tian, Arslan Erdengasileng, Qing Han, Wanjing Wang, Jianan Wang, Jian Wang, Donghu Sun, Henry Chung, Jun Zhou, Eric Zhou, Ben Lee, Peili Zhang, Xing Qiu, Tingting Zhao, and Jinfeng Zhang. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nature...

work page doi:10.1038/s42256-025-01014-w 2025
[24]

Automating biomedical knowledge graph construction for context-aware scientific inference

Yikai Zheng, Wanquan Liu, Bi Zeng, Yichun Feng, Xiawei Du, Lu Zhou, and Yixue Li. Automating biomedical knowledge graph construction for context-aware scientific inference. bioRxiv,

work page
[25]

proximal weakness

doi: 10.64898/2026.01.14.699420. URL https://www.biorxiv.org/content/10.64898/2026.01.14.699420v1. Preprint, posted 14 January 2026. Data, Code, and AI Usage: ChronoMedKG (CC BY 4.0) and ChronoTQA are archived on Zenodo at https://doi.org/10.5281/zenodo.19697542; pipeline and experiment code (MIT) are on GitLab at https://gitlab.sdu.dk/screen4care/chron...

work page doi:10.64898/2026.01.14.699420 2026
[26]

TEMPORAL FACTS: onset ages, disease milestones, progression timelines, treatment timing, discovery dates

work page
[27]

EVIDENCE-DATED FACTS: relationships anchored by publication year

work page
[28]

CONDITIONAL FACTS: relationships that depend on age, stage, genetic subtype

work page
[29]

triples": [ {

STATIC FACTS: general relationships without temporal context Output format (JSON): { "triples": [ { "subject": "entity name", "subject_type": "disease|gene/protein|drug|phenotype|anatomy|...", "relation": "disease_protein|indication|disease_phenotype_positive|...", "object": "entity name", "object_type": "same vocabulary as subject_type", "confidence": "h...

work page 2015
[30]

norm ops not supported for complex y

PrimeKG, by contrast, has zero evidence dates at the edge level, leaving no way to assess recency, supersession, or evidence evolution. I Extended Ablations I.1 Link Prediction with Temporal Features: Setup and Full Result To test whether ChronoMedKG’s temporal annotations provide useful signal for standard KG evaluation, we trained TransE [3] via PyKEEN...

work page doi:10.5281/zenodo.19697542

[1] [1]

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

Md Shamim Ahmed, Maja Dusanic, Moritz Nikolai Kirschner, Elisabeth Nyoungui, Jana Zschüntzsch, Lukas Galke Poech, and Richard Röttger. The provenance gap in clinical AI: Evidence-traceable temporal knowledge graphs for rare disease reasoning.arXiv preprint arXiv:2604.17114, Apr 2026. doi: 10.48550/arXiv.2604.17114. URL https://arxiv.org/ abs/2604.17114

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.17114 2026

[2] Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. PyKEEN 1.0: A Python library for training and evaluating knowledge graph embeddings. Journal of Machine Learning Research, 22(82):1–6, 2021. arXiv:2007.14175

[3] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, volume 26, 2013

[4] Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific Data, 10(1):67, 2023. doi: 10.1038/s41597-023-01960-3

[5] Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, and Ting Chen. RareBench: Can LLMs serve as rare diseases specialists? In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). ACM, 2024. doi: 10.1145/3637528.3671576. URL https://doi.org/10.1145/3637528.3671576

[6] Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. HyTE: Hyperplane-based temporally aware knowledge graph embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2001–2011, 2018. doi: 10.18653/v1/D18-1225

[7] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://cacm.acm.org/research/datasheets-for-datasets/

[8] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.03300

[9] Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6:e26726, 2017. doi: 10.7554/eLife.26726

[10] Julius O B Jacobsen, Michael Baudis, Gareth S Baynam, et al. The GA4GH Phenopacket schema defines a computable representation of clinical data. Nature Biotechnology, 40(6):817–820, 2022. doi: 10.1038/s41587-022-01357-4

[11] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421

[12] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2567–2577, 2019. doi: 10.18653/v1/D19-1259

[13] Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, et al. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Research, 52(D1):D1333–D1346, 2024. doi: 10.1093/nar/gkad1005

[14] Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4228–4238, 2021. doi: 10.18653/v1/2021.naacl-main.334

[15] Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, and Jinzhuo Wang. KARMA: Leveraging multi-agent LLMs for automated knowledge graph enrichment. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2025. NeurIPS 2025 Spotlight Poster. arXiv:2502.06472

[16] Yuchao Qin, Mihaela van der Schaar, and Changhee Lee. T-Phenotype: Discovering phenotypes of predictive temporal patterns in disease progression. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pages 3466–3492, 2023

[17] Eric W Sayers, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, Kathryn Funk, Chris Kelly, Sunghwan Kim, et al. Database resources of the national center for biotechnology information. Nucleic Acids Research, 50(D1):D20–D26, 2022. doi: 10.1093/nar/gkab1112

[18] Xiaorui Su, Yibo Wang, Shanghua Gao, Xiaolong Liu, Valentina Giunchiglia, Djork-Arné Clevert, and Marinka Zitnik. KGARevion: An AI agent for knowledge-intensive biomedical QA. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=tnB94WQGrn

[19] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations (ICLR), 2019. arXiv:1902.10197

[20] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3462–3471, 2017

[21] George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138, 2015. doi: 10.1186/s12859-015-0564-6

[22] Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, and Le Song. MedKGent: A large language model agent framework for constructing temporally evolving medical knowledge graph. arXiv preprint arXiv:2508.12393, 2025. doi: 10.48550/arXiv.2508.12393. URL http...

[23] Yuan Zhang, Xin Sui, Feng Pan, Kaixian Yu, Keqiao Li, Shubo Tian, Arslan Erdengasileng, Qing Han, Wanjing Wang, Jianan Wang, Jian Wang, Donghu Sun, Henry Chung, Jun Zhou, Eric Zhou, Ben Lee, Peili Zhang, Xing Qiu, Tingting Zhao, and Jinfeng Zhang. A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research. Nature...

[24] Yikai Zheng, Wanquan Liu, Bi Zeng, Yichun Feng, Xiawei Du, Lu Zhou, and Yixue Li. Automating biomedical knowledge graph construction for context-aware scientific inference. bioRxiv. doi: 10.64898/2026.01.14.699420. URL https://www.biorxiv.org/content/10.64898/2026.01.14.699420v1. Preprint, posted 14 January 2026.

Data, Code, and AI Usage

ChronoMedKG (CC BY 4.0) and ChronoTQA are archived on Zenodo at https://doi.org/10.5281/zenodo.19697542; pipeline and experiment code (MIT) are on GitLab at https://gitlab.sdu.dk/screen4care/chron...

TEMPORAL FACTS: onset ages, disease milestones, progression timelines, treatment timing, discovery dates

EVIDENCE-DATED FACTS: relationships anchored by publication year

CONDITIONAL FACTS: relationships that depend on age, stage, genetic subtype

STATIC FACTS: general relationships without temporal context

Output format (JSON): { "triples": [ { "subject": "entity name", "subject_type": "disease|gene/protein|drug|phenotype|anatomy|...", "relation": "disease_protein|indication|disease_phenotype_positive|...", "object": "entity name", "object_type": "same vocabulary as subject_type", "confidence": "h...
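The output format in the extraction prompt above lends itself to a simple structural check on each model response. The following is a minimal sketch, not the authors' pipeline code: the function name `parse_triples` and the example values are hypothetical. Because the prompt text is truncated after the "confidence" key, only the six keys visible in the source are checked, and no type or confidence vocabulary is assumed.

```python
import json

# Keys visible in the prompt's output format. The source is truncated after
# "confidence", so any further keys are unknown and are not checked here.
REQUIRED_KEYS = (
    "subject", "subject_type", "relation",
    "object", "object_type", "confidence",
)


def parse_triples(raw: str) -> list[dict]:
    """Parse a JSON response and keep triples with every required key set.

    A triple is kept only if it is a JSON object and each required key maps
    to a non-empty string. Vocabulary checks are left out on purpose because
    the full vocabularies are elided in the source.
    """
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        return []
    valid = []
    for triple in payload.get("triples", []):
        if not isinstance(triple, dict):
            continue
        if all(
            isinstance(triple.get(key), str) and triple[key].strip()
            for key in REQUIRED_KEYS
        ):
            valid.append(triple)
    return valid


# Hypothetical response: the first triple is complete, the second lacks
# "object_type" and "confidence" and is therefore dropped.
example_response = json.dumps({
    "triples": [
        {
            "subject": "example disease",
            "subject_type": "disease",
            "relation": "disease_phenotype_positive",
            "object": "example phenotype",
            "object_type": "phenotype",
            "confidence": "placeholder-label",
        },
        {
            "subject": "example disease",
            "subject_type": "disease",
            "relation": "indication",
            "object": "example drug",
        },
    ]
})
kept = parse_triples(example_response)
```

Dropping malformed triples rather than raising keeps one bad item from discarding an entire response. A stricter variant could reject the whole response instead.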

PrimeKG, by contrast, has zero evidence dates at the edge level, leaving no way to assess recency, supersession, or evidence evolution.

I Extended Ablations

I.1 Link Prediction with Temporal Features: Setup and Full Result

To test whether ChronoMedKG’s temporal annotations provide useful signal for standard KG evaluation, we trained TransE [3] via PyKEEN...
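For readers unfamiliar with TransE [3], the sketch below illustrates its scoring function and margin-based loss in plain Python. It is a hedged illustration of the general technique only: the experiment described above is truncated in the source, so nothing here reflects the authors' PyKEEN configuration, embedding dimension, or temporal features. The function names and the toy vectors are assumptions made for the example.

```python
import math


def transe_score(head, relation, tail, p=2):
    """Return the TransE plausibility score: -||head + relation - tail||_p.

    Higher (closer to zero) means the triple is more plausible. Only p=1
    and p=2 are handled in this sketch.
    """
    diffs = [h + r - t for h, r, t in zip(head, relation, tail)]
    if p == 1:
        return -sum(abs(d) for d in diffs)
    return -math.sqrt(sum(d * d for d in diffs))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Margin loss: zero once the positive outscores the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


# Toy 2-dimensional embeddings (hypothetical values).
head = [1.0, 0.0]
relation = [0.0, 1.0]
true_tail = [1.0, 1.0]       # head + relation lands exactly on this tail
corrupt_tail = [4.0, 5.0]    # a corrupted (negative) tail

pos = transe_score(head, relation, true_tail)
neg = transe_score(head, relation, corrupt_tail)
loss = margin_ranking_loss(pos, neg)
```

In a library such as PyKEEN the embeddings are learned by minimising this kind of loss over sampled negatives. The sketch only shows how a single positive and negative pair is scored.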