pith. machine review for the scientific record.

arxiv: 2604.16422 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI · cs.LG

Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.LG
keywords: biomedical knowledge injection, GraphRAG, continual pretraining, UMLS, PubMedQA, BioASQ, knowledge graphs, language models

The pith

GraphRAG on a UMLS knowledge graph improves LLaMA 3-8B accuracy by more than 3 points on PubMedQA and 5 points on BioASQ without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two routes for adding structured biomedical knowledge from the UMLS Metathesaurus to language models. One route converts the graph into a 100-million-token text corpus and uses it for continual pretraining of BERT and BioBERT. The other route keeps the graph intact and queries it at inference time through a GraphRAG pipeline. Pretraining lifts performance on the base BERT model for knowledge-heavy tasks, yet the gains shrink once the base model is already BioBERT. GraphRAG applied to LLaMA 3-8B produces larger, more consistent lifts on the two question-answering benchmarks while keeping the knowledge source explicit and editable.
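
The graph-to-text route can be pictured with a minimal sketch. It assumes a simple "head relation tail." sentence template and an underscore-separated relation label; both are illustrative assumptions, and the paper's actual templates and preprocessing may differ.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    head: str      # name of the source concept
    relation: str  # relation label (underscore spelling is an assumption)
    tail: str      # name of the target concept


def textualize(triple: Triple) -> str:
    """Render one triple as a plain sentence (assumed template, not the paper's)."""
    relation_text = triple.relation.replace("_", " ")
    return f"{triple.head} {relation_text} {triple.tail}."


def build_corpus(triples: Iterable[Triple]) -> list[str]:
    """Textualize every triple, dropping exact duplicate sentences, keeping order."""
    seen: set[str] = set()
    corpus: list[str] = []
    for triple in triples:
        sentence = textualize(triple)
        if sentence not in seen:
            seen.add(sentence)
            corpus.append(sentence)
    return corpus


# Illustrative input: the same triple twice, to show de-duplication.
triples = [
    Triple("Autoimmune diseases", "cause_of", "Autoimmune opsoclonus myoclonus"),
    Triple("Autoimmune diseases", "cause_of", "Autoimmune opsoclonus myoclonus"),
]
corpus = build_corpus(triples)
```

Sentences produced this way carry the relation only as surface text, which is why the load-bearing premise further down asks whether the model absorbs the structure or just more biomedical wording.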

Core claim

A UMLS knowledge graph containing 3.4 million concepts and 34.2 million relations is stored in Neo4j, and a textual version of the same graph is used to continually pretrain BERTUMLS and BioBERTUMLS. When the same graph is instead consulted at inference time via GraphRAG, LLaMA 3-8B gains more than 3 accuracy points on PubMedQA and 5 points on BioASQ with no parameter updates. The pretraining route shows clear but smaller and more model-dependent benefits on BLURB tasks.

What carries the argument

The GraphRAG pipeline that retrieves multi-hop paths from the Neo4j UMLS graph at inference time and supplies them as context to the frozen language model.
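
A rough sketch of the retrieve-then-prompt pattern follows. It swaps the Neo4j store for a tiny in-memory adjacency list so that it runs on its own; the concept names, relation labels, prompt wording, and the two-hop limit are all placeholders chosen for illustration, not details taken from the paper.

```python
from collections import deque

# Toy stand-in for the graph: concept -> list of (relation, neighbor).
# Every name here is a placeholder, not a UMLS concept.
GRAPH = {
    "drug_x": [("inhibits", "enzyme_y")],
    "enzyme_y": [("associated_with", "disease_z")],
    "disease_z": [],
}


def multi_hop_paths(graph, seed, max_hops=2):
    """Breadth-first enumeration of cycle-free paths of 1..max_hops edges from seed."""
    paths = []
    queue = deque([(seed, [])])  # (current node, path so far as a list of triples)
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget exhausted for this path
        visited = {seed} | {tail for _, _, tail in path}
        for relation, neighbor in graph.get(node, []):
            if neighbor in visited:
                continue  # skip cycles
            new_path = path + [(node, relation, neighbor)]
            paths.append(new_path)
            queue.append((neighbor, new_path))
    return paths


def format_context(paths):
    """One line per path, each hop written as head --relation--> tail."""
    return "\n".join(
        " ; ".join(f"{h} --{r}--> {t}" for h, r, t in path) for path in paths
    )


def build_prompt(question, paths):
    """Place the retrieved paths ahead of the question for a frozen model."""
    return (
        "Use the knowledge-graph facts below to answer.\n"
        f"Facts:\n{format_context(paths)}\n"
        f"Question: {question}\nAnswer:"
    )


paths = multi_hop_paths(GRAPH, "drug_x", max_hops=2)
prompt = build_prompt("Is drug_x relevant to disease_z?", paths)
```

Because the retrieved paths are plain text in the prompt, they can be logged and inspected, which is the transparency property the review highlights.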

If this is right

  • Knowledge can be refreshed by editing the graph without any model retraining.
  • Multi-hop relational reasoning becomes available through explicit graph traversal during retrieval.
  • Models that already encode substantial domain text show smaller additional benefit from the pretraining route.
  • Inference-time graph access preserves transparency because the retrieved relations can be inspected and audited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-retrieval pattern could be tested on other domains that maintain large curated relation stores such as chemistry or regulatory data.
  • For models larger than 8B parameters the cost of continual pretraining may be avoided altogether by relying on GraphRAG for fact access.
  • A hybrid schedule that first pretrains on the text corpus and then adds GraphRAG at inference could combine the two observed benefits.

Load-bearing premise

That the 100-million-token text corpus extracted from the UMLS graph actually embeds the structured relations into model parameters rather than simply supplying more generic biomedical text.

What would settle it

Train a control model on an equal volume of unstructured biomedical text drawn from the same sources and verify whether it matches the PubMedQA and BioASQ gains reported for the UMLS-derived pretraining run.

Figures

Figures reproduced from arXiv: 2604.16422 by Jaafer Klila, Lamia Hadrich Belguith, Nasredine Semmar, Rahma Boujelben, Sondes Bannour Souihi.

Figure 1: Overview of the UMLS-Based Knowledge Injection Pipeline: Continual Pretraining on …
Figure 2: Pipeline for constructing a biomedical knowledge graph from UMLS Metathesaurus.
  Excerpt from the surrounding text: … four tables to associate each concept with its names, synonyms, relationships, and semantic types. The resulting data was then stored as a graph in a Neo4j database. Graph Storage. The resulting Knowledge Graph, comprising 3,389,266 concepts and 1,005 unique relationship types, was stored in a Neo4j graph database. This storage…
Figure 3: Example of generating a textualized triple
Figure 5: A real-world biomedical question from BioASQ and the corresponding subgraph extracted from our Neo4j knowledge graph.
Original abstract

The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG) that consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields over than 3 points accuracy on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper constructs a UMLS-derived knowledge graph (3.4M concepts, 34.2M relations) stored in Neo4j and derives a ~100M-token textual corpus from it. It compares two injection strategies: continual pretraining of BERT and BioBERT on this corpus (yielding BERTUMLS and BioBERTUMLS), evaluated on six BLURB tasks, versus GraphRAG (Neo4j multi-hop queries) applied at inference to LLaMA 3-8B, evaluated on PubMedQA and BioASQ. The central claim is that GraphRAG delivers >3 accuracy points on PubMedQA and >5 on BioASQ with no retraining, while pretraining shows gains mainly for the base BERT model.

Significance. If the results hold after proper controls, the work would usefully demonstrate that inference-time graph retrieval can provide transparent, multi-hop, and easily updated biomedical knowledge access that complements parameter-based injection. The public release of the processed UMLS Neo4j graph is a concrete strength for reproducibility. The significance is currently limited by the absence of ablations that isolate the contribution of graph structure versus generic text augmentation.

major comments (2)
  1. Results section (GraphRAG evaluation on PubMedQA/BioASQ): the headline claim that GraphRAG yields >3 points on PubMedQA and >5 on BioASQ rests on the untested premise that structured multi-hop relations (rather than additional biomedical passages) drive the gains. No non-graph RAG baseline (e.g., BM25 or embedding retrieval over the identical ~100M-token UMLS-derived corpus) is reported, so the specific value of the graph cannot be isolated.
  2. Methods and experimental setup: no training hyperparameters, baseline implementation details, statistical significance tests, or data-leakage analysis between the derived corpus and BLURB/PubMedQA/BioASQ benchmarks are provided. These omissions prevent assessment of whether the reported BLURB improvements for BERTUMLS are robust or reproducible.
minor comments (1)
  1. Abstract: 'over than 3 points' is grammatically incorrect and should read 'more than 3 points'.
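
The non-graph baseline requested in major comment 1 could look roughly like the sketch below: a from-scratch Okapi BM25 ranker over textualized sentences. The tokenizer, the parameter values k1=1.5 and b=0.75, the smoothed IDF variant, and the placeholder sentences are assumptions made for illustration; a real comparison would run over the full UMLS-derived corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric/underscore runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class BM25:
    """Okapi BM25 with the non-negative IDF variant ln((N - n + 0.5)/(n + 0.5) + 1)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n_docs = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Indices of the k highest-scoring documents, best first."""
        order = sorted(
            range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True
        )
        return order[:k]


# Placeholder sentences standing in for textualized triples.
docs = [
    "drug_x inhibits enzyme_y.",
    "enzyme_y associated with disease_z.",
    "concept_a is a concept_b.",
]
bm25 = BM25(docs)
top = bm25.top_k("drug_x enzyme_y", k=2)
```

Feeding the top-ranked sentences into the same prompt used for GraphRAG would give a like-for-like comparison in which only the retrieval mechanism changes.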

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the need for stronger controls and details to support our claims. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Results section (GraphRAG evaluation on PubMedQA/BioASQ): the headline claim that GraphRAG yields >3 points on PubMedQA and >5 on BioASQ rests on the untested premise that structured multi-hop relations (rather than additional biomedical passages) drive the gains. No non-graph RAG baseline (e.g., BM25 or embedding retrieval over the identical ~100M-token UMLS-derived corpus) is reported, so the specific value of the graph cannot be isolated.

    Authors: We agree that a non-graph baseline is required to isolate the contribution of graph structure versus generic text augmentation. In the revised manuscript we will add BM25 and dense embedding retrieval baselines over the identical UMLS-derived corpus and report their performance on PubMedQA and BioASQ alongside GraphRAG. This will allow direct quantification of any additional benefit from multi-hop graph queries.
    Revision: yes

  2. Referee: Methods and experimental setup: no training hyperparameters, baseline implementation details, statistical significance tests, or data-leakage analysis between the derived corpus and BLURB/PubMedQA/BioASQ benchmarks are provided. These omissions prevent assessment of whether the reported BLURB improvements for BERTUMLS are robust or reproducible.

    Authors: We acknowledge these omissions. The revised version will include complete training hyperparameters for continual pretraining, full baseline implementation details, statistical significance tests (paired t-tests and McNemar where appropriate), and an explicit data-leakage analysis checking n-gram overlap between the UMLS corpus and the evaluation sets. We will also release the preprocessing and training code to support reproducibility.
    Revision: yes
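
Two of the checks promised in response 2 can be sketched in a few lines: an exact two-sided McNemar test on paired per-question correctness, and an n-gram overlap fraction as a crude leakage signal. Whitespace tokenization, the trigram default, and the function names are illustrative assumptions, not the authors' planned procedure.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count questions where exactly one of the two systems is correct."""
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for a_ok, b_ok in pairs if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in pairs if not a_ok and b_ok)
    return b, c


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def ngrams(text, n):
    """Set of word n-grams under lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_text, corpus_texts, n=3):
    """Share of the evaluation item's n-grams that also occur in the corpus."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(t, n) for t in corpus_texts))
    return len(eval_grams & corpus_grams) / len(eval_grams)


# Illustrative paired outcomes for a baseline and an augmented system.
baseline = [False, False, False, False, False, True]
augmented = [True, True, True, True, True, True]
b, c = discordant_counts(baseline, augmented)
p_value = mcnemar_exact(b, c)
leak = overlap_fraction("a b c d", ["x a b c y"], n=3)
```

McNemar's test uses only the questions on which the two systems disagree, so it suits a comparison of the same model with and without retrieval on a fixed test set.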

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmarks and no self-referential derivations

The paper constructs a UMLS graph and derived text corpus, performs continual pretraining on BERT/BioBERT variants, and evaluates GraphRAG on LLaMA 3-8B using PubMedQA and BioASQ. No equations, fitted parameters, or derivations appear that reduce by construction to the target results. BLURB and QA benchmarks are independent external datasets. No self-citation chains, uniqueness theorems, or ansatz smuggling are present in the manuscript. The claims rest on reported accuracy deltas rather than tautological redefinitions or renamings of known patterns. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that UMLS provides reliable structured knowledge and that converting the graph to text preserves enough relational structure for effective injection during pretraining.

axioms (1)
  • domain assumption: UMLS Metathesaurus supplies accurate and comprehensive structured biomedical knowledge suitable for graph construction and text derivation.
    Invoked when building the 3.4M-concept graph and deriving the pretraining corpus from it.

pith-pipeline@v0.9.0 · 5627 in / 1267 out tokens · 38672 ms · 2026-05-13T20:07:03.650548+00:00 · methodology

