arxiv: 2510.09580 · v2 · submitted 2025-10-10 · 💻 cs.AI · cs.CL

GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Pith reviewed 2026-05-18 07:33 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords knowledge graphs · neurosymbolic AI · encoder-only model · distillation · unstructured data · factual accuracy · validity

The pith

A compact graphical model extracts more reliable knowledge graphs from text than much larger language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GraphMERT as an 80M-parameter encoder-only model that turns unstructured text into factual and ontology-consistent knowledge graphs. It claims this creates the first scalable neurosymbolic system by pairing neural abstraction learning with explicit symbolic representations for verifiable reasoning. A sympathetic reader would care because it addresses the long-standing problem of building trustworthy, interpretable KGs without the hallucinations and prompt sensitivity common in large LLMs. The work demonstrates this on PubMed diabetes papers, where the small model produces KGs with substantially higher factual accuracy and validity than a 32B baseline.

Core claim

GraphMERT is a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations, forming a modular neurosymbolic stack in which neural components learn abstractions while symbolic KGs support verifiable reasoning, and it is the first efficient system to reach state-of-the-art benchmark accuracy with superior symbolic quality.

What carries the argument

GraphMERT, the 80M-parameter graphical encoder-only model that produces domain-specific KGs which are both factual with provenance and valid with ontology-consistent relations.
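
As a rough illustration of what "factual with provenance" and "valid with ontology-consistent relations" could mean for a single triple, here is a minimal Python sketch. The `Triple` fields, the `ALLOWED` schema, the entity type names, and the example source string are all hypothetical and chosen for illustration; none of them is taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: relation -> allowed (head_type, tail_type) pairs.
# These names are illustrative assumptions, not the paper's schema.
ALLOWED = {
    "treats": {("Substance", "Disorder")},
    "associated_with": {("Disorder", "Disorder"), ("Substance", "Disorder")},
}


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str
    head_type: str
    tail_type: str
    source: str = ""  # provenance: a text span or document ID (assumed format)


def has_provenance(t: Triple) -> bool:
    """A triple counts as traceable here if it carries a non-empty source."""
    return bool(t.source.strip())


def is_ontology_consistent(t: Triple) -> bool:
    """The relation must be known and must accept this head/tail type pair."""
    return (t.head_type, t.tail_type) in ALLOWED.get(t.relation, set())


good = Triple("metformin", "treats", "type 2 diabetes",
              "Substance", "Disorder", source="doc-001, sentence 3")
bad = Triple("type 2 diabetes", "treats", "metformin",
             "Disorder", "Substance")  # reversed direction, no source

print(has_provenance(good), is_ontology_consistent(good))
print(has_provenance(bad), is_ontology_consistent(bad))
```

In this toy setup the two checks are independent: a triple can carry a source and still violate the schema, or fit the schema with nothing to trace it to.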

If this is right

  • Neurosymbolic systems can scale efficiently by using compact neural models to generate explicit symbolic KGs.
  • Domain-specific KGs for fields like medicine become practical to build automatically with higher reliability than current LLM methods.
  • Prompt sensitivity and hallucinated relations in KG extraction are reduced by distilling from both text and internal model representations.
  • Verifiable reasoning becomes available in applications that combine the neural encoder with the resulting symbolic graph.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular design could extend to other scientific domains by retraining GraphMERT on domain corpora while keeping the same symbolic validation layer.
  • Integration with existing ontologies might further boost validity without increasing model size.
  • The efficiency gains suggest testing whether similar small models can replace larger ones in additional neurosymbolic pipelines.

Load-bearing premise

The chosen metrics and baseline prompting fully capture KG reliability without hidden advantages for the proposed model.

What would settle it

An experiment that optimizes prompting and any post-processing for the 32B LLM baseline on the identical PubMed diabetes corpus and shows it matching or exceeding GraphMERT's FActScore and ValidityScore.
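
The settling criterion can be written down as a simple check. The sketch below uses the four scores reported in the abstract; the function names `gaps` and `baseline_settles_it` and the dictionary layout are assumptions made for illustration.

```python
# Scores (in percent) as reported in the paper's abstract.
GRAPHMERT = {"FActScore": 69.8, "ValidityScore": 68.8}
LLM_BASELINE = {"FActScore": 40.2, "ValidityScore": 43.0}


def gaps(ours, baseline):
    """Absolute gap in percentage points for each metric."""
    return {m: round(ours[m] - baseline[m], 1) for m in ours}


def baseline_settles_it(ours, baseline):
    """True only if the baseline matches or exceeds `ours` on every metric."""
    return all(baseline[m] >= ours[m] for m in ours)


print(gaps(GRAPHMERT, LLM_BASELINE))
print(baseline_settles_it(GRAPHMERT, LLM_BASELINE))
```

With the reported numbers, the gaps are 29.6 and 25.8 percentage points and the check returns `False`. A re-run with an optimized baseline would supply new values for `LLM_BASELINE`.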

Figures

Figures reproduced from arXiv: 2510.09580 by Jiaxin Xiao, Margarita Belova, Niraj K. Jha, Shikhar Tuli.

Figure 1: A toy KG example from the medical domain.
Figure 2: Overview of the GraphMERT framework. It is trained on the fusion of syntactic and semantic examples (II) and augments syntactic data with semantic tails (I); an LLM helps determine the linguistic structure of tails proposed by GraphMERT (III). (I): Chain graph (Ic) combines syntactic knowledge from text corpora (Ib) with semantic examples and relations from a seed KG (Ia): Roots hold syntactic knowledge (i…
Figure 3: Chain graph. Roots are in orange, leaves are in blue. Conceptual representation (A, B): term…
Figure 4: Main GraphMERT architectural components. GraphMERT is a RoBERTa transformer with two modifications. (I) In the embedding layer, H-GAT encodes semantic triples. (IA) There are no leaves connected to a root node; hence, the node feature is equal to the token embedding. (IB) There are leaves connected to a root node; H-GAT fuses leaf, relation, and head embeddings, resulting in a fused node feature. (II) In the …
Figure 5: Semantic embedding derivation on leaves (only three leaves are shown).
Figure 6: Relation embedding training. The sequence with updated leaf embeddings is passed to the trans…
Figure 7: Data preparation for GraphMERT. To find the most relevant triples, we perform semantic similarity matching of triples to dataset sequences. The triple head should almost literally match one of the entities discovered in Step (I); from them, we pick the top triples whose tails are semantically close to the sequence. All matched triples are subject to the injection algorithm (III), which selects the top-scor…
Figure 8: GraphMERT Pipeline flowchart with temporal execution ordering of the main components.
Figure 9: Prediction of triple tails. The trained GraphMERT predicts the top k tokens for a masked leaf and the chosen relation, resulting in a set of raw triples with the same head.
Figure 10: Leafy chain graph encoded sequentially: 7-leaf case. The sequence has a fixed length of 1024.
Figure 11: I. Forming triple tails for a given sequence with…
Figure 12: Validity check with GPT-5 Thinking, 100 random triples per keyword. The keywords are lined on…
Figure 13: GraphRAG accuracies for different α and β. Bubble size corresponds to absolute accuracy and color indicates the accuracy gain relative to the LLM KG baseline (red denotes positive gain, blue denotes negative gain).

Original abstract

Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades. A marriage of the neural and symbolic components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side of the problem. However, automatically deriving reliable KGs from text corpora remains an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GraphMERT, an 80M-parameter graphical encoder-only model that distills reliable knowledge graphs from unstructured text. It positions GraphMERT + KG as the first efficient scalable neurosymbolic system achieving SOTA benchmark accuracy with superior symbolic output, concretely reporting 69.8% FActScore and 68.8% ValidityScore on PubMed diabetes text versus 40.2% and 43.0% for a 32B LLM baseline.

Significance. If the performance claims are substantiated with full experimental details, the work would be significant for neurosymbolic AI by showing that a compact graphical encoder can produce more factual and ontology-consistent KGs than much larger LLMs, offering a practical path to scalable, interpretable knowledge extraction from domain text.

major comments (2)
  1. [Abstract] Abstract: The abstract reports specific numerical results (69.8% FActScore, 68.8% ValidityScore for GraphMERT vs. 40.2%/43.0% for Qwen3-32B) but supplies no description of training procedure, architecture details, data splits, or statistical significance. This absence prevents verification of the central reliability advantage.
  2. [Abstract] Abstract: The superiority claim rests on the LLM baseline being prompted representatively, yet the text provides no prompt text, system instructions, few-shot examples, temperature settings, or ablation results on prompting variants, despite the paper's own emphasis on LLM prompt sensitivity and hallucinated relations. This leaves open whether the gap isolates GraphMERT's contribution or reflects unequal elicitation effort.
minor comments (1)
  1. [Abstract] The abstract could clarify the exact definition and computation of FActScore and ValidityScore, including any post-processing steps, to allow readers to assess how fully they capture reliability.
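
To make the statistical-significance concern in the first major comment concrete, here is a minimal sketch of a two-proportion z-test applied to the reported FActScores. It rests on two loud assumptions: the sample size `N` is hypothetical, because the abstract does not say how many triples were evaluated, and each score is treated as a plain proportion of independent binary judgments, which may not match how FActScore is actually computed. The function name `two_proportion_z` is illustrative.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference of two independent proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Reported FActScores expressed as proportions.
# N is a HYPOTHETICAL sample size, not a figure from the paper.
N = 200
z = two_proportion_z(0.698, N, 0.402, N)
print(f"z = {z:.2f} (assumed n = {N} per system)")
```

The value of `z` depends directly on the assumed `N`, which is why the actual evaluation sizes matter for verifying the claim.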

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We value the feedback on improving the clarity and verifiability of our claims regarding GraphMERT. We address each major comment below and commit to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract reports specific numerical results (69.8% FActScore, 68.8% ValidityScore for GraphMERT vs. 40.2%/43.0% for Qwen3-32B) but supplies no description of training procedure, architecture details, data splits, or statistical significance. This absence prevents verification of the central reliability advantage.

    Authors: We acknowledge that the abstract's brevity precludes inclusion of full methodological details, which limits immediate verification from the abstract alone. The complete manuscript details the training procedure in Section 3, architecture in Section 2, data splits in Section 4.1, and statistical significance testing in Section 5.3. To directly address this concern, we will revise the abstract to include concise references to these elements and explicit pointers to the relevant sections for full verification. revision: yes

  2. Referee: [Abstract] Abstract: The superiority claim rests on the LLM baseline being prompted representatively, yet the text provides no prompt text, system instructions, few-shot examples, temperature settings, or ablation results on prompting variants, despite the paper's own emphasis on LLM prompt sensitivity and hallucinated relations. This leaves open whether the gap isolates GraphMERT's contribution or reflects unequal elicitation effort.

    Authors: We agree this is an important point for ensuring a fair and reproducible comparison, particularly given the manuscript's discussion of LLM prompt sensitivity. The current version does not include the exact prompt configuration used for the Qwen3-32B baseline. In the revised manuscript, we will add the full prompt template, system instructions, any few-shot examples, temperature settings, and results from prompting ablations to the appendix or a dedicated experimental subsection. This will substantiate that the reported performance gap reflects GraphMERT's advantages rather than differences in elicitation effort. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new model and empirical benchmarks without reduction to inputs by construction

full rationale

The paper introduces GraphMERT as a new 80M-parameter graphical encoder-only model for distilling KGs from unstructured text, then reports concrete empirical results on PubMed diabetes text using FActScore (69.8% vs. 40.2%) and ValidityScore (68.8% vs. 43.0%) against a 32B LLM baseline. No derivation chain, equations, or self-referential steps appear in the provided text that would make any prediction equivalent to fitted inputs or prior self-citations by construction. The neurosymbolic stack description and SOTA claim are presented as outcomes of the new architecture and evaluation, not as tautological renamings or load-bearing self-citations. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities beyond the model name itself; all supporting details on training objectives, ontology definitions, and scoring functions are absent, leaving the ledger empty pending full text.

pith-pipeline@v0.9.0 · 5881 in / 1414 out tokens · 32076 ms · 2026-05-18T07:33:46.142514+00:00 · methodology
