Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Jean G. Rosario; Josef Hardi; Marcos Martinez-Romero; Mark A. Musen; Martin J. O'Connor; Stephen A. Fisher

arXiv: 2604.08552 · v1 · submitted 2026-03-10 · cs.DB · cs.AI

Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3

keywords: biomedical metadata, LLM agent, terminology services, ontology constraints, metadata standardization, FAIR data, legacy records, HuBMAP

The pith

Augmenting an LLM with real-time queries to biomedical terminology services improves metadata standardization accuracy over the model alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that an LLM agent equipped with live tool calls to authoritative terminology services produces higher-accuracy standardized metadata than an LLM relying solely on static prompts and its training data. This matters because incomplete or non-standard metadata in biomedical datasets blocks findability, interoperability, and reuse under FAIR principles. The system converts ontology constraints into dynamic queries that fetch canonically correct terms on demand instead of treating them as fixed text. Evaluation across 839 HuBMAP legacy records against an expert gold standard shows consistent gains for both ontology-constrained fields and free-text fields.

Core claim

We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields.

What carries the argument

LLM agent with real-time tool access to biomedical terminology services that fetches canonically correct terms to enforce metadata field constraints.
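
The lookup pattern described here can be illustrated with a minimal sketch. It is not the authors' implementation: the `TerminologyService` interface, the `InMemoryService` stand-in, the `standardize` helper, the ontology name `EXAMPLE_ONT`, and the placeholder IRI are all hypothetical, and a real system would issue a live query to a terminology service instead of reading a dictionary.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Term:
    label: str  # canonical preferred label
    iri: str    # term identifier (placeholder value in this sketch)


class TerminologyService(Protocol):
    """Hypothetical interface for any service that resolves text to a term."""

    def search(self, text: str, ontology: str) -> Optional[Term]: ...


class InMemoryService:
    """Stand-in for a live terminology service (assumption, for illustration only)."""

    def __init__(self, index: dict[tuple[str, str], Term]) -> None:
        self._index = index

    def search(self, text: str, ontology: str) -> Optional[Term]:
        # Normalize whitespace and case before the lookup.
        return self._index.get((ontology, text.strip().lower()))


def standardize(raw_value: str, ontology: str,
                service: TerminologyService, llm_guess: str) -> str:
    """Prefer a term confirmed by the service; otherwise keep the model's guess."""
    for candidate in (raw_value, llm_guess):
        term = service.search(candidate, ontology)
        if term is not None:
            return term.label
    return llm_guess  # unverified fallback


service = InMemoryService({
    ("EXAMPLE_ONT", "kidney"): Term("kidney", "http://example.org/term/0001"),
})
print(standardize("Kidney ", "EXAMPLE_ONT", service, llm_guess="Kidneys"))
```

The point of the design is that the returned label comes from the service whenever a match exists, so the model's recollection of the vocabulary is used only as a last resort.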

Load-bearing premise

The expert-curated gold standard is treated as ground truth and the real-time terminology services always return canonically correct terms without introducing new errors.

What would settle it

Evaluating the system on an independent set of legacy records where terminology service outputs diverge from the gold standard and observing no accuracy gain from tool access would falsify the claim.

Original abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.
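
As a rough illustration of what exact-match assessment against a gold standard involves, the sketch below computes per-field accuracy. The function name, the record layout, and the toy field names and values are assumptions made for this example; they are not taken from the paper or from HuBMAP.

```python
from collections import defaultdict


def exact_match_by_field(predictions, gold):
    """Return {field: fraction of records whose prediction equals the gold value}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record_id, gold_fields in gold.items():
        predicted_fields = predictions.get(record_id, {})
        for field, gold_value in gold_fields.items():
            totals[field] += 1
            # Exact match: identical strings, no normalization or partial credit.
            if predicted_fields.get(field) == gold_value:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


# Toy, made-up records used only to exercise the function.
gold = {
    "rec1": {"organ": "kidney", "preparation": "frozen"},
    "rec2": {"organ": "heart", "preparation": "fixed"},
}
predictions = {
    "rec1": {"organ": "kidney", "preparation": "Frozen"},
    "rec2": {"organ": "heart", "preparation": "fixed"},
}
print(exact_match_by_field(predictions, gold))
# → {'organ': 1.0, 'preparation': 0.5}
```

Because the comparison is strict string equality, a casing difference such as "Frozen" versus "frozen" counts as a miss, which is one reason canonical labels fetched from a terminology service could matter under this metric.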

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an LLM-based agent for standardizing legacy biomedical metadata that augments the model with real-time queries to authoritative terminology services (for example, to resolve values for ontology-constrained fields). It evaluates the approach on 839 HuBMAP legacy records against an expert-curated gold standard using exact-match accuracy, and claims consistent improvements over the base LLM alone for both ontology-constrained and non-ontology-constrained fields.

Significance. If the empirical results hold, the work demonstrates a practical, scalable method for producing machine-actionable FAIR metadata from non-compliant legacy records. The real-time tool-use design directly addresses the limitation of static prompt-based ontology constraints noted in prior work, and the evaluation on real HuBMAP data provides a concrete test of applicability in a high-stakes biomedical domain.

major comments (2)

[Evaluation] Evaluation section: the manuscript reports consistent accuracy gains but supplies no quantitative metrics (exact-match percentages, per-field breakdowns, confidence intervals, or statistical significance tests) in the main text or tables; without these numbers the central claim cannot be assessed for effect size or robustness.
[Methods] Methods section on gold-standard construction: the protocol for selecting and expert-curating the 839 HuBMAP records is described at too high a level; reproducibility requires explicit criteria for record inclusion, inter-annotator agreement statistics, and handling of ambiguous terms.

minor comments (2)

[System Architecture] The terminology services queried (e.g., specific endpoints or versions) should be named with URLs or DOIs for reproducibility.
[Results] Figure captions and axis labels in the results figures need to state the exact metric (exact-match accuracy) and sample size (n=839) explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and methods.

Referee: [Evaluation] Evaluation section: the manuscript reports consistent accuracy gains but supplies no quantitative metrics (exact-match percentages, per-field breakdowns, confidence intervals, or statistical significance tests) in the main text or tables; without these numbers the central claim cannot be assessed for effect size or robustness.

Authors: We agree that quantitative metrics are required to allow readers to evaluate effect size and robustness. The original manuscript emphasized the direction of improvement without including the supporting numbers in the main text. In the revised version we will add exact-match accuracy percentages for the base LLM and the tool-augmented system, per-field breakdowns, 95% confidence intervals, and statistical significance results (McNemar’s test) to the Evaluation section, accompanied by a new summary table.
Revision: yes
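
For readers unfamiliar with the statistics named in this response, here is a minimal standard-library sketch of an exact McNemar test on paired outcomes and a Wilson score interval for an accuracy estimate. The function names are hypothetical, the Wilson interval is one common choice of 95% interval (the response does not say which method the authors would use), and the counts passed in are placeholders, not results from the paper.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: records only the baseline got right.
    c: records only the tool-augmented system got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial tail under the null hypothesis that each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Placeholder counts for illustration only.
print(mcnemar_exact(0, 5))       # → 0.0625
print(wilson_interval(50, 100))
```

McNemar's test suits this setting because both systems are scored on the same records, so only the records where they disagree carry information about which one is better.
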
Referee: [Methods] Methods section on gold-standard construction: the protocol for selecting and expert-curating the 839 HuBMAP records is described at too high a level; reproducibility requires explicit criteria for record inclusion, inter-annotator agreement statistics, and handling of ambiguous terms.

Authors: We acknowledge that the gold-standard protocol is presented at a high level. The 839 records were chosen as a representative subset of legacy HuBMAP metadata and curated by domain experts. In the revision we will expand the Methods section to state the explicit inclusion criteria, report inter-annotator agreement (Cohen’s kappa), and describe the consensus procedure used for ambiguous terms.
Revision: yes
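
Cohen's kappa, the agreement statistic mentioned in this response, corrects raw agreement between two annotators for the agreement expected by chance. The sketch below is a generic two-annotator implementation with a hypothetical function name and made-up labels; it does not reflect the authors' curation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations for illustration.
annotator_1 = ["kidney", "kidney", "heart", "heart"]
annotator_2 = ["kidney", "heart", "heart", "heart"]
print(cohens_kappa(annotator_1, annotator_2))  # → 0.5
```

In the toy example the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.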

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a systems description and empirical evaluation of an LLM agent augmented with real-time terminology service queries for biomedical metadata standardization. The central result—an observed accuracy improvement on 839 HuBMAP records against an expert-curated gold standard—is obtained through direct experimental comparison of LLM-only versus tool-augmented conditions using exact-match metrics. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation; the evaluation protocol relies on external data and services rather than reducing to internal definitions or prior author results by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new postulated entities appear in the abstract; the system uses existing LLMs and external terminology services.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3:160018

[2] Gonçalves RS, Musen MA. The variable quality of metadata about biological samples used in biomedical experiments. Sci Data. 2019 Feb 19;6:190021

[3] Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet. 2001 Dec;29(4):365–71

[4] Musen MA, O’Connor MJ, Schultes E, Martínez-Romero M, Hardi J, Graybeal J. Modeling community standards for metadata as templates makes data FAIR. Sci Data. 2022 Nov 12;9(1):696

[5] Gonçalves RS, O’Connor MJ, Martínez-Romero M, Egyedi AL, Willrett D, Graybeal J, et al. The CEDAR Workbench: An ontology-assisted environment for authoring metadata that describe scientific experiments. Semant Web ISWC. 2017 Oct;10588:103–10

[6] Musen MA, O’Connor MJ, Hardi J, Martínez-Romero M. Knowledge engineering for open science: Building and deploying knowledge bases for metadata standards. AI Mag [Internet]. 2026 Mar;47(1). Available from: http://dx.doi.org/10.1002/aaai.70048

[7] Musen MA, Bean CA, Cheung KH, Dumontier M, Durante KA, Gevaert O, et al. The center for expanded data annotation and retrieval. J Am Med Inform Assoc. 2015 Nov;22(6):1148–52

[8] Sundaram SS, Solomon B, Khatri A, Laumas A, Khatri P, Musen MA. Structured knowledge base enhances effective use of large language models for metadata curation. AMIA Annu Symp Proc. 2024;2024:1050–8

[9] Sundaram SS, Gonçalves RS, Musen MA. Toward total recall: Enhancing data FAIRness through AI-driven metadata standardization. Gigascience. 03 2026;giag019

[10] Introducing the Model Context Protocol [Internet]. [cited 2026 Mar 2]. Available from: https://www.anthropic.com/news/model-context-protocol

[11] Vendetti J, Harris NL, Dorf MV, Skrenchuk A, Caufield JH, Gonçalves RS, et al. BioPortal: an open community resource for sharing, searching, and utilizing biomedical ontologies. Nucleic Acids Res. 2025 Jul 7;53(W1):W84–94

[12] Hier DB, Platt SK, Obafemi-Ajayi T. Predicting failures of LLMs to link biomedical ontology terms to identifiers: Evidence across models and ontologies. In: 2025 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE

[13] Dobbins NJ. Generalizable and scalable multistage biomedical concept normalization leveraging large language models. Res Synth Methods. 2025 May;16(3):479–90

[14] Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, et al. Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics [Internet]. 2024 Mar 4;40(3). Available from: https://doi.org/10.1093/bioinformatics/btae104

[15] HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature. 2019 Oct;574(7777):187–92

[16] langchain: The agent engineering platform [Internet]. Github; [cited 2026 Mar 9]. Available from: https://github.com/langchain-ai/langchain

[17] langgraph: Build resilient language agents as graphs [Internet]. Github; [cited 2026 Mar 9]. Available from: https://github.com/langchain-ai/langgraph

[18] LangSmith: AI Agent & LLM Observability Platform [Internet]. [cited 2026 Mar 9]. Available from: https://www.langchain.com/langsmith/
