arxiv: 2604.08552 · v1 · submitted 2026-03-10 · 💻 cs.DB · cs.AI

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords biomedical metadata · LLM agent · terminology services · ontology constraints · metadata standardization · FAIR data · legacy records · HuBMAP

The pith

Augmenting an LLM with real-time queries to biomedical terminology services improves metadata standardization accuracy over the model alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that an LLM agent equipped with live tool calls to authoritative terminology services produces higher-accuracy standardized metadata than an LLM relying solely on static prompts and its training data. This matters because incomplete or non-standard metadata in biomedical datasets blocks findability, interoperability, and reuse under FAIR principles. The system converts ontology constraints into dynamic queries that fetch canonically correct terms on demand instead of treating them as fixed text. Evaluation across 839 HuBMAP legacy records against an expert gold standard shows consistent gains for both ontology-constrained fields and free-text fields.

Core claim

We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields.

What carries the argument

LLM agent with real-time tool access to biomedical terminology services that fetches canonically correct terms to enforce metadata field constraints.
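
The lookup step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the in-memory dictionary stands in for a live terminology service, and `MOCK_TERMINOLOGY`, `search_terminology`, `standardize_value`, and the `EX:` identifiers are all hypothetical names invented here. The model's guess is passed in as a plain string so the sketch runs without any LLM.

```python
from __future__ import annotations

# Hypothetical in-memory stand-in for a live terminology service.
# Labels, synonyms, and "EX:" identifiers are invented for illustration.
MOCK_TERMINOLOGY = {
    "kidney": ("kidney", "EX:0001"),
    "renal": ("kidney", "EX:0001"),
    "lymph node": ("lymph node", "EX:0002"),
    "ln": ("lymph node", "EX:0002"),
}


def search_terminology(query: str) -> tuple[str, str] | None:
    """Hypothetical tool call: return (canonical label, id) or None."""
    return MOCK_TERMINOLOGY.get(query.strip().lower())


def standardize_value(raw_value: str, model_guess: str) -> str:
    """Return a tool-confirmed canonical label when one exists.

    The model's guess is checked first, then the raw legacy value.
    If neither resolves, the unverified guess is returned unchanged.
    """
    for candidate in (model_guess, raw_value):
        hit = search_terminology(candidate)
        if hit is not None:
            return hit[0]
    return model_guess


print(standardize_value("Renal", "kidney tissue"))
print(standardize_value("LN", "Lymph Node"))
```

The design point the sketch tries to capture is that the canonical label comes from the service response rather than from the model's memory; the fallback to the unverified guess is one possible policy, and a real system might instead flag the value for review.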

Load-bearing premise

The expert-curated gold standard is treated as ground truth and the real-time terminology services always return canonically correct terms without introducing new errors.

What would settle it

Evaluating the system on an independent set of legacy records where terminology service outputs diverge from the gold standard and observing no accuracy gain from tool access would falsify the claim.

Original abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.
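
The exact-match assessment mentioned in the abstract can be illustrated with a short sketch. It assumes strict, case-sensitive string equality per field, which may differ from the normalization the authors applied; the function name `exact_match_accuracy` and the toy records are hypothetical and are not taken from the HuBMAP data.

```python
from collections import defaultdict


def exact_match_accuracy(predictions, gold):
    """Per-field exact-match accuracy over paired records.

    `predictions` and `gold` are equal-length lists of dicts mapping
    field name -> value. A missing predicted field counts as a miss.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ref in zip(predictions, gold):
        for field, expected in ref.items():
            total[field] += 1
            if pred.get(field) == expected:  # assumption: strict equality
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Toy records invented for illustration only.
gold = [
    {"organ": "kidney", "preparation": "frozen"},
    {"organ": "liver", "preparation": "fixed"},
]
predicted = [
    {"organ": "kidney", "preparation": "Frozen"},
    {"organ": "liver", "preparation": "fixed"},
]
print(exact_match_accuracy(predicted, gold))
```

Running the same function on the LLM-only and tool-augmented outputs would give the two per-field accuracy tables that the comparison rests on.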

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an LLM-based agent for standardizing legacy biomedical metadata that augments the model with real-time queries to authoritative terminology services (e.g., for ontology-constrained fields). It evaluates the approach on 839 HuBMAP legacy records against an expert-curated gold standard using exact-match accuracy, claiming consistent improvements over the base LLM alone for both ontology-constrained and non-ontology-constrained fields.

Significance. If the empirical results hold, the work demonstrates a practical, scalable method for producing machine-actionable FAIR metadata from non-compliant legacy records. The real-time tool-use design directly addresses the limitation of static prompt-based ontology constraints noted in prior work, and the evaluation on real HuBMAP data provides a concrete test of applicability in a high-stakes biomedical domain.

major comments (2)
  1. [Evaluation] Evaluation section: the manuscript reports consistent accuracy gains but supplies no quantitative metrics (exact-match percentages, per-field breakdowns, confidence intervals, or statistical significance tests) in the main text or tables; without these numbers the central claim cannot be assessed for effect size or robustness.
  2. [Methods] Methods section on gold-standard construction: the protocol for selecting and expert-curating the 839 HuBMAP records is described at too high a level; reproducibility requires explicit criteria for record inclusion, inter-annotator agreement statistics, and handling of ambiguous terms.
minor comments (2)
  1. [System Architecture] The terminology services queried (e.g., specific endpoints or versions) should be named with URLs or DOIs for reproducibility.
  2. [Results] Figure captions and axis labels in the results figures need to state the exact metric (exact-match accuracy) and sample size (n=839) explicitly.
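
The confidence intervals and significance tests requested in major comment 1 could be computed along the following lines. This sketch assumes a paired design in which both systems are scored on the same records, uses an exact (binomial) McNemar test on the discordant pairs, and uses the Wilson score interval for a single accuracy; these are common choices, not necessarily the ones the authors will adopt. The function names are hypothetical and any counts passed in are illustrative, not results from the paper.

```python
import math


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: records the baseline got right and the augmented system got wrong.
    c: records the baseline got wrong and the augmented system got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative counts only; not taken from the paper.
print(mcnemar_exact(1, 9))
print(wilson_interval(50, 100))
```

Only the discordant pairs enter McNemar's test, so reporting `b` and `c` alongside the two accuracies would let readers recompute the p-value themselves.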

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and methods.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the manuscript reports consistent accuracy gains but supplies no quantitative metrics (exact-match percentages, per-field breakdowns, confidence intervals, or statistical significance tests) in the main text or tables; without these numbers the central claim cannot be assessed for effect size or robustness.

    Authors: We agree that quantitative metrics are required to allow readers to evaluate effect size and robustness. The original manuscript emphasized the direction of improvement without including the supporting numbers in the main text. In the revised version we will add exact-match accuracy percentages for the base LLM and the tool-augmented system, per-field breakdowns, 95% confidence intervals, and statistical significance results (McNemar’s test) to the Evaluation section, accompanied by a new summary table. revision: yes

  2. Referee: [Methods] Methods section on gold-standard construction: the protocol for selecting and expert-curating the 839 HuBMAP records is described at too high a level; reproducibility requires explicit criteria for record inclusion, inter-annotator agreement statistics, and handling of ambiguous terms.

    Authors: We acknowledge that the gold-standard protocol is presented at a high level. The 839 records were chosen as a representative subset of legacy HuBMAP metadata and curated by domain experts. In the revision we will expand the Methods section to state the explicit inclusion criteria, report inter-annotator agreement (Cohen’s kappa), and describe the consensus procedure used for ambiguous terms. revision: yes
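
For reference, the inter-annotator agreement statistic named in the second response, Cohen's kappa, can be computed as in the sketch below. It assumes exactly two annotators who each assign one categorical label per item; the function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration only.
annotator_1 = ["kidney", "kidney", "liver", "liver"]
annotator_2 = ["kidney", "kidney", "liver", "kidney"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when a few vocabulary terms dominate a field.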

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents a systems description and empirical evaluation of an LLM agent augmented with real-time terminology service queries for biomedical metadata standardization. The central result—an observed accuracy improvement on 839 HuBMAP records against an expert-curated gold standard—is obtained through direct experimental comparison of LLM-only versus tool-augmented conditions using exact-match metrics. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation; the evaluation protocol relies on external data and services rather than reducing to internal definitions or prior author results by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new postulated entities appear in the abstract; the system uses existing LLMs and external terminology services.

pith-pipeline@v0.9.0 · 5494 in / 991 out tokens · 23033 ms · 2026-05-15T12:37:55.673155+00:00 · methodology

