
arxiv: 2306.16191 · v2 · submitted 2023-06-28 · 💻 cs.DL

OpenCitations Meta

Pith reviewed 2026-05-24 08:25 UTC · model grok-4.3

classification 💻 cs.DL
keywords OpenCitations · bibliographic metadata · Semantic Web · persistent identifiers · data curation · open science · SPARQL endpoint · CC0 license

The pith

OpenCitations Meta merges metadata from Crossref, DataCite and PubMed into the largest Semantic Web bibliographic database and assigns its own persistent identifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenCitations Meta as a new open database of bibliographic metadata for publications involved in OpenCitations citations. It draws records from Crossref, DataCite and PubMed, applies an automated curation process, and releases everything under CC0. By assigning OMIDs it disambiguates entries that appear under different external identifiers and covers publications that have none. Internal storage of the metadata removes the need for live external API calls during queries. The system also records full provenance for every change.

Core claim

OpenCitations Meta stores bibliographic metadata for scholarly publications cited within the OpenCitations infrastructure, following the OpenCitations Data Model and published under CC0. It ingests data from Crossref, DataCite and PubMed to become the largest bibliographic metadata collection that uses Semantic Web technologies. It creates OMIDs for every resource so that publications described by different external PIDs can be unified and so that works without external PIDs can still participate in citations. Metadata is hosted internally rather than fetched on demand, and an automated pipeline performs deduplication, error correction, enrichment and complete provenance tracking.
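The unification step described above can be illustrated with a minimal sketch. It assumes that two records describe the same publication whenever they share at least one external identifier, and it groups them with a union-find structure. The function name `assign_omids`, the input shape and the `omid:br/N` numbering are illustrative assumptions, not the pipeline reported in the paper.

```python
from itertools import count


def assign_omids(records):
    """Give one OMID-like label to every group of records sharing an external ID.

    `records` is a list of sets of external identifiers, e.g. {"doi:10.1000/x"}.
    Returns one label per input record, in input order.
    The "omid:br/N" numbering is a placeholder, not the real OMID scheme.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # external identifier -> index of the first record carrying it
    for idx, ext_ids in enumerate(records):
        for ext_id in ext_ids:
            if ext_id in first_seen:
                root_a, root_b = find(idx), find(first_seen[ext_id])
                if root_a != root_b:
                    # Merge the two groups, keeping the lower index as root.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_seen[ext_id] = idx

    numbering = count(1)
    label_of_root = {}
    labels = []
    for idx in range(len(records)):
        root = find(idx)
        if root not in label_of_root:
            label_of_root[root] = f"omid:br/{next(numbering)}"
        labels.append(label_of_root[root])
    return labels


records = [
    {"doi:10.1000/example"},                 # known only by DOI
    {"pmid:12345", "doi:10.1000/example"},   # carries both identifiers
    {"pmid:12345"},                          # known only by PMID
    set(),                                   # no external PID at all
    set(),                                   # another work without a PID
]
omids = assign_omids(records)
```

In this toy input the first three records collapse into one entity because the second record bridges the DOI and the PMID, while each record without an external PID still receives its own identifier.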

What carries the argument

OpenCitations Meta Identifiers (OMIDs) together with the automated curation pipeline that follows the OpenCitations Data Model.

If this is right

  • Publications described by different external PIDs such as a DOI and a PMID become a single record.
  • Citations involving publications that lack any external PID can still be recorded and queried.
  • Query responses no longer depend on live calls to external APIs, raising performance.
  • Every metadata change carries full provenance, making data integrity traceable.
  • Access is available through SPARQL, REST APIs and bulk dumps while remaining fully interoperable with other Semantic Web resources.
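To make the provenance point concrete, here is a small sketch of change tracking by snapshots. It assumes each update appends a snapshot recording its source, a timestamp and the old and new values, so earlier states can be rebuilt. The class names `Snapshot` and `TrackedRecord` and their fields are hypothetical and do not reproduce the provenance model the paper follows.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    number: int      # 1-based position in the record's history
    timestamp: str   # supplied by the caller, e.g. an ISO 8601 string
    source: str      # where the change came from
    changes: dict    # field name -> (old value, new value)


@dataclass
class TrackedRecord:
    omid: str
    metadata: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, new_values, source, timestamp):
        """Apply new_values and log a snapshot; return None if nothing changed."""
        changes = {
            key: (self.metadata.get(key), value)
            for key, value in new_values.items()
            if self.metadata.get(key) != value
        }
        if not changes:
            return None
        self.metadata.update(new_values)
        snapshot = Snapshot(len(self.history) + 1, timestamp, source, changes)
        self.history.append(snapshot)
        return snapshot

    def state_at(self, number):
        """Rebuild the metadata as it stood after snapshot `number`."""
        state = {}
        for snapshot in self.history[:number]:
            for key, (_old, new) in snapshot.changes.items():
                state[key] = new
        return state


rec = TrackedRecord("omid:br/1")
rec.update({"title": "Open Citations Meta"}, "Crossref", "2023-01-01T00:00:00Z")
rec.update({"title": "OpenCitations Meta", "year": "2023"}, "curation", "2023-01-02T00:00:00Z")
noop = rec.update({"year": "2023"}, "DataCite", "2023-01-03T00:00:00Z")
```

Storing only the changed fields keeps each snapshot small, and replaying snapshots in order is enough to recover any earlier state of the record.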

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analyses that combine citation links with bibliographic details can be performed inside a single local store rather than across multiple external services.
  • The CC0 release and provenance records create a foundation that other projects could reuse or extend without legal or technical friction.
  • If the curation pipeline proves reliable over time, the database could serve as a reference point for checking completeness of other open metadata collections.
  • The same internal-hosting pattern could be applied to citation data itself to further reduce external dependencies.

Load-bearing premise

The automated curation pipeline can deduplicate records, correct errors and enrich metadata from heterogeneous sources without introducing systematic new errors or losing coverage.

What would settle it

A sample audit that finds the same publication assigned two different OMIDs or that finds source records from Crossref, DataCite or PubMed that are absent from the Meta database after the claimed ingestion.
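Such an audit could be sketched roughly as follows, assuming the Meta data can be exported as (OMID, external identifier) pairs and that a sample of identifiers from the source datasets is available as a set. The function `audit`, its inputs and the sample values are illustrative assumptions, not an existing OpenCitations tool.

```python
from collections import defaultdict


def audit(meta_rows, source_ids):
    """Check a sample for split entities and for source records missing from Meta.

    meta_rows: iterable of (omid, external_id) pairs exported from the database.
    source_ids: external identifiers sampled from Crossref, DataCite or PubMed.
    Returns (split, missing):
      split   - external IDs attached to more than one OMID, with those OMIDs
      missing - sampled source IDs that appear under no OMID at all
    """
    omids_by_ext = defaultdict(set)
    for omid, ext_id in meta_rows:
        omids_by_ext[ext_id].add(omid)

    split = {
        ext_id: sorted(omids)
        for ext_id, omids in omids_by_ext.items()
        if len(omids) > 1
    }
    missing = sorted(set(source_ids).difference(omids_by_ext))
    return split, missing


meta_rows = [
    ("omid:br/1", "doi:10.1000/a"),
    ("omid:br/1", "pmid:111"),
    ("omid:br/2", "pmid:111"),      # same PMID under a second OMID
    ("omid:br/3", "doi:10.1000/b"),
]
source_ids = {"doi:10.1000/a", "doi:10.1000/b", "doi:10.1000/c", "pmid:111"}
split, missing = audit(meta_rows, source_ids)
```

A non-empty `split` would point to a deduplication failure, and a non-empty `missing` would point to a coverage gap; the check only catches duplicates that share an identifier, so duplicates with disjoint identifiers would need a separate metadata comparison.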

Original abstract

OpenCitations Meta is a new database for open bibliographic metadata of scholarly publications involved in the citations indexed by the OpenCitations infrastructure, adhering to Open Science principles and published under a CC0 license to promote maximum reuse. It presently incorporates bibliographic metadata for publications recorded in Crossref, DataCite and PubMed, making it the largest bibliographic metadata source using Semantic Web technologies. It assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to all bibliographic resources, enabling it both to disambiguate publications described using different external PIDs (e.g., a DOI in Crossref and a PMID in PubMed), and to handle citations involving publications lacking external PIDs. By hosting bibliographic metadata internally, OpenCitations Meta eliminates its former reliance on API calls to external resources and thus enhances performance in response to user queries. Its automated data curation, following the OpenCitations Data Model, includes deduplication, error correction, metadata enrichment and full provenance tracking, ensuring transparency and traceability of data and bolstering confidence in data integrity, a feature unparalleled in other bibliographic databases. Its commitment to Semantic Web standards ensures superior interoperability compared to other machine-readable formats, with availability via a SPARQL endpoint, REST APIs and data dumps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents OpenCitations Meta, a new open bibliographic metadata database adhering to Open Science principles and published under CC0. It aggregates metadata for publications from Crossref, DataCite and PubMed, assigns OMIDs to enable disambiguation across external PIDs and to handle publications without PIDs, hosts metadata internally to eliminate external API calls, performs automated curation (deduplication, error correction, enrichment, provenance tracking) per the OpenCitations Data Model, and exposes data via SPARQL endpoint, REST APIs and dumps. The abstract asserts that this makes it the largest bibliographic metadata source using Semantic Web technologies.

Significance. If the scale, curation accuracy and provenance claims hold, the work provides a substantial open infrastructure contribution: a large-scale, interoperable Semantic Web bibliographic resource that improves query performance over prior external-API reliance and offers transparent, traceable data not matched by other bibliographic databases. This directly supports reuse, interoperability and scholarly analysis under open-science principles.

major comments (1)
  1. [Abstract] Abstract: the claim that OpenCitations Meta is 'the largest bibliographic metadata source using Semantic Web technologies' is unsupported by any reported counts of unique publications, OMIDs or citations, and by any explicit comparison to other RDF-based collections (e.g., Wikidata scholarly items). Without these figures the size assertion remains unevaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of OpenCitations Meta. We address the single major comment below and will revise the manuscript to strengthen the unsupported claim.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that OpenCitations Meta is 'the largest bibliographic metadata source using Semantic Web technologies' is unsupported by any reported counts of unique publications, OMIDs or citations, and by any explicit comparison to other RDF-based collections (e.g., Wikidata scholarly items). Without these figures the size assertion remains unevaluated.

    Authors: We agree that the size claim in the abstract is currently unsupported, as the manuscript provides no explicit counts of unique publications, OMIDs or citations, nor any direct comparison to other Semantic Web resources such as Wikidata. In the revised manuscript we will add these quantitative figures (drawn from the integrated Crossref, DataCite and PubMed sources) together with a concise comparison to relevant RDF collections, either substantiating the claim or qualifying it appropriately.

    Revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted results; database construction paper with no self-referential predictions

Full rationale

The paper describes construction of OpenCitations Meta by ingesting and curating bibliographic metadata from external sources (Crossref, DataCite, PubMed). It assigns OMIDs, performs deduplication and enrichment, and exposes data via SPARQL/REST. No equations, parameters, predictions, or derivations appear in the provided text. Claims about size and uniqueness are presented as direct consequences of the aggregation process rather than outputs derived from the database itself. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is self-contained as a report of infrastructure building.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and infrastructure paper; no free parameters, mathematical axioms or invented scientific entities are introduced. The central claim rests on the existence and correct operation of the described data integration pipeline.

pith-pipeline@v0.9.0 · 5756 in / 1056 out tokens · 16061 ms · 2026-05-24T08:25:10.293504+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex

    cs.DL 2023-12 unverdicted novelty 4.0

    Authors map entities between OpenCitations Meta and OpenAlex to add identifiers and evaluate bibliographic metadata consistency.