OpenCitations Meta

Arcangelo Massari; David Shotton; Fabio Mariani; Ivan Heibi; Silvio Peroni

arxiv: 2306.16191 · v2 · submitted 2023-06-28 · 💻 cs.DL

Pith reviewed 2026-05-24 08:25 UTC · model grok-4.3

classification 💻 cs.DL

keywords OpenCitations · bibliographic metadata · Semantic Web · persistent identifiers · data curation · open science · SPARQL endpoint · CC0 license

The pith

OpenCitations Meta merges metadata from Crossref, DataCite and PubMed into the largest Semantic Web bibliographic database and assigns its own persistent identifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenCitations Meta as a new open database of bibliographic metadata for publications involved in OpenCitations citations. It draws records from Crossref, DataCite and PubMed, applies an automated curation process, and releases everything under CC0. By assigning OMIDs it disambiguates entries that appear under different external identifiers and covers publications that have none. Internal storage of the metadata removes the need for live external API calls during queries. The system also records full provenance for every change.

Core claim

OpenCitations Meta stores bibliographic metadata for scholarly publications cited within the OpenCitations infrastructure, following the OpenCitations Data Model and published under CC0. It ingests data from Crossref, DataCite and PubMed to become the largest bibliographic metadata collection that uses Semantic Web technologies. It creates OMIDs for every resource so that publications described by different external PIDs can be unified and so that works without external PIDs can still participate in citations. Metadata is hosted internally rather than fetched on demand, and an automated pipeline performs deduplication, error correction, enrichment and complete provenance tracking.

What carries the argument

OpenCitations Meta Identifiers (OMIDs) together with the automated curation pipeline that follows the OpenCitations Data Model.
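
To make the disambiguation idea concrete, here is a minimal sketch of how records that share any external identifier could be grouped under one internal identifier. It is not the OpenCitations pipeline: the function name `assign_omids`, the union-find approach, and the simplified `omid:br/<n>` string format are all assumptions chosen for illustration.

```python
from itertools import count


def assign_omids(records):
    """Give one identifier to every group of records linked by a shared external ID.

    `records` is a list of sets of external identifier strings, for example
    {"doi:10.1/a", "pmid:1"}. An empty set models a work with no external PID.
    Returns a list of identifier strings parallel to `records`.
    The "omid:br/<n>" format is a simplified placeholder (assumption).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so numbering follows input order.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # external ID -> index of the first record carrying it
    for idx, ext_ids in enumerate(records):
        for ext_id in ext_ids:
            if ext_id in first_seen:
                union(idx, first_seen[ext_id])
            else:
                first_seen[ext_id] = idx

    counter = count(1)
    omid_of_root = {}
    result = []
    for idx in range(len(records)):
        root = find(idx)
        if root not in omid_of_root:
            omid_of_root[root] = f"omid:br/{next(counter)}"
        result.append(omid_of_root[root])
    return result
```

In this toy version a record with a DOI and a record with both that DOI and a PMID end up with the same identifier, and a record with no external identifier still receives one of its own.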

If this is right

Publications described by different external PIDs such as a DOI and a PMID become a single record.
Citations involving publications that lack any external PID can still be recorded and queried.
Query responses no longer depend on live calls to external APIs, raising performance.
Every metadata change carries full provenance, making data integrity traceable.
Access is available through SPARQL, REST APIs and bulk dumps while remaining fully interoperable with other Semantic Web resources.
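
As a rough illustration of the REST access route, the sketch below builds a metadata lookup URL from a prefixed identifier and fetches the JSON response. The base URL `https://opencitations.net/meta/api/v1`, the `/metadata/<scheme>:<value>` path, and the JSON response format are assumptions that should be checked against the service documentation; `metadata_url` and `fetch_metadata` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed base URL of the REST API; verify against the official documentation.
BASE_URL = "https://opencitations.net/meta/api/v1"


def metadata_url(identifier, base_url=BASE_URL):
    """Build a lookup URL for a prefixed identifier such as 'doi:10.1234/abc'."""
    scheme, sep, value = identifier.partition(":")
    if not sep or not scheme or not value:
        raise ValueError(f"expected '<scheme>:<value>', got {identifier!r}")
    # The scheme prefix is lowercased; the value is passed through unchanged.
    return f"{base_url}/metadata/{scheme.lower()}:{value}"


def fetch_metadata(identifier):
    """Fetch and decode the JSON response (performs a network call)."""
    request = urllib.request.Request(
        metadata_url(identifier), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping URL construction separate from the network call makes the first part easy to check offline; only `fetch_metadata` touches the remote service.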

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Analyses that combine citation links with bibliographic details can be performed inside a single local store rather than across multiple external services.
The CC0 release and provenance records create a foundation that other projects could reuse or extend without legal or technical friction.
If the curation pipeline proves reliable over time, the database could serve as a reference point for checking completeness of other open metadata collections.
The same internal-hosting pattern could be applied to citation data itself to further reduce external dependencies.

Load-bearing premise

The automated curation pipeline can deduplicate records, correct errors and enrich metadata from heterogeneous sources without introducing systematic new errors or losing coverage.

What would settle it

A sample audit that finds the same publication assigned two different OMIDs or that finds source records from Crossref, DataCite or PubMed that are absent from the Meta database after the claimed ingestion.
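
Such an audit could be sketched as follows, assuming the sample is available as plain Python collections: a list of external identifiers taken from the sources and a list of `(omid, external_id)` pairs exported from the database. The function name `audit_sample` and the input shapes are hypothetical.

```python
from collections import defaultdict


def audit_sample(source_ids, meta_rows):
    """Check a sample for the two failure modes described above.

    source_ids: iterable of external IDs drawn from the source datasets.
    meta_rows:  iterable of (omid, external_id) pairs from the database.
    Returns (split, missing):
      split   maps each external ID attached to more than one OMID to those OMIDs
      missing lists source IDs that have no OMID at all
    """
    omids_by_ext = defaultdict(set)
    for omid, ext_id in meta_rows:
        omids_by_ext[ext_id].add(omid)

    # Same external ID under several OMIDs: a deduplication failure.
    split = {
        ext_id: sorted(omids)
        for ext_id, omids in omids_by_ext.items()
        if len(omids) > 1
    }
    # Source IDs never seen in the database: a coverage loss.
    missing = sorted(set(source_ids) - set(omids_by_ext))
    return split, missing
```

An empty `split` and an empty `missing` on a representative sample would be consistent with the premise; this check only catches duplicates that share an external identifier, not duplicates that differ in every identifier.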

Original abstract

OpenCitations Meta is a new database for open bibliographic metadata of scholarly publications involved in the citations indexed by the OpenCitations infrastructure, adhering to Open Science principles and published under a CC0 license to promote maximum reuse. It presently incorporates bibliographic metadata for publications recorded in Crossref, DataCite and PubMed, making it the largest bibliographic metadata source using Semantic Web technologies. It assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to all bibliographic resources, enabling it both to disambiguate publications described using different external PIDs (e.g., a DOI in Crossref and a PMID in PubMed), and to handle citations involving publications lacking external PIDs. By hosting bibliographic metadata internally, OpenCitations Meta eliminates its former reliance on API calls to external resources and thus enhances performance in response to user queries. Its automated data curation, following the OpenCitations Data Model, includes deduplication, error correction, metadata enrichment and full provenance tracking, ensuring transparency and traceability of data and bolstering confidence in data integrity, a feature unparalleled in other bibliographic databases. Its commitment to Semantic Web standards ensures superior interoperability compared to other machine-readable formats, with availability via a SPARQL endpoint, REST APIs and data dumps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

OpenCitations Meta adds OMIDs, internal hosting, and provenance tracking to aggregated metadata from Crossref, DataCite and PubMed, but the largest-source claim has no counts or benchmarks to back it.

The core contribution is a deployed database that pulls bibliographic records from those three sources, assigns OMIDs to disambiguate across PIDs and cover items without external identifiers, stores everything locally, and keeps full provenance on the curation steps. Internal hosting removes the old API dependency, which should improve query speed, and the CC0 plus SPARQL/REST/dump access follows through on the open-science pitch. The OMID scheme and the decision to track every change are the parts that feel like actual engineering progress over prior OpenCitations work. The curation pipeline description covers deduplication, error fixes, and enrichment, which addresses a real pain point in cross-source data.

The main gap is the size claim. The text states it is now the largest Semantic Web bibliographic collection but supplies no unique-publication totals, no OMID counts, and no side-by-side numbers against Wikidata or other RDF collections. That leaves the headline assertion untested. The abstract-only view also leaves the actual error rates and coverage losses from the automated pipeline unexamined, though nothing in the description contradicts itself.

This is a resource paper aimed at people building citation graphs or knowledge bases who want an open, queryable alternative to commercial indexes. It is worth a serious referee round so the implementation details and scale numbers can be checked before wider adoption.

Referee Report

1 major / 0 minor

Summary. The paper presents OpenCitations Meta, a new open bibliographic metadata database adhering to Open Science principles and published under CC0. It aggregates metadata for publications from Crossref, DataCite and PubMed, assigns OMIDs to enable disambiguation across external PIDs and to handle publications without PIDs, hosts metadata internally to eliminate external API calls, performs automated curation (deduplication, error correction, enrichment, provenance tracking) per the OpenCitations Data Model, and exposes data via SPARQL endpoint, REST APIs and dumps. The abstract asserts that this makes it the largest bibliographic metadata source using Semantic Web technologies.

Significance. If the scale, curation accuracy and provenance claims hold, the work provides a substantial open infrastructure contribution: a large-scale, interoperable Semantic Web bibliographic resource that improves query performance over prior external-API reliance and offers transparent, traceable data not matched by other bibliographic databases. This directly supports reuse, interoperability and scholarly analysis under open-science principles.

major comments (1)

[Abstract] The claim that OpenCitations Meta is 'the largest bibliographic metadata source using Semantic Web technologies' is unsupported by any reported counts of unique publications, OMIDs or citations, and by any explicit comparison to other RDF-based collections (e.g., Wikidata scholarly items). Without these figures the size assertion remains unevaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of OpenCitations Meta. We address the single major comment below and will revise the manuscript to strengthen the unsupported claim.

Referee: [Abstract] The claim that OpenCitations Meta is 'the largest bibliographic metadata source using Semantic Web technologies' is unsupported by any reported counts of unique publications, OMIDs or citations, and by any explicit comparison to other RDF-based collections (e.g., Wikidata scholarly items). Without these figures the size assertion remains unevaluated.

Authors: We agree that the size claim in the abstract is currently unsupported, as the manuscript provides no explicit counts of unique publications, OMIDs or citations, nor any direct comparison to other Semantic Web resources such as Wikidata. In the revised manuscript we will add these quantitative figures (drawn from the integrated Crossref, DataCite and PubMed sources) together with a concise comparison to relevant RDF collections, either substantiating the claim or qualifying it appropriately.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted results; database construction paper with no self-referential predictions

The paper describes construction of OpenCitations Meta by ingesting and curating bibliographic metadata from external sources (Crossref, DataCite, PubMed). It assigns OMIDs, performs deduplication and enrichment, and exposes data via SPARQL/REST. No equations, parameters, predictions, or derivations appear in the provided text. Claims about size and uniqueness are presented as direct consequences of the aggregation process rather than outputs derived from the database itself. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is self-contained as a report of infrastructure building.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and infrastructure paper; no free parameters, mathematical axioms or invented scientific entities are introduced. The central claim rests on the existence and correct operation of the described data integration pipeline.

pith-pipeline@v0.9.0 · 5756 in / 1056 out tokens · 16061 ms · 2026-05-24T08:25:10.293504+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex
cs.DL · 2023-12 · unverdicted · novelty 4.0

Authors map entities between OpenCitations Meta and OpenAlex to add identifiers and evaluate bibliographic metadata consistency.