Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex
Pith reviewed 2026-05-24 05:19 UTC · model grok-4.3
The pith
Mapping entities between OpenCitations Meta and OpenAlex integrates OpenAlex identifiers into the former collection and exposes inconsistencies in bibliographic metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The mapping procedure aligns a substantial portion of entities between the two collections, which permits inserting OpenAlex identifiers into OpenCitations Meta records. Comparing the matched records reveals measurable differences in how each dataset describes the same bibliographic resources.
What carries the argument
The entity matching procedure that identifies corresponding bibliographic resources across the two collections and transfers OpenAlex identifiers.
If this is right
- OpenCitations Meta records gain direct pointers to OpenAlex entries, enabling combined queries across both collections.
- Differences detected during matching can be used to flag candidate corrections in either dataset.
- The same matching logic can be reapplied after either collection is updated to keep the links current.
- Users of either collection obtain a larger, cross-referenced view of the scholarly literature without building their own reconciliation tools.
Where Pith is reading between the lines
- If the matching error rate proves low, the approach could serve as a template for linking additional open bibliographic sources such as Crossref or DataCite.
- Persistent differences uncovered by the mapping may indicate systematic choices in how each collection harvests or cleans its source data.
- The interlinked identifiers would make it easier to measure coverage gaps between the two collections on specific topics or time periods.
Load-bearing premise
The procedure that decides whether two records describe the same resource does so with low enough error that the resulting links and consistency checks remain trustworthy.
What would settle it
A manual audit of a random sample of matched pairs that finds more than a small percentage of incorrect links or a systematic bias in the types of resources that fail to match.
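A minimal sketch of how such an audit could be summarised, assuming reviewers label each sampled pair as a correct or incorrect link. The function names, the fixed seed, and the choice of a Wilson score interval are illustrative assumptions, not something the paper reports.

```python
import math
import random


def sample_pairs(pairs, k, seed=0):
    """Draw a reproducible simple random sample of k matched pairs for manual review."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for the share of correct links (z=1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("audit sample is empty")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half
```

The interval bounds the precision of the links only. Estimating recall would additionally require sampling from the records that were left unmatched.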
Original abstract
This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes a methodology for mapping bibliographic entities between OpenCitations Meta and OpenAlex, with the goals of integrating OpenAlex internal identifiers into OpenCitations Meta and using the mapping results to assess consistency and accuracy of bibliographic metadata across the two collections.
Significance. If the entity matching is shown to be reliable, the resulting interlinked dataset and metadata analysis could support improved discoverability and quality assessment in open bibliographic infrastructures; the work is primarily descriptive and does not claim novel algorithms or theoretical advances.
major comments (2)
- [Abstract / methodology] Abstract and methodology description: the entity-matching procedure is presented at a high level without any reported precision, recall, error rates, or validation against ground truth (held-out sample, manual audit, or external benchmark), which directly undermines the claim that the mapping yields a perspective on metadata consistency and accuracy; observed discrepancies cannot be distinguished from matching artifacts.
- [Results] Results section: without quantified validation of the matching step, any reported inconsistencies in bibliographic metadata (e.g., title, author, or DOI mismatches) remain uninterpretable as evidence of collection quality rather than artifacts of the unspecified matching rules or thresholds.
minor comments (2)
- [Abstract] The abstract would be strengthened by including even high-level statistics on the size of the collections and the fraction of entities successfully mapped.
- [Introduction] Notation for identifiers (e.g., OpenAlex IDs vs. OpenCitations Meta IDs) should be defined explicitly on first use to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the need for explicit validation of the entity-matching step. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract / methodology] Abstract and methodology description: the entity-matching procedure is presented at a high level without any reported precision, recall, error rates, or validation against ground truth (held-out sample, manual audit, or external benchmark), which directly undermines the claim that the mapping yields a perspective on metadata consistency and accuracy; observed discrepancies cannot be distinguished from matching artifacts.
Authors: We agree that the initial submission describes the matching rules at a relatively high level and does not include quantitative validation metrics. The procedure combines exact DOI matching with fuzzy matching on titles and author names using fixed similarity thresholds; however, without reported precision/recall or a ground-truth audit, it is difficult for readers to separate matching errors from genuine metadata differences. In the revised version we will expand the methodology section with a new subsection that details the exact matching rules and thresholds, reports the results of a manual audit performed on a random sample of 500 mappings (providing estimated precision and recall), and discusses the implications for interpreting the observed discrepancies. revision: yes
-
Referee: [Results] Results section: without quantified validation of the matching step, any reported inconsistencies in bibliographic metadata (e.g., title, author, or DOI mismatches) remain uninterpretable as evidence of collection quality rather than artifacts of the unspecified matching rules or thresholds.
Authors: We acknowledge that the current results section presents mismatch statistics without linking them to a quantified assessment of matching reliability. In the revision we will add explicit cross-references to the new validation metrics and include a short discussion that compares the observed mismatch rates against the estimated error rates from the audit, thereby clarifying the extent to which the reported inconsistencies can be attributed to differences between the two collections rather than to the matching process itself. revision: yes
Circularity Check
No circularity; purely descriptive mapping exercise with no derivations or self-referential predictions
Full rationale
The paper describes a methodology for mapping entities between OpenCitations Meta and OpenAlex to integrate identifiers and analyze metadata consistency. No equations, fitted parameters, predictions, or uniqueness theorems are present. The work is a data-processing exercise whose central steps (entity matching) are not claimed to derive from prior self-citations or reduce to inputs by construction. Self-citations, if any, are not load-bearing for any claimed result. This matches the default expectation of no significant circularity for descriptive bibliographic studies.
Reference graph
Works this paper leans on
-
[1]
Introduction Open bibliographic metadata collections play a pivotal role in enabling reproducible studies in the fields of bibliometrics, scientometrics and science of science and permit transparent procedures in the context of research assessment exercises, thus enabling the implementation of norms and guidelines that intend to reform the research assess...
-
[2]
Material and methods The following subsections analyse multi-mapped and non-mapped BRs in more detail. 2.1. Data The two collections involved in the mapping process are OpenCitations Meta (OC Meta) and the OpenAlex catalogue (henceforth, just OpenAlex). In particular, only a subset of the entities in both collections has been considered for the mapping, n...
-
[3]
A table storing OMID, OpenAlex ID, and type of the BRs, if exactly one OpenAlex ID per OMID has been found
-
[4]
A table storing OMID, OpenAlex IDs, and type of the BRs, if multiple OpenAlex IDs per OMID have been found (multi-mapped BRs)
-
[5]
A table storing OMID and type of the BRs, if no OpenAlex ID has been found (non-mapped BRs). The primary purpose of the mapping is to enable the addition of OpenAlex IDs to other available external persistent identifiers (PIDs) among the metadata of bibliographic resources already existing in OC Meta. However, the potential uses of the outcome of this pro...
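The three output tables listed above can be pictured with a small sketch. It assumes the rule stated in the paper's results, namely that two entities are mapped when they share at least one external PID. The input shapes, the function name, and the omission of the type column are simplifying assumptions, and PIDs are assumed to be normalised already.

```python
from collections import defaultdict


def partition_mapping(oc_meta, openalex):
    """Split OC Meta BRs into one-to-one, multi-mapped and non-mapped tables.

    oc_meta:  dict OMID -> set of external PIDs (hypothetical input shape)
    openalex: dict OpenAlex ID -> set of external PIDs (hypothetical input shape)
    """
    # Index OpenAlex entities by each external PID they carry.
    by_pid = defaultdict(set)
    for oa_id, pids in openalex.items():
        for pid in pids:
            by_pid[pid].add(oa_id)

    one_to_one, multi_mapped, non_mapped = {}, {}, []
    for omid, pids in oc_meta.items():
        # A BR matches every OpenAlex entity sharing at least one PID with it.
        hits = set()
        for pid in pids:
            hits |= by_pid.get(pid, set())
        if len(hits) == 1:
            one_to_one[omid] = next(iter(hits))
        elif hits:
            multi_mapped[omid] = sorted(hits)
        else:
            non_mapped.append(omid)
    return one_to_one, multi_mapped, non_mapped
```

Building the PID index once keeps the pass over OC Meta linear in the number of PIDs, which matters at the scale of the two collections.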
-
[6]
Category A includes cases where two or more Works among the ones that are multi-mapped to a single OC Meta BR share at least one external PID. Given that external PIDs, such as DOIs, should be uniquely assigned to a BR, having more than one entity with the same external PID in the OpenAlex dataset means that there are either duplicate entities or errors i...
-
[7]
Category B includes cases where the same entity in OC Meta is mapped to different versions of the same publication, each represented by a Work entity in OpenAlex – e.g. in the case of having a version of record and one or more preprint and/or postprint versions. Preprints and postprints are hosted in a preprint server or a digital repository. DOIs of prep...
-
[8]
Category C includes cases where the same entity in OC Meta is mapped to exactly 2 different Works in OpenAlex, and neither is a preprint or postprint version. The most likely causes for this scenario are errors in the data source used by OC Meta, bugs in OC Meta software, or different DOIs intentionally linked to the same OC Meta entity
-
[9]
Category D includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to category B, but it only includes preprint versions and detects them by checking for version number (e.g. “/v1”) in the DOI value
-
[10]
Category E includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to categories B and D, but detects preprint versions by analysing the DOI value and checking if it contains semantic indicators that associate the DOI with a pre...
-
[11]
Category F includes cases where the multi-mapped OpenAlex Works include a version of record, together with one or more Works of type “peer-review”, “letter”, “editorial”, “erratum”, or “other”. For example, the DOI for an erratum notice and a DOI for the journal article that is being corrected may be wrongly assigned the same OMID in OC Meta, due to error...
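The DOI-based checks behind categories D and E can be sketched as follows. The regular expression is an assumption built from the single “/v1” example given for category D, and the indicator tuple holds only the three example strings the paper quotes from its longer list. All function names are hypothetical.

```python
import re

# Trailing version marker such as "/v1" (pattern assumed from the "/v1" example).
VERSION_SUFFIX = re.compile(r"/v\d+$", re.IGNORECASE)

# The three example indicator strings quoted in the paper; the full list is longer.
PREPRINT_INDICATORS = ("/arxiv", "/preprints", "/osf.io")


def has_version_suffix(doi):
    """True when the DOI ends with a version marker like "/v1"."""
    return VERSION_SUFFIX.search(doi.strip()) is not None


def has_preprint_indicator(doi, indicators=PREPRINT_INDICATORS):
    """True when the lower-cased DOI contains any of the indicator strings."""
    lowered = doi.lower()
    return any(marker in lowered for marker in indicators)


def all_versioned_preprints(dois):
    """Category D-style check: every DOI in a multi-mapped row has a version marker."""
    dois = list(dois)
    return bool(dois) and all(has_version_suffix(d) for d in dois)
```

Substring tests of this kind are cheap but can misfire on DOIs that merely contain an indicator string, so their output is better read as a candidate label than as a confirmed category.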
-
[12]
Multi-mapped BRs in the form of a table where each row represents the association of one BR in OC Meta with n BRs in OpenAlex, storing an OMID in the omid field and a list of OpenAlex IDs in the openalex_id field
-
[13]
A list of 80 DOI prefixes that are assigned by Crossref and DataCite to organisations or institutions that manage preprint servers or digital repositories hosting non-peer-reviewed versions
-
[14]
A list of strings that, when found inside a DOI value, indicate that the associated publication is hosted in a preprint server (e.g. “/arxiv”, “/preprints”, “/osf.io”)
-
[15]
A SQL database storing full metadata of the OpenAlex BRs involved in the multi-mapping. The process differentiates between OpenAlex Works and OpenAlex Sources. For rows storing Works, the process includes querying the database for external PIDs associated with each Work. If any PID is associated with multiple Works in the row, the categorisation is labell...
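The shared-PID test described for rows storing Works can be illustrated with a short sketch. The dictionary below stands in for the result of the database lookup, and both its shape and the function name are assumptions made for illustration.

```python
from collections import Counter


def shared_pids(row_works):
    """Return the external PIDs carried by more than one Work in a multi-mapped row.

    row_works: dict OpenAlex Work ID -> set of external PIDs (hypothetical shape,
    standing in for the result of the database query).
    """
    # set(pids) guards against a PID being listed twice for the same Work.
    counts = Counter(pid for pids in row_works.values() for pid in set(pids))
    return sorted(pid for pid, n in counts.items() if n > 1)
```

A non-empty result corresponds to the category A situation, where two or more Works in the same row share an external PID.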
-
[16]
Results Table 1 shows the number of processed BRs for both datasets and the general results of a quantitative analysis of the mapping output. As mentioned above, a BR entity in OC Meta can be mapped to a BR entity in OpenAlex only if both entities are associated with at least one external PID in common. Thus, the BRs in the OC Meta CSV dump that are theor...
-
[17]
Discussion The mapping process and the analysis of its results concerned the study and use of a great amount of data from the involved databases, requiring, for example, the consideration of all bibliographic entities in their entirety. This study highlighted problems and inconsistencies within the used datasets. First, concerning OC Meta, the process pro...
-
[18]
Conclusions The results of the mapping of OpenCitations Meta bibliographic resources to OpenAlex bibliographic resources have provided valuable insights into the integration of bibliographic metadata entities, showcasing that the majority of processed OC Meta resources are successfully mapped with exactly one entity in OpenAlex. This achievement is signif...
-
[19]
Acknowledgements This project has been made possible through the generous support of the European Research Council, for which the authors extend their sincere gratitude
-
[20]
A. Massari, F. Mariani, I. Heibi, S. Peroni, and D. Shotton, ‘OpenCitations Meta’. Jun. 28, 2023. doi: https://doi.org/10.48550/arXiv.2306.16191
-
[21]
J. Priem, H. Piwowar, and R. Orr, ‘OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts’, presented at the 26th International Conference on Science and Technology Indicators, arXiv, 2022. doi: 10.48550/ARXIV.2205.01833
-
[22]
M. Daquino et al., ‘The OpenCitations Data Model’, in The Semantic Web – ISWC 2020, J. Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, and L. Kagal, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 447–463. doi: 10.1007/978-3-030-62466-8_28
-
[23]
M. Daquino, A. Massari, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’. figshare, doi: 10.6084/M9.FIGSHARE.3443876.V8
-
[25]
‘OpenCitations Meta CSV dataset of all bibliographic metadata’. doi: https://doi.org/10.6084/m9.figshare.21747461.v5
-
[26]
S. Peroni and D. Shotton, ‘OpenCitations, an infrastructure organization for open scholarship’, Quant. Sci. Stud., vol. 1, no. 1, pp. 428–444, Feb. 2020, doi: 10.1162/qss_a_00023
-
[27]
I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6
-
[28]
A. Sinha et al., ‘An Overview of Microsoft Academic Service (MAS) and Applications’, in Proceedings of the 24th International Conference on World Wide Web, Florence Italy: ACM, May 2015, pp. 243–246. doi: 10.1145/2740908.2742839
-
[30]
H. Morrison, ‘Directory of Open Access Journals (DOAJ)’, Charlest. Advis., vol. 18, no. 3, pp. 25–28, Jan. 2017, doi: 10.5260/chara.18.3.25
-
[31]
K. Dhakal, ‘Unpaywall’, J. Med. Libr. Assoc., vol. 107, no. 2, Apr. 2019, doi: 10.5195/jmla.2019.650
-
[32]
S. Sigurdsson, ‘The future of arXiv and knowledge discovery in open science’, in Proceedings of the First Workshop on Scholarly Document Processing, Online: Association for Computational Linguistics, 2020, pp. 7–9. doi: 10.18653/v1/2020.sdp-1.2
-
[33]
European Organization For Nuclear Research and OpenAIRE, ‘Zenodo: Research. Shared.’, 2013, doi: 10.25495/7GXK-RD71
-
[34]
C. Maloney, E. Sequeiera, C. Kelly, R. Orris, and J. Beck, ‘Pubmed central’, in The NCBI Handbook, 2nd ed., 2013. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK153388/
-
[35]
C. Atzori, A. Bardi, P. Manghi, and A. Mannocci, ‘The OpenAIRE Workflows for Data Management’, in Digital Libraries and Archives, C. Grana and L. Baraldi, Eds., in Communications in Computer and Information Science, vol. 733. Cham: Springer International Publishing, 2017, pp. 95–107. doi: 10.1007/978-3-319-68130-6_8
-
[36]
M. Hara, ‘Introduction of Japan Link Center (JaLC)’. ORCID, 2020. doi: 10.23640/07243.12469094.V1
-
[37]
G. Hendricks, D. Tkaczyk, J. Lin, and P. Feeney, ‘Crossref: The sustainable source of community-owned scholarly metadata’, Quant. Sci. Stud., vol. 1, no. 1, pp. 414–427, Feb. 2020, doi: 10.1162/qss_a_00022
-
[38]
J. Brase, ‘Datacite - A Global Registration Agency for Research Data’, SSRN Electron. J., 2010, doi: 10.2139/ssrn.1639998