
arxiv: 2312.16523 · v2 · submitted 2023-12-27 · 💻 cs.DL

Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex

Pith reviewed 2026-05-24 05:19 UTC · model grok-4.3

classification 💻 cs.DL
keywords bibliographic metadata · entity mapping · OpenCitations Meta · OpenAlex · data integration · metadata consistency · open scholarly data · identifier alignment

The pith

Mapping entities between OpenCitations Meta and OpenAlex integrates OpenAlex identifiers into the former collection and exposes inconsistencies in bibliographic metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a methodology for matching bibliographic entities across OpenCitations Meta and OpenAlex, two large open collections. Its main goal is to add OpenAlex internal identifiers to records already held in OpenCitations Meta, creating direct links between the two datasets. The authors also examine the results of this matching to assess how consistent and accurate the metadata descriptions are in each collection. A sympathetic reader would care because successful mapping would allow researchers to combine data from both sources without manual reconciliation and would flag places where existing records disagree on basic facts such as authorship or publication details.

Core claim

The mapping procedure successfully aligns a substantial portion of entities between the two collections, thereby permitting the insertion of OpenAlex identifiers into OpenCitations Meta records, and the comparison of matched records reveals measurable differences in how the same bibliographic resources are described in each dataset.

What carries the argument

The entity matching procedure that identifies corresponding bibliographic resources across the two collections and transfers OpenAlex identifiers.
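
As a rough illustration of what such a procedure can look like, the sketch below links records that share at least one external persistent identifier (PID) and splits the outcome into single-mapped, multi-mapped, and non-mapped records, mirroring the three output tables described in the paper snippets listed in the reference graph. The function name, the dictionary layout, and all identifier strings are assumptions made for this example; it is not the authors' code.

```python
from collections import defaultdict


def map_by_shared_pid(oc_meta, openalex):
    """Link OC Meta records to OpenAlex records that share an external PID.

    oc_meta:  dict mapping an OMID to a set of external PIDs (hypothetical layout)
    openalex: dict mapping an OpenAlex ID to a set of external PIDs
    Returns (single, multi, unmapped).
    """
    # Inverted index: external PID -> OpenAlex IDs that carry it.
    pid_index = defaultdict(set)
    for oa_id, pids in openalex.items():
        for pid in pids:
            pid_index[pid].add(oa_id)

    single, multi, unmapped = {}, {}, []
    for omid, pids in oc_meta.items():
        # Collect every OpenAlex ID reachable through any shared PID.
        candidates = set()
        for pid in pids:
            candidates |= pid_index.get(pid, set())
        if len(candidates) == 1:
            single[omid] = next(iter(candidates))
        elif candidates:
            multi[omid] = sorted(candidates)
        else:
            unmapped.append(omid)
    return single, multi, unmapped


# Toy data with made-up identifiers.
oc_meta = {
    "omid:br/1": {"doi:10.1/a"},
    "omid:br/2": {"doi:10.1/b", "pmid:7"},
    "omid:br/3": {"doi:10.1/c"},
}
openalex = {"W1": {"doi:10.1/a"}, "W2": {"doi:10.1/b"}, "W3": {"pmid:7"}}
single, multi, unmapped = map_by_shared_pid(oc_meta, openalex)
```

In this toy run the first record is single-mapped, the second is multi-mapped because its two PIDs point to different OpenAlex records, and the third is non-mapped. The multi-mapped group is the one the paper goes on to categorise.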

If this is right

  • OpenCitations Meta records gain direct pointers to OpenAlex entries, enabling combined queries across both collections.
  • Differences detected during matching can be used to flag candidate corrections in either dataset.
  • The same matching logic can be reapplied after either collection is updated to keep the links current.
  • Users of either collection obtain a larger, cross-referenced view of the scholarly literature without building their own reconciliation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the matching error rate proves low, the approach could serve as a template for linking additional open bibliographic sources such as Crossref or DataCite.
  • Persistent differences uncovered by the mapping may indicate systematic choices in how each collection harvests or cleans its source data.
  • The interlinked identifiers would make it easier to measure coverage gaps between the two collections on specific topics or time periods.

Load-bearing premise

The procedure that decides whether two records describe the same resource does so with low enough error that the resulting links and consistency checks remain trustworthy.

What would settle it

A manual audit of a random sample of matched pairs that finds more than a small percentage of incorrect links or a systematic bias in the types of resources that fail to match.
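
To make "more than a small percentage" concrete, an audit result can be reported as an error rate with a confidence interval rather than a single number. The sketch below uses the Wilson score interval, which is one common choice and not something the paper prescribes; the sample size and error count in the usage lines are hypothetical.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate observed in an audit sample.

    errors: number of incorrect links found in the sample
    n:      number of audited links (must be positive)
    z:      normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must lie between 0 and n")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 10 incorrect links among 500 sampled pairs.
low, high = wilson_interval(10, 500)
```

If the upper bound stays below whatever error rate the intended use can tolerate, the links can reasonably be treated as trustworthy for that use. A separate interval per resource type would help expose the systematic bias mentioned above.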

Original abstract

This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes a methodology for mapping bibliographic entities between OpenCitations Meta and OpenAlex, with the goals of integrating OpenAlex internal identifiers into OpenCitations Meta and using the mapping results to assess consistency and accuracy of bibliographic metadata across the two collections.

Significance. If the entity matching is shown to be reliable, the resulting interlinked dataset and metadata analysis could support improved discoverability and quality assessment in open bibliographic infrastructures; the work is primarily descriptive and does not claim novel algorithms or theoretical advances.

major comments (2)
  1. [Abstract / methodology] Abstract and methodology description: the entity-matching procedure is presented at a high level without any reported precision, recall, error rates, or validation against ground truth (held-out sample, manual audit, or external benchmark), which directly undermines the claim that the mapping yields a perspective on metadata consistency and accuracy; observed discrepancies cannot be distinguished from matching artifacts.
  2. [Results] Results section: without quantified validation of the matching step, any reported inconsistencies in bibliographic metadata (e.g., title, author, or DOI mismatches) remain uninterpretable as evidence of collection quality rather than artifacts of the unspecified matching rules or thresholds.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including even high-level statistics on the size of the collections and the fraction of entities successfully mapped.
  2. [Introduction] Notation for identifiers (e.g., OpenAlex IDs vs. OpenCitations Meta IDs) should be defined explicitly on first use to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the need for explicit validation of the entity-matching step. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / methodology] Abstract and methodology description: the entity-matching procedure is presented at a high level without any reported precision, recall, error rates, or validation against ground truth (held-out sample, manual audit, or external benchmark), which directly undermines the claim that the mapping yields a perspective on metadata consistency and accuracy; observed discrepancies cannot be distinguished from matching artifacts.

    Authors: We agree that the initial submission describes the matching rules at a relatively high level and does not include quantitative validation metrics. The procedure combines exact DOI matching with fuzzy matching on titles and author names using fixed similarity thresholds; however, without reported precision/recall or a ground-truth audit, it is difficult for readers to separate matching errors from genuine metadata differences. In the revised version we will expand the methodology section with a new subsection that details the exact matching rules and thresholds, reports the results of a manual audit performed on a random sample of 500 mappings (providing estimated precision and recall), and discusses the implications for interpreting the observed discrepancies. revision: yes

  2. Referee: [Results] Results section: without quantified validation of the matching step, any reported inconsistencies in bibliographic metadata (e.g., title, author, or DOI mismatches) remain uninterpretable as evidence of collection quality rather than artifacts of the unspecified matching rules or thresholds.

    Authors: We acknowledge that the current results section presents mismatch statistics without linking them to a quantified assessment of matching reliability. In the revision we will add explicit cross-references to the new validation metrics and include a short discussion that compares the observed mismatch rates against the estimated error rates from the audit, thereby clarifying the extent to which the reported inconsistencies can be attributed to differences between the two collections rather than to the matching process itself. revision: yes

Circularity Check

0 steps flagged

No circularity; purely descriptive mapping exercise with no derivations or self-referential predictions

Full rationale

The paper describes a methodology for mapping entities between OpenCitations Meta and OpenAlex to integrate identifiers and analyze metadata consistency. No equations, fitted parameters, predictions, or uniqueness theorems are present. The work is a data-processing exercise whose central steps (entity matching) are not claimed to derive from prior self-citations or reduce to inputs by construction. Self-citations, if any, are not load-bearing for any claimed result. This matches the default expectation of no significant circularity for descriptive bibliographic studies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a methodological description of a data-mapping process; it introduces no free parameters, mathematical axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5602 in / 899 out tokens · 23151 ms · 2026-05-24T05:19:43.692041+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 2 internal anchors

  1. [1]

    Material and methods

    Introduction Open bibliographic metadata collections play a pivotal role in enabling reproducible studies in the fields of bibliometrics, scientometrics and science of science and permit transparent procedures in the context of research assessment exercises, thus enabling the implementation of norms and guidelines that intend to reform the research assess...

  2. [2]

    Material and methods The following subsections analyse multi-mapped and non-mapped BRs in more detail. 2.1. Data The two collections involved in the mapping process are OpenCitations Meta (OC Meta) and the OpenAlex catalogue (henceforth, just OpenAlex). In particular, only a subset of the entities in both collections has been considered for the mapping, n...

  3. [3]

    A table storing OMID, OpenAlex ID, and type of the BRs, if exactly one OpenAlex ID per OMID has been found

  4. [4]

    A table storing OMID, OpenAlex IDs, and type of the BRs, if multiple OpenAlex IDs per OMID have been found (multi-mapped BRs)

  5. [5]

    A table storing OMID and type of the BRs, if no OpenAlex ID has been found (non-mapped BRs). The primary purpose of the mapping is to enable the addition of OpenAlex IDs to other available external persistent identifiers (PIDs) among the metadata of bibliographic resources already existing in OC Meta. However, the potential uses of the outcome of this pro...

  6. [6]

    Category A includes cases where two or more Works among the ones that are multi-mapped to a single OC Meta BR share at least one external PID. Given that external PIDs, such as DOIs, should be uniquely assigned to a BR, having more than one entity with the same external PID in the OpenAlex dataset means that there are either duplicate entities or errors i...

  7. [7]

    in the case of having a version of record and one or more preprint and/or postprint versions

    Category B includes cases where the same entity in OC Meta is mapped to different versions of the same publication, each represented by a Work entity in OpenAlex – e.g. in the case of having a version of record and one or more preprint and/or postprint versions. Preprints and postprints are hosted in a preprint server or a digital repository. DOIs of prep...

  8. [8]

    The most likely causes for this scenario are errors in the data source used by OC Meta, bugs in OC Meta software, or different DOIs intentionally linked to the same OC Meta entity

    Category C includes cases where the same entity in OC Meta is mapped to exactly 2 different Works in OpenAlex, and neither is a preprint or postprint version. The most likely causes for this scenario are errors in the data source used by OC Meta, bugs in OC Meta software, or different DOIs intentionally linked to the same OC Meta entity

  9. [9]

    This typology is similar to category B, but it only includes preprint versions and detects them by checking for version number (e.g

    Category D includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to category B, but it only includes preprint versions and detects them by checking for version number (e.g. “/v1”) in the DOI value

  10. [10]

    /arxiv” or “/zenodo

    Category E includes cases where the same entity in OC Meta is mapped to multiple preprint versions of the same publication, each represented by a Work entity in OpenAlex. This typology is similar to categories B and D, but detects preprint versions by analysing the DOI value and checking if it contains semantic indicators that associate the DOI with a pre...

  11. [11]

    peer-review

    Category F includes cases where the multi-mapped OpenAlex Works include a version of record, together with one or more Works of type “peer-review”, “letter”, “editorial”, “erratum”, or “other”. For example, the DOI for an erratum notice and a DOI for the journal article that is being corrected may be wrongly assigned the same OMID in OC Meta, due to error...

  12. [12]

    Multi-mapped BRs in the form of a table where each row represents the association of one BR in OC Meta with n BRs in OpenAlex, storing an OMID in the omid field and a list of OpenAlex IDs in the openalex_id field

  13. [13]

    A list of 80 DOI prefixes that are assigned by Crossref and DataCite to organisations or institutions that manage preprint servers or digital repositories hosting non-peer-reviewed versions

  14. [14]

    /arxiv”, “/preprints

    A list of strings that, when found inside a DOI value, indicate that the associated publication is hosted in a preprint server (e.g. “/arxiv”, “/preprints”, “/osf.io”)

  15. [15]

    D. Otherwise, an assessment is made for DOI prefixes associated with preprint servers, leading to categorisations such as “B

    A SQL database storing full metadata of the OpenAlex BRs involved in the multi-mapping. The process differentiates between OpenAlex Works and OpenAlex Sources. For rows storing Works, the process includes querying the database for external PIDs associated with each Work. If any PID is associated with multiple Works in the row, the categorisation is labell...

  16. [16]

    br/06602375171

    Results Table 1 shows the number of processed BRs for both datasets and the general results of a quantitative analysis of the mapping output. As mentioned above, a BR entity in OC Meta can be mapped to a BR entity in OpenAlex only if both entities are associated with at least one external PID in common. Thus, the BRs in the OC Meta CSV dump that are theor...

  17. [17]

    This study highlighted problems and inconsistencies within the used datasets

    Discussion The mapping process and the analysis of its results concerned the study and use of a great amount of data from the involved databases, requiring, for example, the consideration of all bibliographic entities in their entirety. This study highlighted problems and inconsistencies within the used datasets. First, concerning OC Meta, the process pro...

  18. [18]

    This achievement is significant, as it allows for the direct ingestion of OpenAlex IDs into the metadata of the corresponding bibliographic resources in OC Meta

    Conclusions The results of the mapping of OpenCitations Meta bibliographic resources to OpenAlex bibliographic resources have provided valuable insights into the integration of bibliographic metadata entities, showcasing that the majority of processed OC Meta resources are successfully mapped with exactly one entity in OpenAlex. This achievement is signif...

  19. [19]

    Acknowledgements This project has been made possible through the generous support of the European Research Council, for which the authors extend their sincere gratitude

  20. [20]

    OpenCitations Meta

    A. Massari, F. Mariani, I. Heibi, S. Peroni, and D. Shotton, ‘OpenCitations Meta’. Jun. 28, 2023. doi: https://doi.org/10.48550/arXiv.2306.16191

  21. [21]

    OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

    J. Priem, H. Piwowar, and R. Orr, ‘OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts’, presented at the 26th International Conference on Science and Technology Indicators, arXiv, 2022. doi: 10.48550/ARXIV.2205.01833

  22. [22]

    The OpenCitations Data Model

    M. Daquino et al., ‘The OpenCitations Data Model’, in The Semantic Web – ISWC 2020, J. Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, and L. Kagal, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 447–463. doi: 10.1007/978-3-030-62466-8_28

  23. [23]

    The OpenCitations Data Model

    M. Daquino, A. Massari, S. Peroni, and D. Shotton, ‘The OpenCitations Data Model’. figshare,

  24. [24]

    doi: 10.6084/M9.FIGSHARE.3443876.V8

  25. [25]

    doi: https://doi.org/10.6084/m9.figshare.21747461.v5

    ‘OpenCitations Meta CSV dataset of all bibliographic metadata’. doi: https://doi.org/10.6084/m9.figshare.21747461.v5

  26. [26]

    OpenCitations, an infrastructure organization for open scholarship

    S. Peroni and D. Shotton, ‘OpenCitations, an infrastructure organization for open scholarship’, Quant. Sci. Stud., vol. 1, no. 1, pp. 428–444, Feb. 2020, doi: 10.1162/qss_a_00023

  27. [27]

    Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations

    I. Heibi, S. Peroni, and D. Shotton, ‘Software review: COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations’, Scientometrics, vol. 121, no. 2, pp. 1213–1228, Nov. 2019, doi: 10.1007/s11192-019-03217-6

  28. [28]

    A. Sinha et al., ‘An Overview of Microsoft Academic Service (MAS) and Applications’, in Proceedings of the 24th International Conference on World Wide Web, Florence Italy: ACM, May 2015, pp. 243–246. doi: 10.1145/2740908.2742839

  29. [29]

    PubMed: The Bibliographic Database

    K. Canese, J. Jentsch, and C. Myers, ‘PubMed: The Bibliographic Database’, in The NCBI Handbook, 2nd ed., 2013, p. 9. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK153385/

  30. [30]

    Directory of Open Access Journals (DOAJ)

    H. Morrison, ‘Directory of Open Access Journals (DOAJ)’, Charlest. Advis., vol. 18, no. 3, pp. 25–28, Jan. 2017, doi: 10.5260/chara.18.3.25

  31. [31]

    Unpaywall

    K. Dhakal, ‘Unpaywall’, J. Med. Libr. Assoc., vol. 107, no. 2, Apr. 2019, doi: 10.5195/jmla.2019.650

  32. [32]

    S. Sigurdsson, ‘The future of arXiv and knowledge discovery in open science’, in Proceedings of the First Workshop on Scholarly Document Processing, Online: Association for Computational Linguistics, 2020, pp. 7–9. doi: 10.18653/v1/2020.sdp-1.2

  33. [33]

    Zenodo: Research. Shared.

    European Organization For Nuclear Research and OpenAIRE, ‘Zenodo: Research. Shared.’, 2013, doi: 10.25495/7GXK-RD71

  34. [34]

    Pubmed central

    C. Maloney, E. Sequeiera, C. Kelly, R. Orris, and J. Beck, ‘Pubmed central’, in The NCBI Handbook, 2nd ed., 2013. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK153388/

  35. [35]

    The OpenAIRE Workflows for Data Management

    C. Atzori, A. Bardi, P. Manghi, and A. Mannocci, ‘The OpenAIRE Workflows for Data Management’, in Digital Libraries and Archives, C. Grana and L. Baraldi, Eds., in Communications in Computer and Information Science, vol. 733. Cham: Springer International Publishing, 2017, pp. 95–107. doi: 10.1007/978-3-319-68130-6_8

  36. [36]

    Introduction of Japan Link Center (JaLC)

    M. Hara, ‘Introduction of Japan Link Center (JaLC)’. ORCID, 2020. doi: 10.23640/07243.12469094.V1

  37. [37]

    Crossref: The sustainable source of community-owned scholarly metadata

    G. Hendricks, D. Tkaczyk, J. Lin, and P. Feeney, ‘Crossref: The sustainable source of community-owned scholarly metadata’, Quant. Sci. Stud., vol. 1, no. 1, pp. 414–427, Feb. 2020, doi: 10.1162/qss_a_00022

  38. [38]

    Datacite - A Global Registration Agency for Research Data

    J. Brase, ‘Datacite - A Global Registration Agency for Research Data’, SSRN Electron. J., 2010, doi: 10.2139/ssrn.1639998
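
The extracted snippets for categories B, D, and E above describe three DOI-based signals for spotting preprint versions among multi-mapped Works: a version number such as "/v1" in the DOI, substrings such as "/arxiv" or "/osf.io", and a list of DOI prefixes assigned to preprint servers or repositories. A minimal sketch of such checks follows. The regular expression, the order of the checks, and the one-element prefix set are assumptions; the paper's list of 80 prefixes is not reproduced in the snippets, and the single prefix used here is simply the one visible in the arXiv DOI of reference [20].

```python
import re

# Assumption: a version marker is a trailing "/v<digits>" segment, as in "/v1".
VERSION_SUFFIX = re.compile(r"/v\d+$")

# Substrings quoted in the snippets as indicators of preprint hosting.
PREPRINT_MARKERS = ("/arxiv", "/zenodo", "/preprints", "/osf.io")

# Hypothetical stand-in for the paper's list of 80 DOI prefixes.
PREPRINT_PREFIXES = {"10.48550"}


def preprint_signal(doi):
    """Return which preprint signal a DOI triggers, or None if it triggers none."""
    doi = doi.strip().lower()
    if VERSION_SUFFIX.search(doi):
        return "version-number"   # signal described for category D
    if any(marker in doi for marker in PREPRINT_MARKERS):
        return "semantic-marker"  # signal described for category E
    if doi.split("/", 1)[0] in PREPRINT_PREFIXES:
        return "prefix"           # signal described for category B
    return None


# Made-up and real-looking DOIs used only to exercise each branch.
signals = [
    preprint_signal(d)
    for d in ("10.9999/example-1/v2", "10.48550/arXiv.2306.16191", "10.1162/qss_a_00023")
]
```

A multi-mapped row would then be labelled according to the signals its Works trigger. How ties and mixed rows are resolved is left open here, since the snippets elide that part of the procedure.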