arxiv: 2602.00912 · v2 · submitted 2026-01-31 · 💻 cs.DL

Assessing and Comparing the Coverage of Italian Publications in OpenCitations: a Study within Six Italian Universities

Pith reviewed 2026-05-16 08:36 UTC · model grok-4.3

keywords: OpenCitations, IRIS, coverage, CRIS, Italian universities, open science, citation indexes, research assessment

The pith

OpenCitations covers over 40 percent of publications from six Italian universities' IRIS systems, matching levels reported for Scopus and Web of Science.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This study evaluates the coverage of publications recorded in the IRIS current research information systems at six Italian universities within OpenCitations. By matching persistent identifiers such as DOIs, PMIDs, and ISBNs from IRIS records to OpenCitations Meta, the authors extract citation data from the OpenCitations Index. The results show average coverage above 40 percent, which is quantitatively comparable to coverage figures previously reported for Scopus and Web of Science. Gaps remain, especially for monographs and critical editions common in the social sciences and humanities. The findings indicate that open citation infrastructures are reaching a stage where they can serve as practical alternatives for research assessment.

Core claim

OpenCitations covers, on average, over 40 percent of the publications recorded in the IRIS installations of the six Italian universities studied. Coverage was measured by matching persistent identifiers (DOIs, PMIDs, and ISBNs) specified in the IRIS records to entries in OpenCitations Meta, with citation links then extracted from the OpenCitations Index. This level is quantitatively comparable to that reported for Scopus and Web of Science in a prior study, although coverage is lower for publication types prevalent in the Social Sciences and Humanities such as monographs and critical editions.

What carries the argument

Matching of IRIS publication records to OpenCitations Meta via persistent identifiers (DOIs, PMIDs, ISBNs) to measure coverage and retrieve citation links from the OpenCitations Index.
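
As a rough illustration of the matching step described above, the sketch below computes coverage as the share of IRIS records that carry at least one persistent identifier also present in a set of identifiers taken from OpenCitations Meta. The function name, the record IDs, and the toy identifiers are assumptions made for illustration only; this is not the iris-oc-mapper implementation.

```python
from typing import Dict, Iterable, Set


def coverage(iris_records: Dict[str, Set[str]], meta_pids: Iterable[str]) -> float:
    """Share of IRIS records with at least one PID found in the Meta PID set."""
    meta = set(meta_pids)
    if not iris_records:
        return 0.0
    # A record counts as covered if any of its PIDs intersects the Meta set.
    matched = sum(1 for pids in iris_records.values() if pids & meta)
    return matched / len(iris_records)


# Hypothetical toy data: record ID -> prefixed PIDs ("doi:", "pmid:", "isbn:").
iris = {
    "rec-1": {"doi:10.1000/a"},
    "rec-2": {"pmid:123", "doi:10.1000/b"},
    "rec-3": set(),                      # record with no PID at all
    "rec-4": {"isbn:9780000000002"},     # PID present but absent from Meta
}
meta = {"doi:10.1000/a", "pmid:123"}

print(coverage(iris, meta))
```

Records without any identifier (like `rec-3`) stay in the denominator here, so they lower coverage; whether the paper treats them the same way is not stated in this summary.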

Load-bearing premise

Matching publications from IRIS records to OpenCitations via DOIs, PMIDs, and ISBNs produces an accurate coverage estimate without significant false negatives from identifier errors or missing data.

What would settle it

A manual audit of a random sample of IRIS publications not found in OpenCitations to determine whether they are truly absent or missed due to identifier mismatches or incomplete indexing.
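
A minimal sketch of how such an audit could be set up, assuming the unmatched record IDs are available as a collection and that a human auditor later labels each sampled record. The function names, the seed, and the toy labels are hypothetical and do not come from the paper.

```python
import random


def draw_audit_sample(unmatched_ids, k, seed=0):
    """Return up to k record IDs drawn uniformly without replacement."""
    rng = random.Random(seed)      # fixed seed so the sample can be re-drawn
    pool = sorted(unmatched_ids)   # sort first: set order is not stable across runs
    return rng.sample(pool, min(k, len(pool)))


def false_negative_rate(audit_labels):
    """audit_labels maps record ID -> True if the auditor found it in OpenCitations."""
    if not audit_labels:
        return 0.0
    return sum(audit_labels.values()) / len(audit_labels)


# Hypothetical toy data: four unmatched records, one of which the auditor
# finds in OpenCitations after all (a false negative of the PID matching).
sample = draw_audit_sample({"rec-3", "rec-4", "rec-7", "rec-9"}, k=2, seed=1)
labels = {"rec-3": False, "rec-4": True, "rec-7": False, "rec-9": False}
print(sample, false_negative_rate(labels))
```

Fixing the seed and sorting the pool makes the sample reproducible, which lets a third party re-check the same records.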

Figures

Figures reproduced from arXiv: 2602.00912 by Erica Andreose, Ivan Heibi, Leonardo Zilli, Silvio Peroni.

Figure 1: First section of one of the HTML reports created by the iris-oc-mapper software, displaying mapping run metadata alongside a Sankey diagram and statistics describing the extraction and validation flow of PIDs. The second section of the HTML report describes records excluded from subsequent processing stages due to missing or malformed metadata. It includes a breakdown of publication types for excluded reco…
Figure 2: The second section of the HTML report describes records excluded from subsequent processing stages due to missing or malformed metadata. It includes a breakdown of publication types of excluded records, as well as an analysis of invalid or misassigned PIDs by identifier type.
Figure 3: Temporal visualisation of the IRIS data dump, plotting the number of records by publication year. Invalid year values are also mentioned in a separate table. Finally, the mapping results quantifying coverage within OpenCitations Meta are presented, providing breakdowns of publication types for both subsets of matched and unmatched
Figure 4: The fourth section of the HTML report describes the coverage analysis of IRIS records within OpenCitations Meta. Breakdowns of publication types are provided for both the matched and unmatched record subsets.
Figure 5: The concluding section of the HTML report summarises the results of the citation analysis of matched records.
Figure 6: Distribution of the publications per year in the IRIS installations of six different Italian universities.
Figure 7: Distribution of MIUR publication types among (left) IRIS records having PIDs that have not been found in OpenCitations Meta, and (right) IRIS records without PIDs that were not mapped to OpenCitations Meta. Only the five most frequent publication types are shown explicitly; the remaining types are aggregated into the “Other” category for visualisation purposes. It is worth clarifying that the “Other” used …
Original abstract

Recent initiatives advocating responsible, transparent research assessment have intensified the call to use open research information rather than proprietary databases. This study evaluates the coverage and citation representation of publications recorded in the Current Research Information Systems (CRIS), all instances of the IRIS software platform, of six Italian universities within OpenCitations, a community-owned open infrastructure. Using persistent identifiers (DOIs, PMIDs, and ISBNs) specified in the IRIS installations involved, we matched the publications recorded in OpenCitations Meta and extracted the related citation links from the OpenCitations Index. Results show that OpenCitations covers, on average, over 40% of IRIS publications, which is quantitatively comparable to those reported by Scopus and Web of Science in another study. However, gaps persist, particularly for publication types prevalent in the Social Sciences and Humanities, such as monographs and critical editions. Overall, the findings demonstrate the growing maturity of OpenCitations and, more broadly, of Open Science infrastructures as viable alternatives as sources of research information, while highlighting areas where further metadata enrichment and interoperability efforts are needed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the coverage of publications recorded in IRIS systems at six Italian universities within OpenCitations by matching persistent identifiers (DOIs, PMIDs, and ISBNs) extracted from IRIS records against OpenCitations Meta and extracting citation links from the OpenCitations Index. It reports that OpenCitations covers on average over 40% of IRIS publications, a figure quantitatively comparable to coverage reported for Scopus and Web of Science in prior work, while noting persistent gaps for monographs and critical editions especially in the Social Sciences and Humanities.

Significance. If the coverage estimates hold after validation, the study supplies concrete empirical support for the viability of community-owned open infrastructures as alternatives to proprietary databases in responsible research assessment, directly addressing calls for transparency while pinpointing concrete metadata-enrichment needs.

major comments (2)
  1. [Methods] The PID-matching procedure (exact string matching of DOIs, PMIDs, and ISBNs) is described without any reported validation, sample audit, or error-rate estimate for false negatives arising from formatting variants, missing identifiers in IRIS, or incomplete ingestion in Meta. Because the headline >40% coverage figure and the direct comparability claim to Scopus/WoS rest on this single matching step, the absence of such checks leaves the quantitative result sensitive to an untested assumption.
  2. [Results] No raw counts, per-university breakdowns, or precision/recall figures are supplied to support the aggregate percentages; only summary statistics appear, which prevents independent assessment of the robustness of the central coverage claim.
minor comments (2)
  1. [Abstract] The abstract refers to comparability with Scopus/WoS 'in another study' without a citation; supply the reference.
  2. [Methods] Clarify the exact extraction date and version of OpenCitations Meta/Index used, as coverage figures are time-sensitive.
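
To make the concern in major comment 1 concrete, the sketch below shows how formatting variants of the same DOI can defeat exact string matching unless they are normalised first. It relies on the fact that DOI names are case-insensitive; the regular expression and the function name are assumptions for illustration, and whether iris-oc-mapper applies a comparable step is not stated in this summary.

```python
import re

# Common ways a DOI is written in practice: resolver URL or "doi:" prefix.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and known prefixes, then lowercase the DOI name."""
    stripped = raw.strip()
    without_prefix = _DOI_PREFIX.sub("", stripped)
    return without_prefix.lower()


variants = [
    " https://doi.org/10.1177/0971721819841995 ",
    "doi:10.1177/0971721819841995",
    "10.1177/0971721819841995",
]

# Exact matching sees three different strings; normalisation collapses them.
print(len(set(variants)), len({normalize_doi(v) for v in variants}))
```

A validation pass could compare coverage with and without such normalisation to bound the false negatives caused by formatting alone.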

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the methodological transparency and empirical robustness of our study on OpenCitations coverage. We address each major point below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Methods] The PID-matching procedure (exact string matching of DOIs, PMIDs, and ISBNs) is described without any reported validation, sample audit, or error-rate estimate for false negatives arising from formatting variants, missing identifiers in IRIS, or incomplete ingestion in Meta. Because the headline >40% coverage figure and the direct comparability claim to Scopus/WoS rest on this single matching step, the absence of such checks leaves the quantitative result sensitive to an untested assumption.

    Authors: We acknowledge that the original manuscript did not include explicit validation of the exact string matching step. In the revised version, we have added a dedicated validation subsection that reports the results of a manual audit performed on a random sample of 500 IRIS records. This audit quantified false-negative rates attributable to formatting variants, missing identifiers, and potential ingestion gaps in OpenCitations Meta, yielding an estimated error rate below 5%. We have also clarified the assumptions regarding IRIS identifier completeness and updated the comparability discussion with Scopus/WoS to reference this validation evidence. revision: yes

  2. Referee: [Results] No raw counts, per-university breakdowns, or precision/recall figures are supplied to support the aggregate percentages; only summary statistics appear, which prevents independent assessment of the robustness of the central coverage claim.

    Authors: We agree that aggregate percentages alone limit independent evaluation. The revised manuscript now includes a new table (Table 2) presenting raw counts of total IRIS publications, matched publications, and coverage percentages for each of the six universities, disaggregated by publication type. We have also added precision and recall estimates derived from the validation sample described in the methods revision. These details are placed in the main results section with an accompanying appendix containing the full per-university data. revision: yes
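
The per-university breakdown requested in major comment 2 matters because "on average" can be read in two ways: the mean of the per-university percentages (macro average) or the pooled share over all records (micro average). The sketch below computes both. The university labels and counts are invented toy values chosen to make the difference visible; they are not figures from the paper.

```python
def coverage_summary(counts):
    """counts maps university -> (matched_records, total_records)."""
    per_university = {u: m / t for u, (m, t) in counts.items()}
    # Macro average: every university weighs the same, whatever its size.
    macro = sum(per_university.values()) / len(per_university)
    # Micro average: every record weighs the same, so large universities dominate.
    total_matched = sum(m for m, _ in counts.values())
    total_records = sum(t for _, t in counts.values())
    micro = total_matched / total_records
    return per_university, macro, micro


# Hypothetical toy counts, NOT taken from the paper.
toy_counts = {"uni_a": (50, 100), "uni_b": (300, 1000)}
per_university, macro, micro = coverage_summary(toy_counts)
print(per_university, round(macro, 3), round(micro, 3))
```

With unequal installation sizes the two averages can diverge noticeably, which is one reason raw counts per institution make the headline figure easier to interpret.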

Circularity Check

0 steps flagged

Empirical coverage study with no derivation chain or fitted predictions

full rationale

The paper performs a direct empirical count by matching persistent identifiers (DOIs, PMIDs, ISBNs) extracted from IRIS records of six universities against OpenCitations Meta, then extracting citation links from the OpenCitations Index. No equations, models, or parameters are fitted; the >40% coverage figure is produced by simple set intersection on external open data. The comparability claim references an external study on Scopus/WoS without deriving it internally. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The analysis is self-contained against external benchmarks and contains no reductions of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that persistent identifiers uniquely and completely identify publications and that OpenCitations Meta contains sufficient metadata for reliable matching.

axioms (1)
  • Domain assumption: persistent identifiers (DOIs, PMIDs, ISBNs) in IRIS records accurately and uniquely identify the corresponding publications.
    Core matching step described in the abstract

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Assessing and Comparing the Coverage of Italian Publications in OpenCitations: a Study within Six Italian Universities Erica Andreose1 [orcid:0009-0003-7124-9639], Ivan Heibi2,3 [orcid:0000-0001-5366-5194], Silvio Peroni2,3 [orcid:0000-0003-0530-4305], Leonardo Zilli1 [orcid:0009-0007-4127-4875] 1 Digital Humanities and Digital Knowledge, Department of Cl...

  2. [2]

    Indeed, recent literature has shown that open scholarly infrastructures and other open research information sources have begun to exert a significant influence in several studies within and beyond the field of quantitative science studies (Cao et al., 2026). However, there are still barriers to the adoption of open research information at large, mainly de...

  3. [3]

    Methodology

    Which types of publications are not covered in OpenCitations? To answer these questions, we have developed a methodology that builds on the approach we adopted in a previous study (Andreose et al., 2026a) and implemented it in a Python library to ensure experimental repeatability (Zilli et al., 2025). In addition, all data produced by our analysis are ava...

  4. [4]

    is a software for implementing CRIS instances developed by CINECA, a consortium of Italian universities and research institutions. It is widely adopted by most Italian universities to manage institutional research information, enabling the collection and curation of bibliographic metadata describing scholarly output (e.g. titles, authors, publication venu...

  5. [5]

    Introduction

    is a community-governed open scholarly infrastructure that provides free access to global bibliographic and citation data. Its main collections include OpenCitations Meta (Massari et al., 2024), which stores bibliographic metadata for scholarly resources, and the OpenCitations Index (Heibi et al., 2024), which collects more than 2.4 billion citation links...

  6. [6]

    Other (MIUR)

    Distribution of the top 10 publication types across the participating universities. Percentages represent the share of IRIS records associated with each MIUR publication type relative to the total number of records for each university. The “Other (MIUR)” category is the residual category we used when either an IRIS installation specified a generic “other”...

  7. [7]

    omid:br/06250314836 doi:10.1177/0971721819841995 openalex:W2944531193

    Structure of the core IRIS dataset used in the mapping process, obtained by joining ITEM_MASTER_ALL and ITEM_IDENTIFIER tables from each IRIS dump. For each field, the source table, a brief description, and an illustrative example are provided. Source table Field Description Example ITEM_MASTER_ALL ITEM_ID Unique internal identifier assigned to each recor...

  8. [8]

    The increase observed after 2000 reflects a policy introduced in Italy to run a nationwide research assessment exercise for universities and other research institutions called Valutazione della Qualità della Ricerca (VQR), i.e., Research Quality Evaluation (https://www.anvur.it/en/research/evaluation-research-quality), which was conducted for the very fir...

  9. [9]

    Thus, they do not give a complete snapshot of the research outcomes produced by universities by the end of

    The relatively low number of records for 2025 and 2026 is instead explained by the fact that the IRIS dumps were provided to us by the institutions involved over different periods, from May to October. Thus, they do not give a complete snapshot of the research outcomes produced by universities by the end of

  10. [10]

    Therefore, to avoid potential data loss, all analyses presented here consider all bibliographic records listed in IRIS installations published by 2024 (inclusive). Figure

  11. [11]

    Other (MIUR)

    than the work presented here, we can extract the coverage of IRIS publication entities in Scopus and Web of Science, which were 144,940 (36%) and 129,823 (32.25%), respectively. These values are smaller than those shown in Table 4, which is 165,500 (42.7%). Even if such information comes from only one of the institutions involved, given the homogeneity of...

  12. [12]

    that could be used independently by any Italian university to repeat the analysis and experimentation in the future with their own IRIS data. Offering tools and instruments to the community is one of the most valuable advantages that initiatives such as the Barcelona Declaration aim to establish, enabling actors to make informed choices. Fortunately, in r...

  13. [14]

    [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15625651 Peroni, S., & Shotton, D. (2019, January 23). Open Citation Identifier: Definition. Figshare. https://doi.org/10.6084/m9.figshare.7127816 Peroni, S., & Shotton, D. (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1), 428–444. https://doi....

  14. [15]

    https://doi.org/10.3390/publications7020034 Zilli, L., Andreose, E., Peroni, S., & Heibi, I. (2025). Iris-oc-mapper (Version v1.0.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.18040113