Biomedical Open Source Software: Crucial Packages and Hidden Heroes

Andrew Nesbitt; Boris Veytsman; Daniel Mietchen; Eva Maxfield Brown; James Howison; João Felipe Pimentel; Laurent Hébert-Dufresne; Stephan Druskat

arxiv: 2404.06672 · v6 · submitted 2024-04-10 · 💻 cs.SE · cs.CY


Pith reviewed 2026-05-24 02:23 UTC · model grok-4.3


keywords: software dependencies, centrality metrics, biomedical research, open source software, PyPI, CRAN, Bioconductor, dependency networks


The pith

Centrality metrics on software dependency networks identify the foundational packages biomedical research depends on most.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors extract software mentions from biomedical papers and trace their upstream dependencies across three ecosystems. They define centrality measures on the resulting dependency graphs to rank packages by how many others rely on them, directly or indirectly. This approach surfaces packages that sit deep in the stack and are rarely named in papers themselves. The work demonstrates that citation or mention data alone misses much of the actual infrastructure. If the ranking holds, stakeholders can direct maintenance and funding toward the packages whose failure would affect the largest share of research.

Core claim

Using the CZ Software Mentions Dataset, the paper builds directed dependency graphs for packages drawn from PyPI, CRAN, and Bioconductor that appear in biomedical literature, then computes centrality scores on those graphs; the packages that receive the highest scores are presented as the critical, often invisible, components of the biomedical software ecosystem.

What carries the argument

Centrality metrics computed on the directed graph whose nodes are software packages and whose edges point from a package to its upstream dependencies.
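As a minimal sketch of that graph, the simplest centrality is the number of direct dependents a package has, which is the in-degree when edges point upstream. The package names and edges below are hypothetical, and this is not necessarily one of the metrics the paper itself uses.

```python
from collections import Counter

# Hypothetical (dependent, dependency) pairs; every edge points upstream.
EDGES = [
    ("biotool", "plotlib"),
    ("biotool", "statlib"),
    ("plotlib", "numlib"),
    ("statlib", "numlib"),
]


def direct_dependents(edges):
    """Count, for every package, how many packages declare it as a dependency."""
    # Start every node at zero so packages nobody depends on still appear.
    counts = Counter({node: 0 for edge in edges for node in edge})
    counts.update(dependency for _, dependency in edges)
    return counts


counts = direct_dependents(EDGES)
for name, n in counts.most_common():
    print(f"{name}: {n} direct dependents")
```

In this toy graph the package that no paper would name directly, `numlib`, collects the most direct dependents, which is the pattern the pith describes.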

If this is right

High-centrality packages can be flagged for priority maintenance and funding because their removal would affect the largest number of research workflows.
The same network construction can be repeated on other scientific domains to locate their own hidden foundational packages.
Metrics that combine direct mentions with indirect dependency reach can replace simple citation counts when evaluating software impact.
Ecosystem maintainers gain a quantitative way to decide which packages deserve dedicated support staff or long-term archiving.
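The first point can be made concrete by counting, for each package, everything that relies on it directly or indirectly. The sketch below walks the reversed dependency edges breadth-first. The package names are hypothetical and the function is an illustration, not the paper's procedure.

```python
from collections import defaultdict, deque

# Hypothetical (dependent, dependency) pairs; every edge points upstream.
EDGES = [
    ("biotool", "plotlib"),
    ("biotool", "statlib"),
    ("plotlib", "numlib"),
    ("statlib", "numlib"),
]


def transitive_dependents(edges):
    """Map each package to the set of packages that rely on it, directly or indirectly."""
    edges = list(edges)
    reverse = defaultdict(set)  # dependency -> direct dependents
    nodes = set()
    for dependent, dependency in edges:
        reverse[dependency].add(dependent)
        nodes.update((dependent, dependency))

    reach = {}
    for start in nodes:
        seen = set()
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for parent in reverse.get(current, ()):
                # Skip already-visited packages so cycles cannot loop forever.
                if parent != start and parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        reach[start] = seen
    return reach


reach = transitive_dependents(EDGES)
for name in sorted(reach, key=lambda n: (-len(reach[n]), n)):
    print(f"{name}: affects {len(reach[name])} packages")
```

A count like this treats every dependent equally. Weighting dependents by how often papers mention them would be a separate design choice.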

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If centrality rankings remain stable across successive yearly snapshots of the dataset, they could serve as an early-warning system for packages drifting into critical status.
The method could be extended by weighting edges according to how often a dependency is actually invoked in code, rather than treating every declared dependency equally.
Cross-ecosystem comparison might reveal whether one language community (Python versus R) concentrates risk in fewer foundational packages than the other.
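One way to put a number on the concentration question in the last point is a Gini coefficient over each ecosystem's centrality scores. The reference list includes Gini [42], but how the paper applies it is not shown on this page, so the sketch below is an assumption-laden illustration with made-up scores.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means evenly spread, near 1 means concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical centrality scores for two ecosystems.
evenly_spread = [1.0, 1.0, 1.0, 1.0]
concentrated = [0.0, 0.0, 0.0, 1.0]

print(gini(evenly_spread))
print(gini(concentrated))
```

A higher value for one ecosystem would suggest that fewer packages carry most of the centrality there, under whatever centrality measure supplied the scores.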

Load-bearing premise

The CZ Software Mentions Dataset supplies a representative sample of the packages and dependency links actually used in biomedical papers.

What would settle it

A fresh, independent extraction of software mentions from a new corpus of biomedical papers that produces a materially different top-ranked set of packages by the same centrality measures.
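"Materially different" needs a yardstick. One simple candidate is the Jaccard overlap between the top-k packages of two rankings. The sketch below assumes each ranking is a mapping from package name to centrality score. The names, scores, and choice of k are all hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring names, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {name for name, _ in ranked[:k]}


def top_k_jaccard(scores_a, scores_b, k):
    """Jaccard overlap (0 to 1) between the top-k sets of two rankings."""
    a, b = top_k(scores_a, k), top_k(scores_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical centrality scores from two independent extractions.
original = {"numlib": 2.5, "plotlib": 1.5, "statlib": 1.5, "biotool": 1.0}
replication = {"numlib": 3.0, "statlib": 2.0, "otherlib": 1.8, "plotlib": 1.0}

print(top_k_jaccard(original, replication, k=2))
```

An overlap near 1 would suggest the ranking survives a change of corpus. The threshold that counts as a material difference is a judgment call the sketch does not settle.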

Figures

Figures reproduced from arXiv: 2404.06672 by Andrew Nesbitt, Boris Veytsman, Daniel Mietchen, Eva Maxfield Brown, James Howison, João Felipe Pimentel, Laurent Hébert-Dufresne, Stephan Druskat.

**Figure 1.** Classification of software packages inspired by Stokes’ classification system in [21]. “Nebraska” packages are software projects which have few mentions in research articles, but are highly central in a dependency network. “Pasteur” packages are both highly visible with lots of mentions and are highly central in a dependency network.
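The two categories named in the Figure 1 caption can be expressed as a small decision rule over mention counts and centrality. The cutoffs below are hypothetical, and the caption as shown here does not name the low-centrality categories, so the sketch leaves them unlabelled rather than guessing.

```python
def classify(mentions, centrality, mention_cutoff, centrality_cutoff):
    """Assign a package to one of the categories named in the Figure 1 caption.

    Both cutoffs are illustrative assumptions, not values from the paper.
    """
    if centrality >= centrality_cutoff:
        # Highly central: visible packages are "Pasteur", hidden ones "Nebraska".
        return "Pasteur" if mentions >= mention_cutoff else "Nebraska"
    return "unlabelled (low centrality)"


# Hypothetical packages as (mention count, centrality score).
packages = {
    "numlib": (2, 0.9),
    "plotlib": (500, 0.8),
    "biotool": (40, 0.1),
}

labels = {
    name: classify(m, c, mention_cutoff=10, centrality_cutoff=0.5)
    for name, (m, c) in packages.items()
}
print(labels)
```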

**Figure 2.** (a) Network visualization of software packages from three ecosystems (from CRAN in green, PyPI in blue, and Bioconductor in pink) connected through their dependencies within their ecosystem and interconnected through papers that mention them. We label the top 3 most central packages in each ecosystem: ggplot2 [33], SAM [34], and PRISMA [35] for CRAN, velvet [36], tophat and pymol [37] for PyPI and DeSeq2 […

**Figure 3.** Distribution of packages by Katz centrality and counts of their mentions in papers. Katz centrality is calculated for an unweighted graph, for a weighted graph with all nodes, or just for the largest connected cluster (LCC) for each ecosystem. In the calculations, we assumed β = 1.
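For readers unfamiliar with Katz centrality [30], the sketch below uses a common textbook formulation on an unweighted graph: each package scores β plus α times the summed scores of the packages that depend on it. β = 1 follows the caption. The attenuation factor α = 0.5, the package names, and the iteration scheme are assumptions, since the paper's exact formula is not reproduced on this page.

```python
from collections import defaultdict

# Hypothetical (dependent, dependency) pairs; every edge points upstream.
EDGES = [
    ("biotool", "plotlib"),
    ("biotool", "statlib"),
    ("plotlib", "numlib"),
    ("statlib", "numlib"),
]


def katz_centrality(edges, alpha=0.5, beta=1.0, tol=1e-12, max_iter=1000):
    """Iterate x_i = alpha * sum(x_j for each dependent j of i) + beta to a fixed point."""
    edges = list(edges)
    dependents = defaultdict(list)  # dependency -> packages that depend on it
    nodes = set()
    for dependent, dependency in edges:
        dependents[dependency].append(dependent)
        nodes.update((dependent, dependency))
    if not nodes:
        return {}

    scores = {node: beta for node in nodes}
    for _ in range(max_iter):
        updated = {
            node: alpha * sum(scores[d] for d in dependents.get(node, ())) + beta
            for node in nodes
        }
        if max(abs(updated[n] - scores[n]) for n in nodes) < tol:
            return updated
        scores = updated
    raise RuntimeError("Katz iteration did not converge; try a smaller alpha")


scores = katz_centrality(EDGES)
for name in sorted(scores, key=lambda n: (-scores[n], n)):
    print(f"{name}: {scores[name]:.2f}")
```

On an acyclic dependency graph the iteration settles after a number of passes bounded by the depth of the graph. With cycles, α has to be small enough for the sum to converge.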

Original abstract

Despite the importance of scientific software for research, it is often not formally recognized and rewarded. This is especially true for foundational libraries, which are hidden below packages visible to the users (and thus doubly hidden, since even the packages directly used in research are frequently not visible in the paper). Research stakeholders like funders, infrastructure providers, and other organizations need to understand the complex network of computer programs that contemporary research relies upon. In this work, we use the CZ Software Mentions Dataset to map the upstream dependencies of software used in biomedical papers and find the packages critical to scientific software ecosystems. We propose centrality metrics for the network of software dependencies, analyze three ecosystems (PyPI, CRAN, Bioconductor), and determine the packages with the highest centrality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies network centrality to mentions-derived dependency graphs in three ecosystems but leaves the CZ dataset's representativeness for biomedical use unvalidated.


This paper takes the CZ Software Mentions Dataset, builds dependency networks for packages mentioned in biomedical papers, and ranks them by centrality in PyPI, CRAN, and Bioconductor to highlight foundational but hidden libraries. The core move is reasonable: it connects paper-level mentions to upstream dependencies in a way that could inform priorities for scientific infrastructure support.

It does a solid job of scoping the work to three established ecosystems and using an existing public dataset rather than starting from scratch. That keeps the analysis grounded in real extraction data instead of abstract claims. The framing around doubly hidden packages is clear and matches the practical problem of recognizing software that papers never cite directly.

The main limitation is the data foundation. The abstract and stress-test note give no coverage checks, no comparison against full-text corpora like PubMed Central, and no bias analysis for the mentions extraction. If the dataset under-samples certain layers or ecosystems, the centrality rankings won't reliably identify what is actually critical. The text also skips details on graph construction from mentions or the exact centrality formulas, so the results can't be judged for robustness from what's provided.

This sits squarely in research software studies. Readers working on software sustainability, funding models, or infrastructure policy could use the rankings as an exploratory map, but they would need the methods section expanded before treating the lists as actionable.

The work shows straightforward engagement with the literature on software impact and avoids overclaiming. It deserves peer review so referees can check the validation steps and graph details; the idea is concrete enough to justify the time even if it needs revisions on data representativeness.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that the CZ Software Mentions Dataset can be used to map upstream dependencies of software mentioned in biomedical papers, that centrality metrics can be defined on the resulting dependency networks, and that analysis of the PyPI, CRAN, and Bioconductor ecosystems reveals the packages with highest centrality that are critical yet hidden in scientific software stacks.

Significance. If the dataset is shown to be representative and the centrality definitions are made explicit and reproducible, the work could help funders and infrastructure providers identify foundational packages that merit greater recognition. The multi-ecosystem scope is a constructive feature. The grounding in an external mention dataset is noted as a positive, data-driven approach.

major comments (2)

[Data and Methods] Data section: no coverage statistics, comparison against an independent corpus (e.g., PubMed Central full-text), or bias analysis is supplied to establish that the CZ Software Mentions Dataset supplies a representative sample of packages and dependency relations actually invoked in biomedical papers. This assumption is load-bearing for the upstream-dependency mapping and all subsequent centrality rankings.
[Methods] Centrality definition: the manuscript does not supply explicit formulas or pseudocode for the proposed centrality metrics on the dependency graphs, nor does it report how the graphs are constructed from mentions (e.g., edge-weighting, handling of transitive dependencies). Without these, the claim that the highest-centrality packages are the “critical” ones cannot be evaluated.

minor comments (1)

[Abstract] Abstract: the three ecosystems are named but the scale of the extracted networks (number of nodes/edges) is not stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional transparency and validation will strengthen the manuscript. We agree that both the representativeness of the dataset and the explicit definition of centrality metrics require elaboration. We will revise the manuscript to incorporate these elements as detailed below.


Referee: [Data and Methods] Data section: no coverage statistics, comparison against an independent corpus (e.g., PubMed Central full-text), or bias analysis is supplied to establish that the CZ Software Mentions Dataset supplies a representative sample of packages and dependency relations actually invoked in biomedical papers. This assumption is load-bearing for the upstream-dependency mapping and all subsequent centrality rankings.

Authors: We agree that the manuscript should demonstrate the representativeness of the CZ Software Mentions Dataset. The current version relies on the dataset without providing coverage statistics or bias analysis. In the revision we will add a dedicated subsection to the Data section that reports: the total number of papers and unique packages extracted; basic coverage metrics such as the fraction of biomedical papers containing software mentions; a comparison against a random sample of PubMed Central full-text articles (reporting overlap in mentioned packages); and a brief discussion of potential biases (e.g., field or ecosystem skew). These additions will directly support the validity of the downstream dependency mapping and centrality results. revision: yes
Referee: [Methods] Centrality definition: the manuscript does not supply explicit formulas or pseudocode for the proposed centrality metrics on the dependency graphs, nor does it report how the graphs are constructed from mentions (e.g., edge-weighting, handling of transitive dependencies). Without these, the claim that the highest-centrality packages are the “critical” ones cannot be evaluated.

Authors: We acknowledge that the manuscript describes the centrality metrics at a conceptual level but omits explicit formulas, pseudocode, and graph-construction details. In the revised Methods section we will insert: (i) the precise mathematical definitions of the centrality measures applied to the directed dependency graphs (including any adaptations of standard metrics such as degree or betweenness); (ii) pseudocode outlining the graph-construction procedure from the mention data; (iii) the chosen edge-weighting scheme (mention frequency); and (iv) the decision to use direct dependencies only, with a short justification for not computing transitive closures. These additions will make the “critical package” identification fully reproducible and evaluable. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical mapping from external dataset using standard network metrics


The paper constructs dependency networks from the CZ Software Mentions Dataset (an external resource) and applies standard centrality metrics to rank packages in PyPI, CRAN, and Bioconductor. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided abstract or described approach. Results are presented as direct extractions and rankings from the input data rather than derivations that reduce to the paper's own definitions or prior outputs by construction. The analysis remains self-contained against the external dataset benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the named dataset faithfully captures real dependency usage.



Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 2 internal anchors

[1] Howison J, Herbsleb JD. Scientific Software Production: Incentives and Collaboration. In: Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work. CSCW ’11. New York, NY, USA: Association for Computing Machinery; 2011. p. 513–522. Available from: https://doi.org/10.1145/1958824.1958904
[2] Howison J, Deelman E, McLennan MJ, Ferreira da Silva R, Herbsleb JD. Understanding the scientific software ecosystem and its impact: Current and future measures. Research Evaluation. 2015;24(4):454–470. doi:10.1093/reseval/rvv014
[3] Singh Chawla D. The unsung heroes of scientific software. Nature. 2016;529(7584):115–116. doi:10.1038/529115a
[4] Howison J, Bullard J. Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. JASIST. 2016;67(9):2137–2155. doi:10.1002/asi.23538
[5] Knowles R, Mateen BA, Yehudi Y. We need to talk about the lack of investment in digital research infrastructure. Nature Computational Science. 2021;1(3):169–171. doi:10.1038/s43588-021-00048-5
[6] Druskat S, Hong NPC, Buzzard S, Konovalov O, Kornek P. Don’t Mention It: An Approach to Assess Challenges to Using Software Mentions for Citation and Discoverability Research. arXiv. 2024;2024(arXiv:2402.14602). doi:10.48550/arXiv.2402.14602
[7] Schindler D, Bensmann F, Dietze S, Krüger F. SoMeSci—A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. New York, NY, USA: Association for Computing Machinery; 2021. p. 4574–4583. Available from: https://doi.org/10.1... doi:10.1145/3459637.3482017
[8] Du C, Cohoon J, Lopez P, Howison J. SoftCite dataset: A dataset of software mentions in biomedical and economic research publications. JASIST. 2021;72(7):870–884. doi:10.1002/asi.24454
[9] Istrate AM, Veytsman B, Li D, Taraborelli D, Torkar M, Williams I. CZ Software Mentions: A large dataset of software mentions in the biomedical literature; 2022. Available from: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c
[10] Istrate AM, Li D, Taraborelli D, Torkar M, Veytsman B, Williams I. A large dataset of software mentions in the biomedical literature. arXiv. 2022. doi:10.48550/ARXIV.2209.00693
[11] Bogart C, Howison J, Herbsleb J. Guiding Development Work Across a Software Ecosystem by Visualizing Usage Data. arXiv e-prints. 2020; p. arXiv:2012.05987. doi:10.48550/arXiv.2012.05987
[12] Hatta M. The Nebraska problem in open source software development. Annals of Business Administrative Science. 2022;21(5):91–102. doi:10.7880/abas.0220914a
[13] Goodin D. What we know about the xz utils backdoor that almost infected the world; 2024. Ars Technica. Available from: https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/
[14] Samuel S, Mietchen D. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience. 2024;13. doi:10.1093/GIGASCIENCE/GIAD113
[15] Munroe RP. Dependency; 2020. Available from: https://xkcd.com/2347/
[16] Katz DS. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software. 2014. doi:10.5334/jors.be
[17] Katz DS, Smith AM. Implementing Transitive Credit with JSON-LD. arXiv. 2014. doi:10.48550/arXiv.1407.5117
[18] Druskat S, Spaaks JH, Chue Hong N, Haines R, Baker J, Bliven S, et al. Citation File Format; 2021. Available from: https://doi.org/10.5281/zenodo.5171937
[19] Druskat S. Software and Dependencies in Research Citation Graphs. Computing in Science & Engineering. 2020;22(2):8–21. doi:10.1109/MCSE.2019.2952840
[20] Bogart C, Kästner C, Herbsleb J, Thung F. When and How to Make Breaking Changes: Policies and Practices in 18 Open Source Software Ecosystems. ACM Trans Softw Eng Methodol. 2021;30(4). doi:10.1145/3447245
[21] Stokes DE. Pasteur’s Quadrant: Basic Science and Technological Innovation. Washington, D.C.: Brookings Institute Press; 1997
[22] Brown EM, Nesbitt A, Hébert-Dufresne L, Veytsman B, Pimentel JaF, Druskat S, et al. Exploring the dependencies of the CZI mentions dataset; 2023. Available from: https://github.com/borisveytsman/SoftwareImpactHackathon2023_Tracing_dependencies
[23] Nesbitt A. Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science; 2023. Zenodo
[24] Priem J, Piwowar H, Orr R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv e-prints. 2022; p. arXiv:2205.01833. doi:10.48550/arXiv.2205.01833
[25] Brown EM. A Dependency Graph for 460,000 Papers and Their Software Mentions from the CZI Software Mentions Dataset; 2023. Available from: https://doi.org/10.5281/zenodo.10048132
[26] GEXF Working Group. GEXF File Format; 2009. Available from: https://gexf.net/
[27] Borgatti SP, Everett MG. Three Perspectives on Centrality. In: Light R, Moody J, editors. The Oxford Handbook of Social Networks. Oxford University Press
[28] Bonacich P. Some unique properties of eigenvector centrality. Social Networks. 2007;29(4):555–564. doi:10.1016/J.SOCNET.2007.04.002
[29] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks. 1998;30(1-7):107–117. doi:10.1016/S0169-7552(98)00110-X
[30] Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43. doi:10.1007/BF02289026
[31] Rogers EM. Diffusion of Innovations, 5th Edition. Free Press; 2003
[32] Sun J, Tang J. A Survey of Models and Algorithms for Social Influence Analysis. In: Aggarwal CC, editor. Social Network Data Analytics. Boston, MA: Springer US; 2011. p. 177–214
[33] Wickham H. ggplot2. Wiley interdisciplinary reviews: computational statistics. 2011;3(2):180–185
[34] Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(5):1009–1030
[35] Krueger T, Gascon H, Krämer N, Rieck K. Learning stateful models for network honeypots. In: Proceedings of the 5th ACM workshop on Security and artificial intelligence; 2012. p. 37–48
[36] Wood S. Velvet; 2015. Available from: https://pypi.org/project/velvet
[37] DeLano WL, et al. Pymol: An open-source molecular graphics tool. CCP4 Newsl protein crystallogr. 2002;40(1):82–92
[38] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology. 2014;15(12):550
[39] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. bioinformatics. 2010;26(1):139–140
[40] Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. Springer; 2005. p. 397–420
[41] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research. 2015;43(7):e47–e47
[42] Gini C. Concentration and dependency ratios [English translation of the original 1909 paper]. Rivista di Politica Economica. 1997;87:769–789
[43] Wickham H, Henry L, Vaughan D. vctrs: Vector Helpers; 2023. Available from: https://CRAN.R-project.org/package=vctrs
[44] Hester J, Henry L, Müller K, Ushey K, Wickham H, Chang W. withr: Run Code ‘With’ Temporarily Modified Global State; 2024. Available from: https://CRAN.R-project.org/package=withr
[45] Wickham H, Wilke CO, Pedersen TL. isoband: Generate Isolines and Isobands from Regularly Spaced Elevation Grids; 2022. Available from: https://CRAN.R-project.org/package=isoband
[46] Schultz D, Ebbert M, De Coster W. newick; 2021. Available from: https://pypi.org/project/pauvre/
[47] Forkel R. Newick; 2025. Available from: https://pypi.org/project/newick/
[48] Python Packaging Authority. setuptools; 2025. Available from: https://pypi.org/project/setuptools/
[49] Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the tidyverse. Journal of Open Source Software. 2019;4(43):1686. doi:10.21105/joss.01686
[50] Zerbino DR, Foret S, Gurney JM, Slater G, Birney E, Marshall J, et al. Velvet [Software]. Software Heritage. 2014
[51] Zerbino DR, Birney E. Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs. Genome Research. 2008;18(5):821–829. doi:10.1101/gr.074492.107
[52] The TopHat developers. TopHat; 2012. Available from: https://pypi.org/project/TopHat
[53] Trapnell C, Pachter L, Salzberg SL. TopHat: Discovering Splice Junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–1111. doi:10.1093/bioinformatics/btp120
[54] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions. Genome Biology. 2013;14(4):R36. doi:10.1186/gb-2013-14-4-r36
[55] Swift ML. GraphPad prism, data analysis, and scientific graphing. Journal of chemical information and computer sciences. 1997;37(2):411–412
[56] Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. In: International AAAI Conference on Weblogs and Social Media. AAAI; 2009. p. 361–362. Available from: http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
[57] Blackford LS, Petitet A, Pozo R, Remington K, Whaley RC, Demmel J, et al. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software. 2002;28(2):135–151
[58] Anderson E, Bai Z, Bischof C, Blackford LS, Demmel J, Dongarra J, et al. LAPACK users’ guide. SIAM; 1999
[59] Trujillo MZ, Hébert-Dufresne L, Bagrow J. The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative. EPJ Data Science. 2022;11(1):31
[60] Howison J, Ram K. Support scientific software infrastructure by requiring SBOMs for federally funded research; 2024. Available from: https://fas.org/publication/sboms-hardware/

[1] [1]

Scientific Software Production: Incentives and Collaboration

Howison J, Herbsleb JD. Scientific Software Production: Incentives and Collaboration. In: Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work. CSCW ’11. New York, NY, USA: Association for Computing Machinery; 2011. p. 513–522. Available from: https://doi.org/10.1145/1958824.1958904

work page doi:10.1145/1958824.1958904 2011

[2] [2]

Understanding the scientific software ecosystem and its impact: Current and future measures

Howison J, Deelman E, McLennan MJ, Ferreira da Silva R, Herbsleb JD. Understanding the scientific software ecosystem and its impact: Current and future measures. Research Evaluation. 2015;24(4):454–470. doi:10.1093/reseval/rvv014

work page doi:10.1093/reseval/rvv014 2015

[3] [3]

The unsung heroes of scientific software

Singh Chawla D. The unsung heroes of scientific software. Nature. 2016;529(7584):115–116. doi:10.1038/529115a

work page doi:10.1038/529115a 2016

[4] [4]

Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature

Howison J, Bullard J. Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. JASIST. 2016;67(9):2137–2155. doi:10.1002/asi.23538. November 7, 2025 16/20

work page doi:10.1002/asi.23538 2016

[5] [5]

We need to talk about the lack of investment in digital research infrastructure

Knowles R, Mateen BA, Yehudi Y. We need to talk about the lack of investment in digital research infrastructure. Nature Computational Science. 2021;1(3):169–171. doi:10.1038/s43588-021-00048-5

work page doi:10.1038/s43588-021-00048-5 2021

[6] [6]

Don’t Mention It: An Approach to Assess Challenges to Using Software Mentions for Citation and Discoverability Research

Druskat S, Hong NPC, Buzzard S, Konovalov O, Kornek P. Don’t Mention It: An Approach to Assess Challenges to Using Software Mentions for Citation and Discoverability Research. arXiv. 2024;2024(arXiv:2402.14602). doi:10.48550/arXiv.2402.14602

work page doi:10.48550/arxiv.2402.14602 2024

[7] [7]

SoMeSci—A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Schindler D, Bensmann F, Dietze S, Krüger F. SoMeSci—A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. New York, NY, USA: Association for Computing Machinery; 2021. p. 4574–4583. Available from: https://doi.org/10.1...

work page doi:10.1145/3459637.3482017 2021

[8] [8]

SoftCite dataset: A dataset of software mentions in biomedical and economic research publications

Du C, Cohoon J, Lopez P, Howison J. SoftCite dataset: A dataset of software mentions in biomedical and economic research publications. JASIST. 2021;72(7):870–884. doi:10.1002/asi.24454

work page doi:10.1002/asi.24454 2021

[9] [9]

CZ Software Mentions: A large dataset of software mentions in the biomedical literature; 2022

Istrate AM, Veytsman B, Li D, Taraborelli D, Torkar M, Williams I. CZ Software Mentions: A large dataset of software mentions in the biomedical literature; 2022. Available from: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c

work page doi:10.5061/dryad.6wwpzgn2c 2022

[10] [10]

A large dataset of software mentions in the biomedical literature

Istrate AM, Li D, Taraborelli D, Torkar M, Veytsman B, Williams I. A large dataset of software mentions in the biomedical literature. arXiv. 2022;doi:10.48550/ARXIV.2209.00693

work page doi:10.48550/arxiv.2209.00693 2022

[11] [11]

Guiding Development Work Across a Software Ecosystem by Visualizing Usage Data

Bogart C, Howison J, Herbsleb J. Guiding Development Work Across a Software Ecosystem by Visualizing Usage Data. arXiv e-prints. 2020; p. arXiv:2012.05987. doi:10.48550/arXiv.2012.05987

work page doi:10.48550/arxiv.2012.05987 2020

[12] [12]

The Nebraska problem in open source software development

Hatta M. The Nebraska problem in open source software development. Annals of Business Administrative Science. 2022;21(5):91–102. doi:10.7880/abas.0220914a

work page doi:10.7880/abas.0220914a 2022

[13] [13]

What we know about the xz utils backdoor that almost infected the world; 2024

Goodin D. What we know about the xz utils backdoor that almost infected the world; 2024. Ars Technica. Available from: https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/

work page 2024

[14] [14]

Computational reproducibility of Jupyter notebooks from biomedical publications

Samuel S, Mietchen D. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience. 2024;13. doi:10.1093/GIGASCIENCE/GIAD113

work page doi:10.1093/gigascience/giad113 2024

[15] [15]

Dependency; 2020

Munroe RP. Dependency; 2020. Available from: https://xkcd.com/2347/

work page 2020

[16] [16]

Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products

Katz DS. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software. 2014;doi:10.5334/jors.be

work page doi:10.5334/jors.be 2014

[17] [17]

Implementing Transitive Credit with JSON-LD

Katz DS, Smith AM. Implementing Transitive Credit with JSON-LD. arXiv. 2014;doi:10.48550/arXiv.1407.5117

work page doi:10.48550/arxiv.1407.5117 2014

[18] [18]

Citation File Format; 2021

Druskat S, Spaaks JH, Chue Hong N, Haines R, Baker J, Bliven S, et al. Citation File Format; 2021. Available from: https://doi.org/10.5281/zenodo.5171937.

work page doi:10.5281/zenodo.5171937 2021

[19] [19]

Software and Dependencies in Research Citation Graphs

Druskat S. Software and Dependencies in Research Citation Graphs. Computing in Science & Engineering. 2020;22(2):8–21. doi:10.1109/MCSE.2019.2952840

work page doi:10.1109/mcse.2019.2952840 2020

[20] [20]

When and How to Make Breaking Changes: Policies and Practices in 18 Open Source Software Ecosystems

Bogart C, Kästner C, Herbsleb J, Thung F. When and How to Make Breaking Changes: Policies and Practices in 18 Open Source Software Ecosystems. ACM Trans Softw Eng Methodol. 2021;30(4). doi:10.1145/3447245

work page doi:10.1145/3447245 2021

[21] [21]

Pasteur’s Quadrant: Basic Science and Technological Innovation

Stokes DE. Pasteur’s Quadrant: Basic Science and Technological Innovation. Washington, D. C.: Brookings Institute Press; 1997

work page 1997

[22] [22]

Exploring the dependencies of the CZI mentions dataset; 2023

Brown EM, Nesbitt A, Hébert-Dufresne L, Veytsman B, Pimentel JF, Druskat S, et al. Exploring the dependencies of the CZI mentions dataset; 2023. Available from: https://github.com/borisveytsman/SoftwareImpactHackathon2023_Tracing_dependencies

work page 2023

[23] [23]

Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science; 2023

Nesbitt A. Package and Dependency Metadata for CZI Hackathon: Mapping the Impact of Research Software in Science; 2023. Zenodo

work page 2023

[24] [24]

OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

Priem J, Piwowar H, Orr R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv e-prints. 2022; p. arXiv:2205.01833. doi:10.48550/arXiv.2205.01833

work page doi:10.48550/arxiv.2205.01833 2022

[25] [25]

A Dependency Graph for 460,000 Papers and Their Software Mentions from the CZI Software Mentions Dataset; 2023

Brown EM. A Dependency Graph for 460,000 Papers and Their Software Mentions from the CZI Software Mentions Dataset; 2023. Available from: https://doi.org/10.5281/zenodo.10048132

work page doi:10.5281/zenodo.10048132 2023

[26] [26]

GEXF File Format; 2009

GEXF Working Group. GEXF File Format; 2009. Available from: https://gexf.net/

work page 2009

[27] [27]

Three Perspectives on Centrality

Borgatti SP, Everett MG. Three Perspectives on Centrality. In: Light R, Moody J, editors. The Oxford Handbook of Social Networks. Oxford University Press

work page

[28] [28]

Some unique properties of eigenvector centrality

Bonacich P. Some unique properties of eigenvector centrality. Social Networks. 2007;29(4):555–564. doi:10.1016/J.SOCNET.2007.04.002

work page doi:10.1016/j.socnet.2007.04.002 2007

[29] [29]

The anatomy of a large-scale hypertextual Web search engine

Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks. 1998;30(1-7):107–117. doi:10.1016/S0169-7552(98)00110-X

work page doi:10.1016/s0169-7552(98)00110-x 1998

[30] [30]

A new status index derived from sociometric analysis

Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43. doi:10.1007/BF02289026

work page doi:10.1007/bf02289026 1953

[31] [31]

Diffusion of Innovations, 5th Edition

Rogers EM. Diffusion of Innovations, 5th Edition. Free Press; 2003

work page 2003

[32] [32]

A Survey of Models and Algorithms for Social Influence Analysis

Sun J, Tang J. A Survey of Models and Algorithms for Social Influence Analysis. In: Aggarwal CC, editor. Social Network Data Analytics. Boston, MA: Springer US; 2011. p. 177–214

work page 2011

[33] [33]

Wickham H. ggplot2. Wiley interdisciplinary reviews: computational statistics. 2011;3(2):180–185

work page 2011

[34] [34]

Sparse additive models

Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2009;71(5):1009–1030

work page 2009

[35] [35]

Learning stateful models for network honeypots

Krueger T, Gascon H, Krämer N, Rieck K. Learning stateful models for network honeypots. In: Proceedings of the 5th ACM workshop on Security and artificial intelligence; 2012. p. 37–48.

work page 2012

[36] [36]

Velvet; 2015

Wood S. Velvet; 2015. Available from: https://pypi.org/project/velvet

work page 2015

[37] [37]

Pymol: An open-source molecular graphics tool

DeLano WL, et al. Pymol: An open-source molecular graphics tool. CCP4 Newsl protein crystallogr. 2002;40(1):82–92

work page 2002

[38] [38]

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology. 2014;15(12):550

work page 2014

[39] [39]

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. bioinformatics. 2010;26(1):139–140

work page 2010

[40] [40]

Limma: linear models for microarray data

Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. Springer; 2005. p. 397–420

work page 2005

[41] [41]

limma powers differential expression analyses for RNA-sequencing and microarray studies

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research. 2015;43(7):e47–e47

work page 2015

[42] [42]

Concentration and dependency ratios [English translation of the original 1909 paper]

Gini C. Concentration and dependency ratios [English translation of the original 1909 paper]. Rivista di Politica Economica. 1997;87:769–789

work page 1909

[43] [43]

vctrs: Vector Helpers; 2023

Wickham H, Henry L, Vaughan D. vctrs: Vector Helpers; 2023. Available from: https://CRAN.R-project.org/package=vctrs

work page 2023

[44] [44]

withr: Run Code ‘With’ Temporarily Modified Global State; 2024

Hester J, Henry L, Müller K, Ushey K, Wickham H, Chang W. withr: Run Code ‘With’ Temporarily Modified Global State; 2024. Available from: https://CRAN.R-project.org/package=withr

work page 2024

[45] [45]

isoband: Generate Isolines and Isobands from Regularly Spaced Elevation Grids; 2022

Wickham H, Wilke CO, Pedersen TL. isoband: Generate Isolines and Isobands from Regularly Spaced Elevation Grids; 2022. Available from: https://CRAN.R-project.org/package=isoband

work page 2022

[46] [46]

newick; 2021

Schultz D, Ebbert M, De Coster W. newick; 2021. Available from: https://pypi.org/project/pauvre/

work page 2021

[47] [47]

Newick; 2025

Forkel R. Newick; 2025. Available from: https://pypi.org/project/newick/

work page 2025

[48] [48]

setuptools; 2025

Python Packaging Authority. setuptools; 2025. Available from: https://pypi.org/project/setuptools/

work page 2025

[49] [49]

Welcome to the tidyverse

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, et al. Welcome to the tidyverse. Journal of Open Source Software. 2019;4(43):1686. doi:10.21105/joss.01686

work page doi:10.21105/joss.01686 2019

[50] [50]

Velvet [Software]

Zerbino DR, Foret S, Gurney JM, Slater G, Birney E, Marshall J, et al. Velvet [Software]. Software Heritage. 2014

work page 2014

[51] [51]

Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs

Zerbino DR, Birney E. Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs. Genome Research. 2008;18(5):821–829. doi:10.1101/gr.074492.107

work page doi:10.1101/gr.074492.107 2008

[52] [52]

TopHat; 2012

The TopHat developers. TopHat; 2012. Available from: https://pypi.org/project/TopHat

work page 2012

[53] [53]

TopHat: Discovering Splice Junctions with RNA-Seq

Trapnell C, Pachter L, Salzberg SL. TopHat: Discovering Splice Junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–1111. doi:10.1093/bioinformatics/btp120.

work page doi:10.1093/bioinformatics/btp120 2009

[54] [54]

TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions. Genome Biology. 2013;14(4):R36. doi:10.1186/gb-2013-14-4-r36

work page doi:10.1186/gb-2013-14-4-r36 2013

[55] [55]

GraphPad prism, data analysis, and scientific graphing

Swift ML. GraphPad prism, data analysis, and scientific graphing. Journal of chemical information and computer sciences. 1997;37(2):411–412

work page 1997

[56] [56]

Gephi: An Open Source Software for Exploring and Manipulating Networks

Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. In: International AAAI Conference on Weblogs and Social Media. AAAI; 2009. p. 361–362. Available from: http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154

work page 2009

[57] [57]

An updated set of basic linear algebra subprograms (BLAS)

Blackford LS, Petitet A, Pozo R, Remington K, Whaley RC, Demmel J, et al. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software. 2002;28(2):135–151

work page 2002

[58] [58]

LAPACK users’ guide

Anderson E, Bai Z, Bischof C, Blackford LS, Demmel J, Dongarra J, et al. LAPACK users’ guide. SIAM; 1999

work page 1999

[59] [59]

The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative

Trujillo MZ, Hébert-Dufresne L, Bagrow J. The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative. EPJ Data Science. 2022;11(1):31

work page 2022

[60] [60]

Support scientific software infrastructure by requiring SBOMs for federally funded research; 2024

Howison J, Ram K. Support scientific software infrastructure by requiring SBOMs for federally funded research; 2024. Available from: https://fas.org/publication/sboms-hardware/.

work page 2024