Biomedical Open Source Software: Crucial Packages and Hidden Heroes
Pith reviewed 2026-05-24 02:23 UTC · model grok-4.3
The pith
Centrality metrics on software dependency networks identify the foundational packages biomedical research depends on most.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the CZ Software Mentions Dataset, the paper builds directed dependency graphs for packages drawn from PyPI, CRAN, and Bioconductor that appear in biomedical literature, then computes centrality scores on those graphs; the packages that receive the highest scores are presented as the critical, often invisible, components of the biomedical software ecosystem.
What carries the argument
Centrality metrics computed on the directed graph whose nodes are software packages and whose edges point from a package to its upstream dependencies.
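A minimal sketch of this kind of computation, assuming a toy dependency map and two standard metrics (in-degree and PageRank by power iteration). The package names, the `DEPENDS_ON` data, and the choice of metrics are illustrative assumptions; the summary above does not say which centrality definitions the paper actually uses.

```python
# Hypothetical toy data: each package maps to the upstream packages it declares.
# Edges point from a package to its dependencies, as described above.
DEPENDS_ON = {
    "scanpy": ["numpy", "scipy", "pandas"],
    "pandas": ["numpy"],
    "scipy": ["numpy"],
    "numpy": [],
}


def in_degree(depends_on):
    """Count how many packages directly depend on each package."""
    counts = {pkg: 0 for pkg in depends_on}
    for deps in depends_on.values():
        for dep in deps:
            counts[dep] = counts.get(dep, 0) + 1
    return counts


def pagerank(depends_on, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; score flows from a package to its dependencies."""
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Packages with no dependencies spread their score uniformly over all nodes.
        dangling = sum(rank[node] for node in nodes if not depends_on.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            deps = depends_on.get(node, [])
            for dep in deps:
                new_rank[dep] += damping * rank[node] / len(deps)
        rank = new_rank
    return rank


degrees = in_degree(DEPENDS_ON)
scores = pagerank(DEPENDS_ON)
top_package = max(scores, key=scores.get)
```

In this toy graph the package that everything else builds on collects the most score, even though a paper would more likely name the user-facing package at the top of the stack.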
If this is right
- High-centrality packages can be flagged for priority maintenance and funding because their removal would affect the largest number of research workflows.
- The same network construction can be repeated on other scientific domains to locate their own hidden foundational packages.
- Metrics that combine direct mentions with indirect dependency reach can replace simple citation counts when evaluating software impact.
- Ecosystem maintainers gain a quantitative way to decide which packages deserve dedicated support staff or long-term archiving.
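The idea that removing a package affects many workflows can be sketched as a count of transitive dependents. The snippet below is an illustration under assumed data, not the paper's metric: `DEPENDS_ON` and the package names are made up, and the walk follows indirect dependents, which may differ from the authors' own graph construction.

```python
from collections import deque

# Hypothetical toy data: package -> packages it directly depends on.
DEPENDS_ON = {
    "workflow_a": ["plotlib", "stats"],
    "workflow_b": ["stats"],
    "plotlib": ["arrays"],
    "stats": ["arrays"],
    "arrays": [],
}


def reverse_edges(depends_on):
    """Map each package to the set of packages that directly depend on it."""
    dependents = {pkg: set() for pkg in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)
    return dependents


def downstream_reach(depends_on, package):
    """Return every package that depends on `package`, directly or indirectly."""
    dependents = reverse_edges(depends_on)
    seen = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(package)  # guards against cycles that lead back to the start
    return seen


reach = {pkg: len(downstream_reach(DEPENDS_ON, pkg)) for pkg in DEPENDS_ON}
```

Sorting `reach` in descending order gives a simple priority list: packages at the top would break the most dependents if they disappeared.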
Where Pith is reading between the lines
- If centrality rankings remain stable across successive yearly snapshots of the dataset, they could serve as an early-warning system for packages drifting into critical status.
- The method could be extended by weighting edges according to how often a dependency is actually invoked in code, rather than treating every declared dependency equally.
- Cross-ecosystem comparison might reveal whether one language community (Python versus R) concentrates risk in fewer foundational packages than the other.
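One way to make the cross-ecosystem comparison concrete is a concentration index over each ecosystem's centrality scores. The sketch below uses the Gini coefficient as an assumed choice; the score lists are invented, and this summary does not state how the paper measures concentration.

```python
def gini(values):
    """Gini coefficient of non-negative scores: 0 means evenly spread, near 1 means concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over values sorted in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


# Hypothetical centrality scores for two ecosystems (not from the paper).
python_scores = [0.70, 0.10, 0.10, 0.10]
r_scores = [0.25, 0.25, 0.25, 0.25]

python_gini = gini(python_scores)
r_gini = gini(r_scores)
```

A higher value for one ecosystem would suggest that its risk sits in fewer foundational packages, subject to the two graphs being built the same way.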
Load-bearing premise
The CZ Software Mentions Dataset supplies a representative sample of the packages and dependency links actually used in biomedical papers.
What would settle it
A fresh, independent extraction of software mentions from a new corpus of biomedical papers that produces a materially different top-ranked set of packages by the same centrality measures.
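Whether two rankings are "materially different" needs a number. A simple option, sketched here under assumed data, is the Jaccard overlap of the top-k sets from the original and the replication. The scores, the value of k, and any threshold for calling the overlap too low are illustrative choices, not taken from the paper.

```python
def top_k(scores, k):
    """Return the k highest-scoring package names; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {name for name, _ in ranked[:k]}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical centrality scores from the original corpus and a fresh one.
original = {"numpy": 0.40, "scipy": 0.25, "pandas": 0.20, "requests": 0.15}
replication = {"numpy": 0.38, "pandas": 0.27, "matplotlib": 0.20, "scipy": 0.15}

overlap = jaccard(top_k(original, 3), top_k(replication, 3))
```

An overlap near 1 would support the representativeness premise, while a low overlap at small k would be the kind of result that settles the question against it.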
Abstract
Despite the importance of scientific software for research, it is often not formally recognized and rewarded. This is especially true for foundational libraries, which are hidden below packages visible to the users (and thus doubly hidden, since even the packages directly used in research are frequently not visible in the paper). Research stakeholders like funders, infrastructure providers, and other organizations need to understand the complex network of computer programs that contemporary research relies upon. In this work, we use the CZ Software Mentions Dataset to map the upstream dependencies of software used in biomedical papers and find the packages critical to scientific software ecosystems. We propose centrality metrics for the network of software dependencies, analyze three ecosystems (PyPI, CRAN, Bioconductor), and determine the packages with the highest centrality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the CZ Software Mentions Dataset can be used to map upstream dependencies of software mentioned in biomedical papers, that centrality metrics can be defined on the resulting dependency networks, and that analysis of the PyPI, CRAN, and Bioconductor ecosystems reveals the packages with highest centrality that are critical yet hidden in scientific software stacks.
Significance. If the dataset is shown to be representative and the centrality definitions are made explicit and reproducible, the work could help funders and infrastructure providers identify foundational packages that merit greater recognition. The multi-ecosystem scope is a constructive feature. Grounding the analysis in an external mention dataset is a sound, data-driven choice.
major comments (2)
- [Data and Methods] Data section: no coverage statistics, comparison against an independent corpus (e.g., PubMed Central full-text), or bias analysis is supplied to establish that the CZ Software Mentions Dataset supplies a representative sample of packages and dependency relations actually invoked in biomedical papers. This assumption is load-bearing for the upstream-dependency mapping and all subsequent centrality rankings.
- [Methods] Centrality definition: the manuscript does not supply explicit formulas or pseudocode for the proposed centrality metrics on the dependency graphs, nor does it report how the graphs are constructed from mentions (e.g., edge-weighting, handling of transitive dependencies). Without these, the claim that the highest-centrality packages are the “critical” ones cannot be evaluated.
minor comments (1)
- [Abstract] Abstract: the three ecosystems are named but the scale of the extracted networks (number of nodes/edges) is not stated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where additional transparency and validation will strengthen the manuscript. We agree that both the representativeness of the dataset and the explicit definition of centrality metrics require elaboration. We will revise the manuscript to incorporate these elements as detailed below.
- Referee: [Data and Methods] Data section: no coverage statistics, comparison against an independent corpus (e.g., PubMed Central full-text), or bias analysis is supplied to establish that the CZ Software Mentions Dataset supplies a representative sample of packages and dependency relations actually invoked in biomedical papers. This assumption is load-bearing for the upstream-dependency mapping and all subsequent centrality rankings.
  Authors: We agree that the manuscript should demonstrate the representativeness of the CZ Software Mentions Dataset. The current version relies on the dataset without providing coverage statistics or bias analysis. In the revision we will add a dedicated subsection to the Data section that reports: the total number of papers and unique packages extracted; basic coverage metrics such as the fraction of biomedical papers containing software mentions; a comparison against a random sample of PubMed Central full-text articles (reporting overlap in mentioned packages); and a brief discussion of potential biases (e.g., field or ecosystem skew). These additions will directly support the validity of the downstream dependency mapping and centrality results.
  Revision: yes
- Referee: [Methods] Centrality definition: the manuscript does not supply explicit formulas or pseudocode for the proposed centrality metrics on the dependency graphs, nor does it report how the graphs are constructed from mentions (e.g., edge-weighting, handling of transitive dependencies). Without these, the claim that the highest-centrality packages are the “critical” ones cannot be evaluated.
  Authors: We acknowledge that the manuscript describes the centrality metrics at a conceptual level but omits explicit formulas, pseudocode, and graph-construction details. In the revised Methods section we will insert: (i) the precise mathematical definitions of the centrality measures applied to the directed dependency graphs (including any adaptations of standard metrics such as degree or betweenness); (ii) pseudocode outlining the graph-construction procedure from the mention data; (iii) the chosen edge-weighting scheme (mention frequency); and (iv) the decision to use direct dependencies only, with a short justification for not computing transitive closures. These additions will make the “critical package” identification fully reproducible and evaluable.
  Revision: yes
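The construction promised in points (ii) to (iv) could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' procedure: the `MENTIONS` and `DIRECT_DEPS` data are invented, each edge is weighted by how often the depending package is mentioned, only direct dependencies are used, and the score shown is a weighted in-degree.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: (paper_id, package) mention pairs and declared direct dependencies.
MENTIONS = [
    ("paper1", "seurat"),
    ("paper2", "seurat"),
    ("paper2", "deseq2"),
    ("paper3", "deseq2"),
    ("paper3", "seurat"),
]
DIRECT_DEPS = {
    "seurat": ["ggplot2", "matrix"],
    "deseq2": ["ggplot2"],
}


def build_weighted_edges(mentions, direct_deps):
    """Create (package, dependency) edges weighted by the package's mention count."""
    mention_counts = Counter(pkg for _, pkg in mentions)
    edges = {}
    for pkg, count in mention_counts.items():
        # Direct dependencies only; no transitive closure is computed.
        for dep in direct_deps.get(pkg, []):
            edges[(pkg, dep)] = count
    return edges


def weighted_in_degree(edges):
    """Sum incoming edge weights for each dependency."""
    scores = defaultdict(int)
    for (_, dep), weight in edges.items():
        scores[dep] += weight
    return dict(scores)


edges = build_weighted_edges(MENTIONS, DIRECT_DEPS)
scores = weighted_in_degree(edges)
```

With this weighting, a dependency shared by two frequently mentioned packages outranks one used by a single package, which is the effect mention-frequency weights are meant to capture.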
Circularity Check
No circularity; empirical mapping from external dataset using standard network metrics
The paper constructs dependency networks from the CZ Software Mentions Dataset (an external resource) and applies standard centrality metrics to rank packages in PyPI, CRAN, and Bioconductor. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided abstract or described approach. Results are presented as direct extractions and rankings from the input data rather than derivations that reduce to the paper's own definitions or prior outputs by construction. The analysis remains self-contained against the external dataset benchmark.