The Reciprocal Impact of Science and Software: A Cross-Corpus Analysis of How Research Shapes Software and Software Enables Research
Pith reviewed 2026-06-29 01:43 UTC · model grok-4.3
The pith
The measured correlation between software reuse and scientific citations reverses sign depending on how the two are linked.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The two directions of influence illuminate different, complementary strata: literature reaches software mainly via a reproducibility and packaging layer and sequence-analysis tools, whereas software reaches science mainly via a largely invisible machine-learning and data-science infrastructure tier. The direct paper-names-software channel is too sparse to support ranking. Dependency reuse as a proxy shows at most weak coupling to citation count and stars. The reuse-citation correlation flips sign and statistical significance across two reasonable pairing methods, with n=137 yielding rho=0.05 (CI straddling zero) and n=1,067 yielding rho=0.13 (CI [0.07,0.19]).
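The two intervals quoted above can be sanity-checked with a short calculation. The sketch below uses the Fisher z-transform with the standard error sqrt(1.06 / (n - 3)), a common approximation for Spearman's rho. That choice is an assumption made here for illustration; the source does not say how the paper computed its intervals. The function name `spearman_fisher_ci` is hypothetical.

```python
import math


def spearman_fisher_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Spearman rho.

    Assumption: Fisher z-transform with standard error sqrt(1.06 / (n - 3)).
    This is an illustrative approximation, not necessarily the paper's method.
    """
    z = math.atanh(rho)                 # map rho to the z scale
    se = math.sqrt(1.06 / (n - 3))      # approximate standard error on the z scale
    lower = math.tanh(z - z_crit * se)  # map the bounds back to the rho scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Values reported in the abstract for the two pairing methods.
named_ci = spearman_fisher_ci(rho=0.05, n=137)       # papers that name the repository
declared_ci = spearman_fisher_ci(rho=0.13, n=1067)   # DOIs the repository declares

print("named papers :", tuple(round(b, 2) for b in named_ci))
print("declared DOIs:", tuple(round(b, 2) for b in declared_ci))
```

Under this approximation the smaller sample gives an interval that spans zero, and the larger sample gives one that rounds to the reported [0.07, 0.19]. The difference in confidence between the two pairings is therefore consistent with the difference in sample size alone.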
What carries the argument
A typed cross-corpus graph of 69.8M edges over eight relation types that links World of Code version histories to Semantic Scholar and OpenAlex records, anchored on 18,247 curated science repositories.
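A typed graph of this kind can be pictured as a list of edges, each carrying a relation label. The sketch below is a minimal in-memory illustration, not the paper's data model: the class names, node identifiers, and toy edges are all hypothetical. Only the six relation types named in the abstract are included, because the source does not list the remaining two of the eight.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # The six relation types named in the abstract. The source mentions
    # eight in total but does not list the other two, so they are omitted.
    PAPER_MENTIONS_SOFTWARE = "paper_mentions_software"
    SOFTWARE_CITES_PAPER = "software_cites_paper"
    SOFTWARE_DEPENDS_ON = "software_depends_on"
    AUTHORSHIP = "authorship"
    AFFILIATION = "affiliation"
    IDENTITY_BRIDGE = "identity_bridge"


@dataclass(frozen=True)
class Edge:
    src: str            # hypothetical node id, e.g. "repo:A" or "paper:P1"
    dst: str
    relation: Relation


def count_by_relation(edges: list[Edge]) -> Counter:
    """Count edges per relation type."""
    return Counter(edge.relation for edge in edges)


def targets(edges: list[Edge], src: str, relation: Relation) -> list[str]:
    """Return destinations reachable from `src` over one relation type."""
    return [e.dst for e in edges if e.src == src and e.relation is relation]


# Toy edges with placeholder identifiers (not real data).
toy_edges = [
    Edge("paper:P1", "repo:A", Relation.PAPER_MENTIONS_SOFTWARE),
    Edge("repo:A", "paper:P2", Relation.SOFTWARE_CITES_PAPER),
    Edge("repo:B", "repo:A", Relation.SOFTWARE_DEPENDS_ON),
    Edge("repo:C", "repo:A", Relation.SOFTWARE_DEPENDS_ON),
]

counts = count_by_relation(toy_edges)
declared = targets(toy_edges, "repo:A", Relation.SOFTWARE_CITES_PAPER)
print(counts[Relation.SOFTWARE_DEPENDS_ON], declared)
```

Keeping the relation as an explicit type on each edge is what allows the two pairing methods discussed in the core claim (papers that mention a repository versus papers a repository declares) to be queried separately from the same graph.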
If this is right
- Science shapes software most visibly through reproducibility frameworks and packaging systems rather than through direct algorithmic contributions.
- Software shapes science most visibly through data-science and machine-learning libraries that papers rarely name explicitly.
- Dependency reuse can stand in for the sparse direct-mention channel, but it is at most weakly coupled to citation counts and stars.
- Any headline claim about the strength or direction of science-software coupling must be tested against multiple pairing methods because the sign is not robust.
Where Pith is reading between the lines
- Impact studies that rely on a single linking rule should report results from at least one alternative rule to demonstrate stability.
- The observed sparsity of explicit mentions suggests that better named-entity recognition or mandatory software citation standards could change the measured strata.
- Separate metrics may be needed for the reproducibility layer and the machine-learning infrastructure layer rather than a single aggregate score.
Load-bearing premise
The typed linkages between papers and repositories, especially mentions and declared citations, are complete and unbiased enough to reveal the main strata of influence.
What would settle it
Recompute the reuse-citation Spearman correlations after adding a third independent matching rule, such as full-text search for repository names inside every paper, and check whether the sign stays the same across all three rules.
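The check described above can be sketched as follows. This is a minimal illustration using only the standard library: Spearman's rho is computed as the Pearson correlation of average ranks, the interval is a percentile bootstrap, and a result counts as sign-robust only if every rule's interval lies on the same side of zero. The rule names and the synthetic data are hypothetical, and the bootstrap is an assumed choice; the source does not state how the paper's intervals were obtained.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho over (reuse, citations) pairs."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        r = spearman([p[0] for p in sample], [p[1] for p in sample])
        if r == r:  # drop NaN resamples
            stats.append(r)
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lo, hi


def sign_is_robust(intervals):
    """True only if every rule's interval lies strictly on one side of zero."""
    bounds = list(intervals.values())
    return all(lo > 0 for lo, _ in bounds) or all(hi < 0 for _, hi in bounds)


# Synthetic placeholder data for three hypothetical pairing rules.
rng = random.Random(1)
rules = {}
for name, n in [("named_in_paper", 60), ("declared_doi", 200), ("full_text_search", 120)]:
    reuse = [rng.random() for _ in range(n)]
    rules[name] = [(r, 0.2 * r + rng.random()) for r in reuse]

intervals = {name: bootstrap_ci(pairs, n_boot=200) for name, pairs in rules.items()}
print(intervals, sign_is_robust(intervals))
```

Requiring the whole interval, not just the point estimate, to sit on one side of zero is a deliberately strict criterion. A looser variant could compare only the signs of the point estimates.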
Original abstract
Software and scientific knowledge co-evolve, yet they are catalogued in separate corpora that rarely speak to one another. We bridge them at global scale by linking World of Code (a near-complete mirror of public version-control history) to Semantic Scholar and OpenAlex through a typed cross-corpus graph of 69.8M edges over eight relation types (paper-to-software mentions, software-to-paper citations, software dependencies, authorship, affiliation, and identity bridges). Anchoring on 18,247 curated science repositories, we ask two reciprocal questions: what is the impact of science on software, and of software on science? To test whether this Science-Software Supply Chain (S3C) view is feasible, we run basic investigations rather than claim a definitive measurement. The two directions appear to illuminate different, complementary strata: the literature's reach into software is dominated by a reproducibility and packaging layer (nf-core, Nextflow, Bioconda) and sequence-analysis tools, whereas software's reach back into science is proxied by a largely invisible machine-learning and data-science infrastructure tier (PyTorch, seaborn, NLTK). The direct paper-names-software channel is too sparse to rank: a human-curated gold benchmark links none of its 65 in-scope cases. Dependency reuse stands in as a proxy and is at most weakly coupled to citation count and to stars (Spearman rho=0.36). Our most cautionary finding is about measurement itself: the reuse-citation coupling flips sign and confidence across two reasonable ways of pairing a repository with a citation count, through papers that name it (n=137, rho=0.05, CI straddling zero) versus DOIs a repository declares for itself (n=1,067, rho=0.13, CI [0.07,0.19]). With linkage this sparse, the sign of a headline correlation depends on which gap one tolerates, so we report both and refrain from a strong decoupling claim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs a typed cross-corpus graph of 69.8M edges linking World of Code to Semantic Scholar and OpenAlex across eight relation types. Anchoring on 18,247 curated science repositories, it performs basic exploratory investigations into reciprocal impacts rather than definitive measurements. It reports that science-to-software influence is dominated by reproducibility/packaging layers (nf-core, Nextflow, Bioconda) and sequence-analysis tools, while software-to-science influence is proxied by ML/data-science infrastructure (PyTorch, seaborn, NLTK). Direct paper-to-software mentions are sparse (zero matches in a human-curated 65-case gold benchmark), dependency reuse is at most weakly coupled to citations/stars (rho=0.36), and the reuse-citation correlation flips sign and confidence depending on pairing method (named papers: n=137, rho=0.05, CI straddling zero; declared DOIs: n=1,067, rho=0.13, CI [0.07,0.19]).
Significance. If the linkages are representative within the acknowledged sparsity, the work demonstrates the feasibility of large-scale cross-corpus analysis for science-software co-evolution and underscores measurement sensitivity in such settings. Explicit strengths include the human-curated gold benchmark, direct reporting of both pairing methods with their differing outcomes and CIs, and consistent framing as basic investigations rather than strong claims.
minor comments (2)
- [Abstract] The sample sizes n=137 and n=1,067 for the two correlation analyses are reported without a brief description of how the subsets were extracted from the full 18,247-repository anchor set; adding one sentence would improve reproducibility of the comparison.
- The manuscript could add a short dedicated limitations subsection (perhaps after the methods) that consolidates the acknowledged sparsity of direct linkages and the proxy nature of dependency reuse, even though these points are already stated in the abstract.
Simulated Author's Rebuttal
We thank the referee for the careful reading and positive assessment of the manuscript. The recommendation for minor revision is noted; we will address any editorial or presentational suggestions in the revised version.
Circularity Check
No significant circularity detected
The paper presents an exploratory analysis based on direct empirical counts, Spearman correlations, and a typed cross-corpus graph constructed from external sources (World of Code, Semantic Scholar, OpenAlex). No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. All reported findings (strata of influence, correlation comparisons, sparsity observations) rest on the constructed linkages and external data without reduction to inputs by construction or load-bearing self-citation chains. The work explicitly frames itself as feasibility tests and reports both pairing methods with their differing results, confirming the derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 18,247 curated science repositories and the typed linkages provide a representative enough sample to identify the dominant strata of science-software influence.
Table 8 (excerpt): Threats to validity and mitigations.
- Threat: Construct ("science software"). Description: The SciCat seed is one LLM-classified operationalization from a sampled crawl; flagship repositories can be absent (e.g. the E3SM model; only an I/O component is present). Mitigation / residual risk: Seed is curated and fiel...