Operationalizing Research Software for Supply Chain Security
Pith reviewed 2026-05-16 10:22 UTC · model grok-4.3
The pith
A harmonized taxonomy for research software defines consistent boundaries so security measurements can be compared across studies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an RSSC-oriented taxonomy, built by mapping prior definitions and heuristics into shared dimensions of scope and operational boundaries, is required to interpret repository-centric security signals correctly. Applying the taxonomy to the Research Software Encyclopedia corpus produces an annotated dataset whose clusters show distinct security profiles under a preliminary OpenSSF Scorecard analysis, which the paper offers as evidence that stratification by taxonomy category is necessary for valid RSSC security assessments.
What carries the argument
The RSSC-oriented taxonomy, which harmonizes definitions, inclusion criteria, units of analysis, and identification heuristics from existing studies into explicit dimensions that mark the boundaries of research software.
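One way to picture the mapping is as one record per prior study carrying the four extracted fields, plus a placement onto shared dimensions. The sketch below is illustrative only: the class names, the dimension names ("distribution", "role"), and the three sample studies are hypothetical, and the category labels are borrowed from the appendix fragments visible on this page rather than from the paper's full codebook.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StudyOperationalization:
    """What a scoping review extracts from one prior study (hypothetical schema)."""
    study_id: str
    definition: str
    inclusion_criteria: tuple[str, ...]
    unit_of_analysis: str
    identification_heuristics: tuple[str, ...]


@dataclass
class TaxonomyPlacement:
    """A study mapped onto shared dimensions: dimension name -> category."""
    study: StudyOperationalization
    dimensions: dict[str, str] = field(default_factory=dict)


def comparable_groups(placements, dimension):
    """Group study ids by their category on one dimension."""
    groups = defaultdict(list)
    for placement in placements:
        # Studies that never specified this dimension stay visible as a group.
        category = placement.dimensions.get(dimension, "unspecified")
        groups[category].append(placement.study.study_id)
    return dict(groups)


# Invented studies, for illustration only.
study_a = StudyOperationalization(
    "study-A", "software cited in publications", ("has a DOI",),
    "repository", ("citation file present",))
study_b = StudyOperationalization(
    "study-B", "software produced by funded projects", ("grant linked",),
    "repository", ("funder keyword match",))
study_c = StudyOperationalization(
    "study-C", "packages used in analyses", ("listed in a registry",),
    "package", ("registry metadata",))

placements = [
    TaxonomyPlacement(study_a, {"distribution": "package registry",
                                "role": "Foundation for research"}),
    TaxonomyPlacement(study_b, {"distribution": "releases"}),
    TaxonomyPlacement(study_c, {"distribution": "package registry",
                                "role": "Software as research object"}),
]

groups = comparable_groups(placements, "distribution")
print(groups)
```

Studies that land in the same group share a boundary on that dimension, which is what makes their measurements comparable; the "unspecified" group flags studies whose scope would have to be inferred.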
If this is right
- Security assessments of research software must be stratified by taxonomy category before conclusions are drawn.
- Earlier studies can be retroactively mapped to the taxonomy dimensions for re-analysis and comparison.
- The reproducible labeling pipeline and codebook allow new datasets to be annotated consistently.
- OpenSSF Scorecard outputs become interpretable only after assignment to taxonomy clusters.
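A minimal sketch of what stratification means in practice, assuming each repository has already been assigned a taxonomy cluster and an aggregate Scorecard score (Scorecard reports an aggregate on a 0 to 10 scale). The scores below are invented for illustration and are not results from the paper; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_means(records):
    """Return the mean score per cluster for (cluster, score) pairs."""
    by_cluster = defaultdict(list)
    for cluster, score in records:
        by_cluster[cluster].append(score)
    return {cluster: mean(scores) for cluster, scores in by_cluster.items()}


# Synthetic (cluster, aggregate score) pairs, invented for this example.
records = [
    ("package registry", 7.0),
    ("package registry", 8.0),
    ("installer/binary", 3.0),
    ("installer/binary", 4.0),
    ("network service", 5.0),
]

pooled = mean(score for _, score in records)   # one number for the whole corpus
per_cluster = stratified_means(records)        # one number per taxonomy cluster

print(f"pooled mean: {pooled:.2f}")
for cluster, value in per_cluster.items():
    print(f"{cluster}: {value:.2f}")
```

The pooled mean describes none of the three groups well, which is the sense in which an unstratified figure can mislead.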
Where Pith is reading between the lines
- The same taxonomy approach could be tested on general open-source software to check whether supply-chain security signals also vary by category outside research contexts.
- Policy efforts to secure software supply chains could adopt the taxonomy dimensions to decide which repositories require different levels of scrutiny.
- Longitudinal tracking of security scores within each taxonomy cluster could reveal whether certain categories improve or degrade over time.
Load-bearing premise
The targeted scoping review of recent repository mining and dataset studies has captured enough existing operationalizations to produce a comprehensive and stable taxonomy.
What would settle it
Re-run the security analysis on the same corpus after randomly reassigning repositories to taxonomy clusters. If differences of similar size still appear under random assignment, the taxonomy does not capture meaningful distinctions; if they disappear, the observed differences are tied to the taxonomy categories.
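The random-reassignment check amounts to a permutation test. The sketch below assumes a simple test statistic (the gap between the highest and lowest cluster mean) and uses synthetic scores; the function names, cluster names, and data are hypothetical, and the paper does not state that it ran this test.

```python
import random
from collections import defaultdict
from statistics import mean


def spread_of_cluster_means(labels, scores):
    """Gap between the highest and lowest per-cluster mean score."""
    by_cluster = defaultdict(list)
    for label, score in zip(labels, scores):
        by_cluster[label].append(score)
    means = [mean(values) for values in by_cluster.values()]
    return max(means) - min(means)


def permutation_p_value(labels, scores, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels match or exceed the observed spread."""
    rng = random.Random(seed)          # fixed seed keeps the run reproducible
    observed = spread_of_cluster_means(labels, scores)
    shuffled = list(labels)            # shuffling keeps cluster sizes fixed
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if spread_of_cluster_means(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Synthetic data: two clusters with clearly different score levels.
labels = ["cluster-1"] * 6 + ["cluster-2"] * 6
scores = [8.0, 7.0, 9.0, 8.0, 7.0, 8.0, 3.0, 4.0, 2.0, 3.0, 4.0, 3.0]

observed, p_value = permutation_p_value(labels, scores)
print(f"observed spread: {observed:.2f}, permutation p-value: {p_value:.4f}")
```

A small p-value means shuffled labels rarely reproduce the observed spread, so the real assignment carries information. A large one means random clusters look just as different, which would count against the taxonomy.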
Original abstract
Empirical studies of research software are hard to compare because the literature operationalizes "research software" inconsistently. Motivated by the research software supply chain (RSSC) and its security risks, we introduce an RSSC-oriented taxonomy that makes scope and operational boundaries explicit for empirical research software security studies. We conduct a targeted scoping review of recent repository mining and dataset construction studies, extracting each work's definition, inclusion criteria, unit of analysis, and identification heuristics. We synthesize these into a harmonized taxonomy and a mapping that translates prior approaches into shared taxonomy dimensions. We operationalize the taxonomy on a large community-curated corpus from the Research Software Encyclopedia (RSE), producing an annotated dataset, a labeling codebook, and a reproducible labeling pipeline. Finally, we apply OpenSSF Scorecard as a preliminary security analysis to show how repository-centric security signals differ across taxonomy-defined clusters and why taxonomy-aware stratification is necessary for interpreting RSSC security measurements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that inconsistent operationalizations of 'research software' in the literature hinder comparable empirical studies of research software supply chain (RSSC) security. It conducts a targeted scoping review of recent repository mining and dataset construction studies to extract definitions, criteria, units of analysis, and heuristics; synthesizes these into a harmonized RSSC-oriented taxonomy with a mapping to prior works; operationalizes the taxonomy via a reproducible labeling pipeline on the Research Software Encyclopedia (RSE) corpus to produce an annotated dataset and codebook; and applies OpenSSF Scorecard as a preliminary analysis showing that repository-centric security signals differ across taxonomy-defined clusters, thereby arguing that taxonomy-aware stratification is necessary for interpreting RSSC security measurements.
Significance. If the central claim holds, the work supplies a concrete, reusable framework and artifacts (taxonomy, mapping, annotated RSE dataset, labeling pipeline) that directly address the lack of comparability in RSSC security research. The scoping-review synthesis and independent OpenSSF application on an external corpus provide a reproducible foundation for future stratified analyses; the demonstration of cluster differences supplies an existence proof that unstratified repository-centric metrics can be misleading.
minor comments (2)
- Abstract: the phrasing 'taxonomy-aware stratification is necessary' is stronger than the illustrative nature of the OpenSSF results; consider softening to 'supported by preliminary evidence from' or similar to match the framing in §4.
- The scoping-review protocol (search strings, inclusion dates, number of papers screened) is described at a high level; adding a brief PRISMA-style flow diagram or explicit counts in §2 would improve traceability without altering the synthesis.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation to accept the manuscript. The referee's summary accurately captures the contributions of the harmonized RSSC-oriented taxonomy, the mapping to prior studies, the annotated RSE dataset with labeling pipeline, and the preliminary OpenSSF Scorecard analysis demonstrating the value of taxonomy-aware stratification.
Circularity Check
No significant circularity identified
The paper's derivation begins with a targeted scoping review of external prior literature on repository mining and dataset construction studies, from which definitions, inclusion criteria, units of analysis, and heuristics are extracted and synthesized into a harmonized taxonomy. This taxonomy is then operationalized via a reproducible labeling pipeline on the independent RSE corpus, followed by application of the external OpenSSF Scorecard tool to illustrate cluster differences. No step reduces by construction to the paper's own inputs: the taxonomy is not self-defined, no parameters are fitted and relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central claim that taxonomy-aware stratification is necessary rests on observable differences in the external data rather than tautological re-expression of the synthesis process.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A targeted scoping review of recent repository mining and dataset construction studies captures the essential variation in prior operationalizations of research software.
Reference graph
Works this paper leans on
-
[1]
3GIMBALS. 2025. How Foreign Threats to U.S. Academic and Research Institutions Undermine National Security. https://3gimbals.com/insights/how-foreign-threats-to-u-s-academic-and-research-institutions-undermine-national-security/. Accessed: 2026-01-26
-
[2]
Eva Maxfield Brown, Lindsey Schwartz, Richard Lewei Huang, and Nicholas Weber. 2024. Soft-Search: Two Datasets to Study the Identification and Production of Research Software. In Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’23). doi:10.1109/JCDL57899.2023.00040
-
[3]
Neil P. Chue Hong, Daniel S. Katz, Michelle Barker, Anna-Lena Lamprecht, Carlos Martinez, Françoise , ..., others, and RDA FAIR4RS WG. 2022. FAIR Principles for Research Software (FAIR4RS Principles). doi:10.15497/RDA00068
-
[4]
Zadia Codabux, Melina Vidoni, and Fatemeh H. Fard. 2021. Technical Debt in the Peer-Review Documentation of R Packages: a rOpenSci Case Study. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). 195–206. doi:10.1109/MSR52588.2021.00032 ISSN: 2574-3864
-
[5]
Codecov Security Team. 2021. Bash Uploader Security Update. https://tinyurl.com/5d8bd4je Accessed: 2025-05-15
-
[6]
ESLint Team. 2018. Postmortem for Malicious Packages Published on July 12th, 2018. https://eslint.org/blog/2018/07/postmortem-for-malicious-package-publishes/ Accessed: 2025-05-15
-
[7]
Thomas Green et al. 2025. Evaluation of the Nvidia Grace Superchip in the HPE/Cray XD Isambard 3 supercomputer. In Proceedings of the Cray User Group (CUG ’25). doi:10.1145/3757348.3757359
-
[8]
Morane Gruenpeter et al. 2021. Defining Research Software: a controversial discussion. doi:10.5281/zenodo.5504016
-
[9]
Evan Harvey et al. 2022. Half-Precision Scalar Support in Kokkos and Kokkos Kernels: An Engineering Study and Experience Report. In 2022 IEEE 18th International Conference on e-Science (e-Science). doi:10.1109/eScience55777.2022.00095
-
[10]
László Horváth. 2020. Research Configuration of Engineering Modeling Platform. In 2020 IEEE 14th International Symposium on Applied Computational Intelligence and Informatics (SACI). 000261–000266. doi:10.1109/SACI49304.2020.9118812
-
[11]
JASON. 2023. Research Program on Research Security. Technical Report JSR-22-08. The MITRE Corporation, 7515 Colshire Drive, McLean, Virginia 22102-7508. Contact: Gordon Long, glong@mitre.org. Distribution A. Approved for public release. Distribution is unlimited
- [12]
-
[13]
Kelechi G. Kalu, Sofia Okorafor, Betül Durak, Kim Laine, Radames C. Moreno, Santiago Torres-Arias, and James C. Davis. 2026. ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain. arXiv:2505.18760 [cs.CR] https://arxiv.org/abs/2505.18760
-
[14]
Kelechi G Kalu, Tanmay Singla, Chinenye Okafor, Santiago Torres-Arias, and James C Davis. 2025. An Industry Interview Study of Software Signing for Supply Chain Security. In 34th USENIX Security Symposium (USENIX Security 25)
-
[15]
Pranjay Kumar, Davin Ie, and Melina Vidoni. 2022. On the developers’ attitude towards CRAN checks. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (ICPC ’22). doi:10.1145/3524610.3528389
-
[16]
Justin Murphy et al. 2020. A curated dataset of security defects in scientific software projects. In Proceedings of the 7th Symposium on Hot Topics in the Science of Security (HotSoS ’20). New York, NY, USA, 1–2. doi:10.1145/3384217.3384218
-
[17]
National Academies of Sciences, Engineering, and Medicine. 2025. Assessing Research Security Efforts in Higher Education: Proceedings of a Workshop. The National Academies Press. doi:10.17226/29241 Accessed: 2026-01-26
-
[18]
Chinenye Okafor, Taylor R. Schorlemmer, Santiago Torres-Arias, and James C. Davis. 2022. SoK: Analysis of Software Supply Chain Security by Establishing Secure Design Properties. In Proceedings of the 2022 ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED’22)
-
[19]
Open Source Security Foundation (OpenSSF). 2026. OpenSSF Scorecard. https://scorecard.dev/
-
[20]
Enrique Orduña-Malea and Rodrigo Costas. 2021. Link-based approach to study scientific software usage: The case of VOSviewer. Scientometrics 126, 9 (2021), 8153–8186
-
[21]
Hyoungjoo Park and Dietmar Wolfram. 2019. Research software citation in the Data Citation Index: Current practices and implications for research software sharing and reuse. Journal of Informetrics 13, 2 (2019), 574–582
-
[22]
Zedong Peng, Upulee Kanewala, and Nan Niu. 2021. Contextual Understanding and Improvement of Metamorphic Testing in Scientific Software Development. In Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). doi:10.1145/3475716.3484188
-
[23]
Henrique Estanislau Maldonado Peres et al. 2025. Revisiting Impedance Spectroscopy: A Didactic Virtual Instrument for Modeling and Analysing Nanomaterials in Gas Sensors. In 2025 39th Symposium on Microelectronics Technology and Devices (SBMicro). doi:10.1109/SBMicro66945.2025.11197761
-
[24]
Purdue Duality Lab. 2026. CROSS (GitHub repository). https://github.com/PurdueDualityLab/CROSS. Accessed: 2026-01-26
-
[25]
Barry Smith et al. 2025. AI Assistants to Enhance and Exploit the PETSc Knowledge Base. In Proceedings of the International Conference on Parallel Processing (ICPP Workshops ’25). doi:10.1145/3750720.3757281
-
[26]
Vanessa Sochat, Nicholas May, Ian Cosden, Carlos Martinez-Ortiz, and Sadie Bartholomew. 2022. The Research Software Encyclopedia: A Community Framework to Define Research Software. Journal of Open Research Software 10, 1 (March 2022), 2. doi:10.5334/jors.359
-
[27]
Jiayi Sun, Aarya Patil, Youhai Li, Jin L.C. Guo, and Shurui Zhou. 2025. Collaboration Challenges and Opportunities in Developing Scientific Open-Source Software Ecosystem: A Case Study on Astropy. Proc. ACM Hum.-Comput. Interact. 9, 7 (Oct. 2025), CSCW281:1–CSCW281:33. doi:10.1145/3757462
-
[28]
Synopsys. 2023. 2023 Open Source Security and Risk Analysis (OSSRA) Report. https://www.synopsys.com/software-integrity/engage/ossra/rep-ossra-2023-pdf
-
[29]
Justin Z Tam et al. 2023. A Containerization Framework for Bioinformatics Software to Advance Scalability, Portability, and Maintainability. In Proceedings of the ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’23). doi:10.1145/3584371.3612948
-
[30]
Addi Malviya Thakur, Reed Milewicz, Mahmoud Jahanshahi, Lavínia Paganini, Bogdan Vasilescu, and Audris Mockus. 2025. Scientific Open-Source Software Is Less Likely to Become Abandoned Than One Might Think! Lessons from Curating a Catalog of Maintained Scientific Software. Proc. ACM Softw. Eng. 2, FSE (2025), FSE099:2216–FSE099:2239. doi:10.1145/3729369
-
[31]
The Apache Software Foundation. 2026. apache (GitHub organization). https://github.com/apache. Accessed: 2026-01-25
-
[32]
The Apache Software Foundation. 2026. Apache Projects Directory: Foundation Projects (projects.json). https://projects.apache.org/json/foundation/projects.json. Accessed: 2026-01-25
-
[33]
The White House. 2021. Presidential Memorandum on United States Government-Supported Research and Development National Security Policy (NSPM-33). https://trumpwhitehouse.archives.gov/presidential-actions/presidential-memorandum-united-states-government-supported-research-development-national-security-policy/. Issued: January 14, 2021. Accessed: 2026-01-26
-
[34]
Christos Tsigkanos, Pooja Rani, Sebastian Müller, and Timo Kehrer. 2023. Large Language Models: The Next Frontier for Variable Discovery within Metamorphic Testing?. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). doi:10.1109/SANER56733.2023.00070
-
[35]
Cristian Urlea et al. 2025. Bridging Disciplinary Gaps in Climate Research through Programming Accessibility and Interdisciplinary Collaboration. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Programming for the Planet (PROPL ’25). doi:10.1145/3759536.3763804
-
[36]
Lynn von Kurnatowski et al. 2020. Scientific Software Engineering: Mining Repositories to gain insights into BACARDI. In 2020 IEEE Aerospace Conference. doi:10.1109/AERO47225.2020.9172261
-
[37]
Marcus Willett. 2023. Lessons of the SolarWinds hack. In Survival April–May 2021: Facing Russia. Routledge, 7–25
-
[38]
Laurie Williams et al. 2025. Research directions in software supply chain security. ACM Transactions on Software Engineering and Methodology34, 5 (2025)
-
[39]
Nusrat Zahan, Parth Kanakiya, Brian Hambleton, Shohanuzzaman Shohan, and Laurie Williams. 2023. Openssf scorecard: On the path toward ecosystem-wide automated security metrics. IEEE Security & Privacy (2023)
-
[40]
Zixiao Zhao and Fatemeh Fard. 2025. Do Current Language Models Support Code Intelligence for R Programming Language? ACM Trans. Softw. Eng. Methodol. 34, 8 (Oct. 2025), 240:1–240:39. doi:10.1145/3735635
A Search sources and time window. We query IEEE Xplore and ACM Digital Library. We restrict the search window to 2020-2025 to focus on recent empirical prac...
-
[41]
(1)
installer/binary: Distributed as downloadable binaries or installers. [35]; [23] (2)
network service: Delivered as a hosted service or API rather than a packaged artifact. [7]; [36]; +2 more (4)
package registry: Distributed via a language package registry (e.g., CRAN, PyPI, Maven, npm). [15]; [4] (2)
releases: Distributed via tagged releases and published...
-
[42]
(1)
Software as research object: Software itself is the object of study (e.g., mining, quality, longevity, practices, or ecosystem dynamics). [2, 4]; +9 more (11)
Foundation for research: Reusable foundations and infrastructure for research (e.g., libraries, platforms, build/distribution/dev tooling). [7, 9]; +4 more (6)