On the Security of Research Artifacts
Pith reviewed 2026-05-08 08:56 UTC · model grok-4.3
The pith
Many shared research artifacts contain insecure code patterns that can create practical security risks when reused.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Static analysis of 509 research artifacts reveals that 41.60% of prevalent findings may pose security concerns under practical usage, and the SAFE framework distinguishes security from non-security risks at 84.80% accuracy and 84.63% F1-score by weighing code semantics, execution context, and practical exploitability.
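For readers less familiar with the two reported metrics, the sketch below shows how accuracy and F1 are conventionally computed for a binary security versus non-security classifier. The confusion-matrix counts are hypothetical and are not taken from the paper. The page also does not say which F1 averaging the authors used, so the single-class (binary) F1 shown here is an assumption.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all findings that were labeled correctly."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no findings to score")
    return (tp + tn) / total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (security) class."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no positive labels and no positive predictions
    return 2 * tp / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    tp, fp, fn, tn = 40, 10, 10, 40
    print(f"accuracy = {accuracy(tp, fp, fn, tn):.2%}")
    print(f"F1       = {f1_score(tp, fp, fn):.2%}")
```

Both metrics are only as trustworthy as the labels they are scored against, which is the concern the referee report raises about the shared manual ground truth.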
What carries the argument
SAFE (Security-Aware Framework for Artifact Evaluation), which applies a context-aware taxonomy to static-analysis results to filter false positives and assess real exploitability in shared code.
If this is right
- Artifact evaluation at conferences should add security assessments alongside reproducibility checks.
- Researchers releasing code must review it for insecure patterns that could be misused by others.
- Automated classification tools can make security review of artifacts feasible at scale.
- Safe research sharing requires explicit attention to potential attack vectors introduced by public code.
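One familiar instance of an insecure pattern of this kind is OS command injection (CWE-78). The sketch below is illustrative only and is not drawn from any of the studied artifacts; the function names and the `echo` stand-in command are hypothetical. It contrasts a shell-interpolated call, which static analyzers typically flag, with an argument-list call that never invokes a shell.

```python
import subprocess
import sys


def run_unsafe(filename: str) -> str:
    """Insecure pattern: untrusted text is spliced into a shell command line."""
    # With shell=True, a value such as "data.csv; echo injected" is parsed by
    # the shell as two commands, so the text after ';' is executed as well.
    result = subprocess.run(
        f"echo {filename}", shell=True, capture_output=True, text=True
    )
    return result.stdout


def run_safer(filename: str) -> str:
    """Safer pattern: pass an argument list and do not invoke a shell."""
    # Each list element reaches the child process as one literal argument.
    child = "import sys; print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child, filename],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    payload = "data.csv; echo injected"
    # The safer variant treats the whole payload as a single file name.
    print(run_safer(payload), end="")
```

Whether the first variant is a real risk depends on where `filename` comes from, which is the kind of context the paper's taxonomy is meant to capture.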
Where Pith is reading between the lines
- Open-science policies may eventually require security vetting steps for all publicly released code, not only in security venues.
- The same context-aware filtering approach could be tested on artifacts from non-security fields to measure whether risks appear at similar rates.
- If demonstrated exploits remain rare despite the flagged patterns, the priority might shift toward better documentation of safe usage rather than blocking releases.
Load-bearing premise
Manual filtering of static-analysis false positives and the selected usage contexts accurately reflect real-world exploitability.
What would settle it
A controlled experiment that reuses the artifacts in realistic settings and demonstrates that none of the flagged patterns can be successfully exploited would disprove the claim of practical security concerns.
Original abstract
Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines 509 research artifacts from top-tier security venues, applies static analysis to detect insecure code patterns, and after manual false-positive filtering reports that 41.60% of prevalent findings may pose security concerns under practical usage. It introduces a context-aware taxonomy, proposes the SAFE framework for automated risk classification, and claims SAFE attains 84.80% accuracy and 84.63% F1-score on the manually labeled data. The work argues that artifact evaluation should incorporate security considerations and releases the SAFE source code.
Significance. If the empirical findings hold, the study provides a large-scale measurement of overlooked security risks in publicly shared research artifacts and offers a reproducible starting point (open-source SAFE) for automated, context-aware assessment. The scale (509 artifacts) and quantitative reporting are clear strengths for an empirical security paper.
major comments (3)
- [§4] §4 (Artifact Collection and Analysis): The selection criteria for the 509 artifacts and the exact process for manually filtering static-analysis false positives are not described in sufficient detail (e.g., no inter-rater agreement statistics, no explicit decision rules for “practical usage” contexts). This directly affects the reliability of the headline 41.60% figure.
- [§5] §5 (SAFE Evaluation): The ground-truth labels used to compute SAFE’s 84.80% accuracy and 84.63% F1-score are the same manually filtered labels that produce the 41.60% result. Without independent validation (dynamic execution, exploit demonstrations, or external oracle), both the prevalence claim and the framework’s reported performance rest on the same unverified manual step.
- [§5.3] §5.3 (Practical Exploitability): No concrete exploit demonstrations or dynamic confirmation are provided for the artifacts labeled as security risks. The assessment therefore relies entirely on static patterns plus chosen usage contexts, which may over- or under-estimate real-world exploitability.
minor comments (2)
- [§3] The taxonomy in §3 is presented at a high level; a concrete mapping from taxonomy categories to the static-analysis rules used in the study would improve reproducibility.
- [Figure 2, Table 4] Figure 2 and Table 4 would benefit from clearer captions indicating whether the percentages are computed before or after manual filtering.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate revisions to be incorporated in the next version of the manuscript.
- Referee: [§4] §4 (Artifact Collection and Analysis): The selection criteria for the 509 artifacts and the exact process for manually filtering static-analysis false positives are not described in sufficient detail (e.g., no inter-rater agreement statistics, no explicit decision rules for “practical usage” contexts). This directly affects the reliability of the headline 41.60% figure.
  Authors: We agree that greater detail on artifact selection and the manual filtering process is required to support the reliability of the 41.60% figure. In the revised manuscript we will expand §4 with a dedicated subsection that specifies the selection criteria (venues, years, and requirement that artifacts be publicly available code repositories) and provides the explicit decision rules applied during manual false-positive filtering with respect to practical usage contexts. We will also state that filtering was performed by the authors without computing inter-rater agreement and will discuss this as a methodological limitation.
  Revision: yes
- Referee: [§5] §5 (SAFE Evaluation): The ground-truth labels used to compute SAFE’s 84.80% accuracy and 84.63% F1-score are the same manually filtered labels that produce the 41.60% result. Without independent validation (dynamic execution, exploit demonstrations, or external oracle), both the prevalence claim and the framework’s reported performance rest on the same unverified manual step.
  Authors: We acknowledge that the ground-truth labels for SAFE’s accuracy and F1-score are the same manually derived labels used for the prevalence statistic; this is by design because SAFE automates the context-aware classification we performed manually. We will add an explicit limitations paragraph in §5 noting the absence of independent validation (dynamic execution or external oracles) and will frame the reported metrics as an initial validation against our own manual oracle. We will also outline future work directions for obtaining such independent validation.
  Revision: partial
- Referee: [§5.3] §5.3 (Practical Exploitability): No concrete exploit demonstrations or dynamic confirmation are provided for the artifacts labeled as security risks. The assessment therefore relies entirely on static patterns plus chosen usage contexts, which may over- or under-estimate real-world exploitability.
  Authors: We agree that the lack of concrete exploit demonstrations or dynamic confirmation is a limitation of the current assessment. Performing such validation at the scale of 509 artifacts lies outside the scope of the present study, which focuses on identifying potential risks via static analysis and context. In the revision we will strengthen §5.3 by adding explicit caveats about possible over- or under-estimation inherent to static-plus-context methods and will clarify that the labeled risks are intended to flag artifacts for further investigation rather than to assert confirmed exploitability.
  Revision: yes
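The inter-rater agreement statistic discussed in the first response is commonly reported as Cohen's kappa. The sketch below shows one way it could be computed if two annotators labeled the same findings independently. The annotator lists and the `SECURITY_RISK` label are hypothetical, and nothing here reflects labels or procedures from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    RISK, BENIGN = "SECURITY_RISK", "BENIGN_RESEARCH_USAGE"  # hypothetical label set
    annotator_1 = [RISK] * 5 + [BENIGN] * 5
    annotator_2 = [RISK] * 4 + [BENIGN] * 5 + [RISK]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the data set.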
Circularity Check
No circularity: empirical measurement study with external data and standard evaluation.
The paper conducts static analysis on 509 external research artifacts, applies manual false-positive filtering, and reports an empirical 41.60% risk rate plus SAFE's accuracy/F1 on the resulting labels. No equations, fitted parameters, self-definitional loops, or predictions that reduce to inputs by construction appear. No uniqueness theorems or ansatzes are imported via self-citation. The central claims rest on direct artifact examination and conventional ML hold-out evaluation rather than internal redefinitions, making the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- SAFE framework: no independent evidence
A Case Study 1 Output
Listing 1.2: Findings label output
{ "security_label": "BENIGN_RESEARCH_USAGE", "code_purpose": "The file .... private key .... for local .... workflows.", "execution_context": "The key is...