
arxiv: 2605.06508 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.AI

On the Security of Research Artifacts

Pith reviewed 2026-05-08 08:56 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords research artifacts · security risks · artifact evaluation · static analysis · code vulnerabilities · reproducibility · SAFE framework

The pith

Many shared research artifacts contain insecure code patterns that can create practical security risks when reused.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors analyzed 509 artifacts from top security conferences and found that 41.60% of the prevalent static-analysis findings may pose security concerns under practical usage. They developed a taxonomy that evaluates risk by considering code context and usage scenarios rather than isolated patterns. To scale this assessment, they built the SAFE framework, which classifies findings as security or non-security risks with 84.80% accuracy by weighing code semantics, execution context, and practical exploitability. The work argues that current artifact evaluation, which focuses on whether artifacts work as claimed and can be reproduced, largely overlooks these risks and should add security checks for safer public sharing.

Core claim

Static analysis of 509 research artifacts reveals that 41.60% of prevalent findings may pose security concerns under practical usage, and the SAFE framework distinguishes security from non-security risks at 84.80% accuracy and 84.63% F1-score by weighing code semantics, execution context, and practical exploitability.
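
For readers who want to see how the two reported metrics relate, the following is a minimal sketch of accuracy and F1 for a two-class labelling task. The function name, the label strings, and the five example labels are illustrative assumptions, not data from the paper, and the paper may compute F1 with a different averaging scheme (for example macro or weighted averaging).

```python
def accuracy_and_f1(y_true, y_pred, positive="security"):
    """Return (accuracy, F1) for binary labels, with F1 taken on `positive`."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical manual labels versus hypothetical classifier output.
manual = ["security", "security", "non-security", "non-security", "security"]
predicted = ["security", "non-security", "non-security", "security", "security"]
acc, f1 = accuracy_and_f1(manual, predicted)
```

Because both numbers are computed against the manual labels, they measure agreement with the labellers rather than confirmed exploitability.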

What carries the argument

SAFE (Security-Aware Framework for Artifact Evaluation), which applies a context-aware taxonomy to static-analysis results to filter false positives and assess real exploitability in shared code.
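
The idea of weighing a static finding against its usage context can be illustrated with a small rule-based sketch. This is not SAFE's implementation: the `Finding` fields, the path prefixes, the rule names, and the decision rule are all hypothetical, chosen only to show how the same tool alert can receive different labels depending on reachability and attacker-controlled input.

```python
from dataclasses import dataclass

# Hypothetical prefixes for code that is not on an artifact's normal execution path.
NON_DEPLOYED_PREFIXES = ("tests/", "docs/", "examples/")


@dataclass(frozen=True)
class Finding:
    rule: str              # name of the static-analysis rule that fired (made up here)
    path: str              # file in which the finding was reported
    attacker_input: bool   # can untrusted input reach the flagged code?
    runs_by_default: bool  # is the code executed when the artifact is used as documented?


def triage(finding: Finding) -> str:
    """Label a finding 'security' only if it is reachable and attacker-influenced."""
    reachable = finding.runs_by_default and not finding.path.startswith(
        NON_DEPLOYED_PREFIXES
    )
    if reachable and finding.attacker_input:
        return "security"
    return "non-security"


findings = [
    Finding("os-command-injection", "server/run.py", True, True),
    Finding("os-command-injection", "tests/helper.py", True, True),
    Finding("hardcoded-key", "server/config.py", False, True),
]
labels = [triage(f) for f in findings]
```

In this toy rule set the first finding is labelled a security risk, while the same rule firing in a test helper is not. Real context assessment would need far richer evidence than two boolean flags.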

If this is right

  • Artifact evaluation at conferences should add security assessments alongside reproducibility checks.
  • Researchers releasing code must review it for insecure patterns that could be misused by others.
  • Automated classification tools can make security review of artifacts feasible at scale.
  • Safe research sharing requires explicit attention to potential attack vectors introduced by public code.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-science policies may eventually require security vetting steps for all publicly released code, not only in security venues.
  • The same context-aware filtering approach could be tested on artifacts from non-security fields to measure whether risks appear at similar rates.
  • If demonstrated exploits remain rare despite the flagged patterns, the priority might shift toward better documentation of safe usage rather than blocking releases.

Load-bearing premise

Manual filtering of static-analysis false positives and the selected usage contexts accurately reflect real-world exploitability.

What would settle it

A controlled experiment that reuses the artifacts in realistic settings and demonstrates that none of the flagged patterns can be successfully exploited would disprove the claim of practical security concerns.

Figures

Figures reproduced from arXiv: 2605.06508 by Christian Rossow, Nanda Rani.

Figure 1: Overview of the collected artifacts. The diversity in both security domains …
Figure 2: Overview of the formal taxonomy categorization
Figure 3: SAFE Workflow
Figure 4: Decision logic workflow for assessing the exploitability of static findings.
Original abstract

Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE
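
As a concrete picture of what an insecure code pattern can look like, the sketch below contrasts OS command injection (CWE-78) with a safer argument-vector call. It is an illustration written for this page, not code taken from any studied artifact, and the abstract does not say which patterns were most common. The function names and the payload string are hypothetical.

```python
import subprocess
import sys


def build_shell_command(user_path: str) -> str:
    """Insecure pattern: untrusted text is spliced into a shell command string.

    If this string were passed to subprocess.run(..., shell=True), the shell
    would treat ';' as a command separator and run whatever follows it.
    The string is only built here, never executed.
    """
    return f"cat {user_path}"


def run_without_shell(user_path: str) -> str:
    """Safer pattern: pass an argument list and no shell, so metacharacters stay data."""
    echo_script = "import sys; print(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", echo_script, user_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


payload = "data.txt; echo INJECTED"
unsafe_command = build_shell_command(payload)  # contains a second, injected command
safe_output = run_without_shell(payload)       # the payload arrives as one argument
```

Whether such a pattern is a practical risk still depends on context, for example whether the path ever comes from an untrusted source.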

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines 509 research artifacts from top-tier security venues, applies static analysis to detect insecure code patterns, and after manual false-positive filtering reports that 41.60% of prevalent findings may pose security concerns under practical usage. It introduces a context-aware taxonomy, proposes the SAFE framework for automated risk classification, and claims SAFE attains 84.80% accuracy and 84.63% F1-score on the manually labeled data. The work argues that artifact evaluation should incorporate security considerations and releases the SAFE source code.

Significance. If the empirical findings hold, the study provides a large-scale measurement of overlooked security risks in publicly shared research artifacts and offers a reproducible starting point (open-source SAFE) for automated, context-aware assessment. The scale (509 artifacts) and quantitative reporting are clear strengths for an empirical security paper.

major comments (3)
  1. [§4] §4 (Artifact Collection and Analysis): The selection criteria for the 509 artifacts and the exact process for manually filtering static-analysis false positives are not described in sufficient detail (e.g., no inter-rater agreement statistics, no explicit decision rules for “practical usage” contexts). This directly affects the reliability of the headline 41.60% figure.
  2. [§5] §5 (SAFE Evaluation): The ground-truth labels used to compute SAFE’s 84.80% accuracy and 84.63% F1-score are the same manually filtered labels that produce the 41.60% result. Without independent validation (dynamic execution, exploit demonstrations, or external oracle), both the prevalence claim and the framework’s reported performance rest on the same unverified manual step.
  3. [§5.3] §5.3 (Practical Exploitability): No concrete exploit demonstrations or dynamic confirmation are provided for the artifacts labeled as security risks. The assessment therefore relies entirely on static patterns plus chosen usage contexts, which may over- or under-estimate real-world exploitability.
minor comments (2)
  1. [§3] The taxonomy in §3 is presented at a high level; a concrete mapping from taxonomy categories to the static-analysis rules used in the study would improve reproducibility.
  2. [Figure 2, Table 4] Figure 2 and Table 4 would benefit from clearer captions indicating whether the percentages are computed before or after manual filtering.
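
The inter-rater agreement statistic requested in major comment 1 is commonly reported as Cohen's kappa. A minimal sketch follows, assuming two annotators label the same findings; the function name and the six example labels ("S" for security, "N" for non-security) are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    all_labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on six findings.
annotator_1 = ["S", "S", "N", "N", "S", "N"]
annotator_2 = ["S", "N", "N", "N", "S", "S"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the raters agree on four of six items, but half of that agreement is expected by chance, which is why kappa is more informative than raw agreement.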

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate revisions to be incorporated in the next version of the manuscript.

point-by-point responses
  1. Referee: [§4] §4 (Artifact Collection and Analysis): The selection criteria for the 509 artifacts and the exact process for manually filtering static-analysis false positives are not described in sufficient detail (e.g., no inter-rater agreement statistics, no explicit decision rules for “practical usage” contexts). This directly affects the reliability of the headline 41.60% figure.

    Authors: We agree that greater detail on artifact selection and the manual filtering process is required to support the reliability of the 41.60% figure. In the revised manuscript we will expand §4 with a dedicated subsection that specifies the selection criteria (venues, years, and requirement that artifacts be publicly available code repositories) and provides the explicit decision rules applied during manual false-positive filtering with respect to practical usage contexts. We will also state that filtering was performed by the authors without computing inter-rater agreement and will discuss this as a methodological limitation. revision: yes

  2. Referee: [§5] §5 (SAFE Evaluation): The ground-truth labels used to compute SAFE’s 84.80% accuracy and 84.63% F1-score are the same manually filtered labels that produce the 41.60% result. Without independent validation (dynamic execution, exploit demonstrations, or external oracle), both the prevalence claim and the framework’s reported performance rest on the same unverified manual step.

    Authors: We acknowledge that the ground-truth labels for SAFE’s accuracy and F1-score are the same manually derived labels used for the prevalence statistic; this is by design because SAFE automates the context-aware classification we performed manually. We will add an explicit limitations paragraph in §5 noting the absence of independent validation (dynamic execution or external oracles) and will frame the reported metrics as an initial validation against our own manual oracle. We will also outline future work directions for obtaining such independent validation. revision: partial

  3. Referee: [§5.3] §5.3 (Practical Exploitability): No concrete exploit demonstrations or dynamic confirmation are provided for the artifacts labeled as security risks. The assessment therefore relies entirely on static patterns plus chosen usage contexts, which may over- or under-estimate real-world exploitability.

    Authors: We agree that the lack of concrete exploit demonstrations or dynamic confirmation is a limitation of the current assessment. Performing such validation at the scale of 509 artifacts lies outside the scope of the present study, which focuses on identifying potential risks via static analysis and context. In the revision we will strengthen §5.3 by adding explicit caveats about possible over- or under-estimation inherent to static-plus-context methods and will clarify that the labeled risks are intended to flag artifacts for further investigation rather than to assert confirmed exploitability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement study with external data and standard evaluation.

full rationale

The paper conducts static analysis on 509 external research artifacts, applies manual false-positive filtering, and reports an empirical 41.60% risk rate plus SAFE's accuracy/F1 on the resulting labels. No equations, fitted parameters, self-definitional loops, or predictions that reduce to inputs by construction appear. No uniqueness theorems or ansatzes are imported via self-citation. The central claims rest on direct artifact examination and conventional ML hold-out evaluation rather than internal redefinitions, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

This is an empirical measurement and tool-building study with no mathematical derivations, free parameters, or background axioms required for a central claim.

invented entities (1)
  • SAFE framework (no independent evidence)
    purpose: Autonomous filtering of static-analysis findings using code semantics, execution context, and exploitability
    New system introduced by the authors to enable scalable analysis.


