CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis
Pith reviewed 2026-05-09 21:04 UTC · model grok-4.3
The pith
In a curated benchmark of 15 Python CVEs, 87% of multi-commit vulnerability chains go undetected by per-commit static analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The per-commit detection rate (CCDR) is 13% (2 of 15 vulnerabilities), so 87% of chains are invisible to per-commit SAST. The two per-commit detections that do occur are qualitatively poor: one occurs on commits framed as security fixes, and the other flags only a minor hardcoded-key component while missing the primary vulnerability. Even in cumulative mode, with the full codebase present, the detection rate is only 27%.
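The arithmetic behind these percentages can be checked with a short sketch. The benchmark size (15) and the per-commit count (2) come from the source; the cumulative count of 4 is an assumption inferred from the reported 27%, since the source does not state it directly.

```python
TOTAL_CHAINS = 15     # benchmark size stated in the source
PER_COMMIT_HITS = 2   # stated as 2/15 in the referee summary
CUMULATIVE_HITS = 4   # assumption: the only integer count that rounds to 27% of 15


def detection_rate(hits: int, total: int) -> float:
    """Fraction of vulnerability chains flagged at least once."""
    if total <= 0 or not 0 <= hits <= total:
        raise ValueError("need 0 <= hits <= total and total > 0")
    return hits / total


ccdr = detection_rate(PER_COMMIT_HITS, TOTAL_CHAINS)
cumulative = detection_rate(CUMULATIVE_HITS, TOTAL_CHAINS)
print(f"per-commit: {ccdr:.0%}, invisible: {1 - ccdr:.0%}, cumulative: {cumulative:.0%}")
```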
What carries the argument
The CrossCommitVuln-Bench dataset with its manually annotated commit chains and structured rationales explaining why each individual commit evades per-commit static analysis.
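One way to picture such an annotated record is the sketch below. The class names, field names, and placeholder values are all hypothetical; the released annotation schema may be organized quite differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommitAnnotation:
    """One contributing commit and why it evades per-commit analysis."""
    sha: str                # hypothetical field: commit identifier
    evasion_rationale: str  # hypothetical field: structured rationale text


@dataclass
class ChainRecord:
    """One CVE with its contributing commit chain and baseline results."""
    cve_id: str
    commits: list[CommitAnnotation]
    per_commit_hits: dict[str, bool] = field(default_factory=dict)  # tool -> flagged?
    cumulative_hits: dict[str, bool] = field(default_factory=dict)  # tool -> flagged?

    def is_multi_commit(self) -> bool:
        return len(self.commits) >= 2

    def detected(self, mode: str) -> bool:
        """True if any tool flagged the chain in the given mode."""
        hits = {"per_commit": self.per_commit_hits,
                "cumulative": self.cumulative_hits}[mode]
        return any(hits.values())


# Placeholder values only; not taken from the dataset.
example = ChainRecord(
    cve_id="CVE-0000-00000",
    commits=[
        CommitAnnotation("aaaaaaa", "placeholder: adds a helper that looks safe alone"),
        CommitAnnotation("bbbbbbb", "placeholder: wires untrusted input to the helper"),
    ],
    per_commit_hits={"semgrep": False, "bandit": False},
    cumulative_hits={"semgrep": False, "bandit": True},
)
print(example.detected("per_commit"), example.detected("cumulative"))
```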
Load-bearing premise
The 15 selected CVEs are representative of multi-commit vulnerabilities in general and the manual annotations accurately capture the contributing commits and evasion reasons without bias.
What would settle it
The central claim would be falsified by a substantially larger collection of multi-commit Python CVEs on which per-commit SAST tools detect well above 13% of cases.
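How far above 13% a larger study would need to land depends on how uncertain a rate measured on 15 items is. The sketch below computes a Wilson score interval for 2 detections out of 15. This calculation is not in the paper, and it treats the 15 chains as if they were a simple random sample, which the curated selection does not guarantee.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need 0 <= hits <= n and n > 0")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(2, 15)
print(f"2/15 detections: interval {low:.1%} to {high:.1%}")
```

Under that assumption the interval is wide, so a replication would need a clearly higher rate on a much larger sample before it contradicted the reported figure.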
Original abstract
We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis, and baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other detects only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode (full codebase present), the detection rate is only 27%, confirming that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection.
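The difference between the two scanning modes can be illustrated with a toy model. The rule below is a deliberately naive stand-in for a SAST check, not a Semgrep or Bandit rule, and the two "commits" are invented for illustration. Per-commit mode sees only the lines each commit adds; cumulative mode sees the whole snapshot.

```python
import re

# Toy rule: flag eval() on a name that is assigned from request data
# somewhere in the same scanned text. Real tools are far more elaborate.
TAINT_ASSIGN = re.compile(r"(\w+)\s*=\s*request\.args")


def toy_rule(text: str) -> bool:
    tainted = set(TAINT_ASSIGN.findall(text))
    return any(
        re.search(rf"eval\(\s*{re.escape(name)}\s*\)", text) for name in tainted
    )


# Hypothetical commit chain: each entry is the list of lines a commit adds.
commits = [
    ["def compute(expr):", "    return eval(expr)"],           # sink, no visible source
    ["expr = request.args['q']", "result = compute(expr)"],    # source, no visible sink
]

# Per-commit mode: scan each commit's added lines in isolation.
per_commit = [toy_rule("\n".join(added)) for added in commits]

# Cumulative mode: scan everything present after the last commit.
snapshot = "\n".join(line for added in commits for line in added)
cumulative = toy_rule(snapshot)

print(per_commit, cumulative)
```

In this toy case the cumulative scan catches what the per-commit scans miss. The abstract reports that real tools often do not, with only 27% detected in cumulative mode.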
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CrossCommitVuln-Bench, a curated dataset of 15 Python CVEs in which exploitable conditions are introduced across multiple commits, each individually benign to per-commit static analysis (using Semgrep and Bandit), but collectively critical. It provides manual annotations of contributing commit chains and evasion rationales, reports a per-commit detection rate (CCDR) of 13% (2/15, with both detections qualitatively poor) and 27% in cumulative mode, and releases the dataset, annotations, evaluation scripts, and baselines under open licenses.
Significance. If the selection and annotation methodology can be strengthened, the benchmark would usefully highlight limitations of snapshot-based SAST for multi-commit vulnerabilities and supply a reproducible resource for developing cross-commit detection techniques. The open release of the full dataset, structured annotations, and evaluation code is a clear strength that supports follow-on work.
Major comments (3)
- Abstract: The claim that the results 'confirm that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits' is not supported by the reported measurements. The 15 CVEs were explicitly selected under the filter that each commit is individually benign to per-commit analysis; the resulting 13% per-commit and 27% cumulative rates are therefore expected by construction for this sample and cannot be extrapolated to unselected vulnerabilities without a described sampling frame from the broader CVE population.
- Dataset Construction and Annotation sections: The manual annotation of commit chains and evasion rationales lacks any reported protocol details, number of annotators, inter-annotator agreement metrics, or blinded validation steps. Because the 'benign' labels directly determine the 87% invisibility figure and the two reported detections (one on a security-fix commit, one partial), this omission is load-bearing for the central empirical claims.
- Evaluation section: The paper states that both per-commit detections are qualitatively poor, yet provides no details on tool configurations, rule sets, severity thresholds, or false-positive handling for Semgrep and Bandit. Without these, the 13% and 27% rates cannot be independently verified or compared to other studies.
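The inter-annotator agreement metric requested in the annotation comment above is commonly reported as Cohen's kappa. The sketch below computes it for two annotators who label the same commits. The labels are invented for illustration and do not come from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten commits from two annotators.
annotator_1 = ["benign"] * 8 + ["flaggable"] * 2
annotator_2 = ["benign"] * 7 + ["flaggable"] * 3
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```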
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our paper. We address each major comment in detail below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: The claim that the results 'confirm that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits' is not supported by the reported measurements. The 15 CVEs were explicitly selected under the filter that each commit is individually benign to per-commit analysis; the resulting 13% per-commit and 27% cumulative rates are therefore expected by construction for this sample and cannot be extrapolated to unselected vulnerabilities without a described sampling frame from the broader CVE population.
  Authors: We agree with the referee that the selection criteria make the low detection rates expected for this specific sample, and we do not claim to have a representative sampling frame from all CVEs. The benchmark is curated to highlight cases where vulnerabilities are introduced across commits in a way that evades per-commit analysis. We will revise the abstract to replace 'confirming' with 'illustrating' and clarify that the results demonstrate the potential for such invisibility in multi-commit scenarios, without implying a general prevalence. This revision will be made in the next version of the manuscript.
  Revision: yes
- Referee: Dataset Construction and Annotation sections: The manual annotation of commit chains and evasion rationales lacks any reported protocol details, number of annotators, inter-annotator agreement metrics, or blinded validation steps. Because the 'benign' labels directly determine the 87% invisibility figure and the two reported detections (one on a security-fix commit, one partial), this omission is load-bearing for the central empirical claims.
  Authors: The referee correctly identifies a gap in the reporting of our annotation process. While the annotations were performed manually by the authors using a consistent schema for identifying contributing commit chains and evasion rationales, the manuscript does not detail the protocol. We will add a new subsection under Dataset Construction that describes the annotation methodology, including the steps for identifying multi-commit chains, the rationale categories used, and the process for validating the 'benign' labels against the CVE descriptions. Although formal inter-annotator agreement metrics were not computed (as the work was conducted by a small team with iterative review), we will describe the reconciliation process to enhance transparency and allow readers to evaluate the reliability of the 87% figure.
  Revision: yes
- Referee: Evaluation section: The paper states that both per-commit detections are qualitatively poor, yet provides no details on tool configurations, rule sets, severity thresholds, or false-positive handling for Semgrep and Bandit. Without these, the 13% and 27% rates cannot be independently verified or compared to other studies.
  Authors: We acknowledge that the current manuscript lacks the necessary details on the SAST tool configurations to enable full reproducibility and comparison. We will revise the Evaluation section to include a comprehensive description of the setups: for Semgrep, the specific rules (e.g., from the python.security and python.lang.security categories), command options, and how alerts were filtered; for Bandit, the enabled tests, severity levels, and confidence thresholds. Additionally, we will explain the qualitative assessment criteria used to deem the detections 'poor' and how false positives were considered in the cumulative mode. These additions will be provided in the revised manuscript and accompanying code release to support independent verification.
  Revision: yes
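The configuration details discussed in the last response could be pinned down by recording the exact command line for each tool. The sketch below only builds and prints the commands; it does not run them. The ruleset `p/python`, the threshold values, and the target path are assumptions chosen for illustration, not the settings the authors used.

```python
import json
import shlex


def bandit_argv(target: str, severity: str = "low", confidence: str = "low") -> list[str]:
    """Bandit invocation: recursive scan, JSON report, explicit thresholds."""
    return ["bandit", "-r", target, "-f", "json",
            "--severity-level", severity, "--confidence-level", confidence]


def semgrep_argv(target: str, config: str = "p/python") -> list[str]:
    """Semgrep invocation: one named ruleset, JSON report."""
    return ["semgrep", "scan", "--config", config, "--json", target]


def manifest(target: str) -> str:
    """Serialize the exact commands so a run can be reproduced later."""
    runs = {"bandit": bandit_argv(target), "semgrep": semgrep_argv(target)}
    return json.dumps(
        {name: shlex.join(argv) for name, argv in runs.items()},
        indent=2, sort_keys=True,
    )


print(manifest("path/to/checkout"))  # hypothetical path
```

Storing such a manifest alongside tool versions would let a reader rerun both scanning modes with identical settings.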
Circularity Check
Curated selection of CVEs defined as per-commit invisible makes the 13% CCDR finding tautological by construction
Specific steps
- Self-definitional [Abstract]
"We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. [...] We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis [...] Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST."
The benchmark is explicitly defined and selected on the basis that individual commits are benign (invisible) to per-commit SAST, with manual annotations supplying the evasion rationales. The reported 13% detection rate and 87% invisibility therefore follow by construction from the curation filter and annotations rather than constituting an independent empirical result about tool behavior on a broader or randomly sampled set of vulnerabilities.
Full rationale
This is an empirical benchmark paper with no mathematical derivations, equations, or self-citations. However, the central quantitative claim reduces directly to the input selection criteria. The 15 CVEs were manually curated and annotated specifically because 'the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis' with 'structured rationale for why each commit evades per-commit analysis.' Running Semgrep and Bandit then produces the expected low detection rates (2/15 per-commit), which the paper presents as the 'central finding' of 13% CCDR and 87% invisible. This matches the self-definitional pattern: the reported invisibility rate is guaranteed by how the sample was assembled rather than emerging independently from an unselected population. The paper provides useful annotations and qualitative discussion of the two detections, but the headline statistic is not an independent measurement. Score reflects partial circularity in the core empirical claim while acknowledging the dataset's descriptive value.
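The selection effect described above can be made concrete with a toy example. The population below is invented, and the curation filter is a simplified assumption in which chains are kept only if no per-commit scan flags them. The paper's actual curation relied on manual judgment, which is presumably why 2 of its 15 chains were still flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    name: str
    per_commit_flagged: bool  # would a per-commit scan flag any commit?


# Hypothetical population: half of all multi-commit chains are flagged.
population = [Chain(f"chain-{i}", per_commit_flagged=(i % 2 == 0)) for i in range(20)]


def detection_rate(chains: list[Chain]) -> float:
    return sum(c.per_commit_flagged for c in chains) / len(chains)


# Curating on "each commit looks benign to per-commit analysis"
# removes exactly the chains that a per-commit scan would detect.
curated = [c for c in population if not c.per_commit_flagged]

print(f"population rate: {detection_rate(population):.0%}")
print(f"curated rate:    {detection_rate(curated):.0%}")
```

The curated rate says nothing about the population rate, which is the core of the circularity concern.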