CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Arunabh Majumdar

arXiv: 2604.21917 · v1 · submitted 2026-04-23 · cs.CR · cs.SE

Pith reviewed 2026-05-09 21:04 UTC · model grok-4.3

keywords multi-commit vulnerabilities · static application security testing · SAST · Python CVEs · benchmark dataset · vulnerability detection · cross-commit analysis · Semgrep Bandit evaluation

The pith

In a curated benchmark of 15 Python CVEs, 87% of the multi-commit vulnerability chains are invisible to per-commit static analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CrossCommitVuln-Bench, a dataset of 15 real-world Python vulnerabilities where the exploitable condition emerges only across a chain of commits, each appearing benign individually. It manually annotates the contributing commits and evasion reasons for each, then evaluates two common SAST tools, Semgrep and Bandit, in per-commit and cumulative modes. The core result is that per-commit detection catches only 13% of these vulnerabilities, with even cumulative scanning reaching just 27%. This matters because it highlights a gap in standard vulnerability detection practices that assume issues appear in single snapshots or commits. The dataset and baselines are released openly to enable further research on detecting such cross-commit issues.

Core claim

The per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST. The two per-commit detections that do occur are qualitatively poor, one on a security-fix commit and one missing the main vulnerability. Even in cumulative mode the detection rate is only 27%.
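
As a quick arithmetic check on these percentages, the following minimal sketch recomputes them from counts. The 2-of-15 per-commit count appears in the referee summary; the 4-of-15 cumulative count is an assumption, inferred because it is the only whole number of chains out of 15 that rounds to 27%. The function and constant names are illustrative and not taken from the paper's scripts.

```python
from fractions import Fraction

TOTAL_CHAINS = 15          # stated in the paper
PER_COMMIT_DETECTED = 2    # stated in the referee summary (2/15)
CUMULATIVE_DETECTED = 4    # assumption: inferred from the rounded 27%


def detection_rate(detected: int, total: int) -> Fraction:
    """Share of vulnerability chains on which a tool raised a relevant alert."""
    if total <= 0 or not 0 <= detected <= total:
        raise ValueError("need 0 <= detected <= total and total > 0")
    return Fraction(detected, total)


def as_percent(rate: Fraction) -> int:
    """Round a rate to the nearest whole percent, as the paper reports it."""
    return round(float(rate) * 100)


ccdr = detection_rate(PER_COMMIT_DETECTED, TOTAL_CHAINS)
cumulative = detection_rate(CUMULATIVE_DETECTED, TOTAL_CHAINS)

# Per-commit rate, its complement (the "invisible" share), cumulative rate.
print(as_percent(ccdr), as_percent(1 - ccdr), as_percent(cumulative))  # 13 87 27
```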

What carries the argument

The CrossCommitVuln-Bench dataset with its manually annotated commit chains and structured rationales explaining why each individual commit evades per-commit static analysis.

Load-bearing premise

The 15 selected CVEs are representative of multi-commit vulnerabilities in general and the manual annotations accurately capture the contributing commits and evasion reasons without bias.

What would settle it

Finding a substantially larger collection of multi-commit Python CVEs where per-commit SAST tools detect more than 13% of the cases would falsify the central claim.
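
To give a sense of how much a larger collection could move the figure, here is a minimal sketch of a 95% Wilson score interval around the reported 2-of-15 per-commit rate. The paper does not report an interval; the choice of the Wilson method and the function name are assumptions made for illustration, and the interval speaks only to sample size, not to the selection effect discussed below.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 2 detected chains out of 15, as reported for per-commit mode.
low, high = wilson_interval(2, 15)
print(f"{low:.3f} to {high:.3f}")
```

With only 15 chains the interval is wide, running from roughly 4% to roughly 38%, so a larger benchmark could land well away from 13% without contradicting this one.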

Original abstract

We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis, and baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other detects only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode (full codebase present), the detection rate is only 27%, confirming that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection.
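
The difference between the two scanning modes can be sketched with a toy example. Everything here is hypothetical: the two-commit history is invented rather than drawn from the dataset, `toy_rule` is a crude string check standing in for a real SAST rule, and modeling per-commit mode as "scan only the files the commit touched" is an assumption about the setup. Note also that the toy rule succeeds in cumulative mode, whereas the paper reports that real tools still miss most chains even with the full codebase present.

```python
from typing import Dict, List

# A commit is modeled as {path: full file content after the change}.
Commit = Dict[str, str]

# Hypothetical two-commit chain (not taken from the dataset).
HISTORY: List[Commit] = [
    # Commit 1 adds a file-reading helper; no user input reaches it yet.
    {"storage.py": "import os\n\nBASE = '/srv/files'\n\n"
                   "def read_file(name):\n"
                   "    return open(os.path.join(BASE, name)).read()\n"},
    # Commit 2 wires a request parameter into the helper.
    {"views.py": "from storage import read_file\n\n"
                 "def download(request):\n"
                 "    return read_file(request.args['name'])\n"},
]


def toy_rule(files: Dict[str, str]) -> bool:
    """Fire only when a user-input source and a file-open sink are both visible."""
    text = "\n".join(files.values())
    return "request.args" in text and "open(" in text


def scan_per_commit(history: List[Commit]) -> List[bool]:
    """Scan only the files each commit touched, one commit at a time."""
    return [toy_rule(commit) for commit in history]


def scan_cumulative(history: List[Commit]) -> bool:
    """Apply every commit in order, then scan the resulting tree once."""
    tree: Dict[str, str] = {}
    for commit in history:
        tree.update(commit)
    return toy_rule(tree)


print(scan_per_commit(HISTORY))  # [False, False]
print(scan_cumulative(HISTORY))  # True
```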

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a benchmark of 15 curated Python CVEs that span multiple commits and evade per-commit SAST by design, with released annotations and baselines, but the low detection rates do not generalize beyond the filtered sample.

The main takeaway is that this paper supplies a small, annotated benchmark of 15 real CVEs where the vulnerability only becomes exploitable after several commits, each of which looks clean to per-commit static analysis. The authors run Semgrep and Bandit in both per-commit and cumulative modes and report 13% and 27% detection, respectively. They also release the dataset, the commit-chain annotations, the evasion rationales, and the scripts.

Referee Report

3 major / 0 minor

Summary. The paper introduces CrossCommitVuln-Bench, a curated dataset of 15 Python CVEs in which exploitable conditions are introduced across multiple commits, each individually benign to per-commit static analysis (using Semgrep and Bandit), but collectively critical. It provides manual annotations of contributing commit chains and evasion rationales, reports a per-commit detection rate (CCDR) of 13% (2/15, with both detections qualitatively poor) and 27% in cumulative mode, and releases the dataset, annotations, evaluation scripts, and baselines under open licenses.

Significance. If the selection and annotation methodology can be strengthened, the benchmark would usefully highlight limitations of snapshot-based SAST for multi-commit vulnerabilities and supply a reproducible resource for developing cross-commit detection techniques. The open release of the full dataset, structured annotations, and evaluation code is a clear strength that supports follow-on work.

major comments (3)

Abstract: The claim that the results 'confirm that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits' is not supported by the reported measurements. The 15 CVEs were explicitly selected under the filter that each commit is individually benign to per-commit analysis; the resulting 13% per-commit and 27% cumulative rates are therefore expected by construction for this sample and cannot be extrapolated to unselected vulnerabilities without a described sampling frame from the broader CVE population.
Dataset Construction and Annotation sections: The manual annotation of commit chains and evasion rationales lacks any reported protocol details, number of annotators, inter-annotator agreement metrics, or blinded validation steps. Because the 'benign' labels directly determine the 87% invisibility figure and the two reported detections (one on a security-fix commit, one partial), this omission is load-bearing for the central empirical claims.
Evaluation section: The paper states that both per-commit detections are qualitatively poor, yet provides no details on tool configurations, rule sets, severity thresholds, or false-positive handling for Semgrep and Bandit. Without these, the 13% and 27% rates cannot be independently verified or compared to other studies.
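
The inter-annotator agreement metric requested in the second comment could be computed along the following lines. This is a minimal sketch of Cohen's kappa for two annotators; the paper reports no such metric, and the label names and the ten example judgments are hypothetical, invented only to exercise the function.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-commit judgments from two annotators (not from the paper).
annotator_a = ["benign"] * 8 + ["flagged"] * 2
annotator_b = ["benign"] * 7 + ["flagged"] * 3

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this made-up example the annotators disagree on one commit out of ten, which gives a kappa of about 0.74. Reporting a figure of this kind would let readers judge how stable the "benign" labels are.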

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each major comment in detail below, indicating where revisions will be made to strengthen the manuscript.

Referee: Abstract: The claim that the results 'confirm that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits' is not supported by the reported measurements. The 15 CVEs were explicitly selected under the filter that each commit is individually benign to per-commit analysis; the resulting 13% per-commit and 27% cumulative rates are therefore expected by construction for this sample and cannot be extrapolated to unselected vulnerabilities without a described sampling frame from the broader CVE population.

Authors: We agree with the referee that the selection criteria make the low detection rates expected for this specific sample, and we do not claim to have a representative sampling frame from all CVEs. The benchmark is curated to highlight cases where vulnerabilities are introduced across commits in a way that evades per-commit analysis. We will revise the abstract to replace 'confirming' with 'illustrating' and clarify that the results demonstrate the potential for such invisibility in multi-commit scenarios, without implying a general prevalence. This revision will be made in the next version of the manuscript. revision: yes
Referee: Dataset Construction and Annotation sections: The manual annotation of commit chains and evasion rationales lacks any reported protocol details, number of annotators, inter-annotator agreement metrics, or blinded validation steps. Because the 'benign' labels directly determine the 87% invisibility figure and the two reported detections (one on a security-fix commit, one partial), this omission is load-bearing for the central empirical claims.

Authors: The referee correctly identifies a gap in the reporting of our annotation process. While the annotations were performed manually by the authors using a consistent schema for identifying contributing commit chains and evasion rationales, the manuscript does not detail the protocol. We will add a new subsection under Dataset Construction that describes the annotation methodology, including the steps for identifying multi-commit chains, the rationale categories used, and the process for validating the 'benign' labels against the CVE descriptions. Although formal inter-annotator agreement metrics were not computed (as the work was conducted by a small team with iterative review), we will describe the reconciliation process to enhance transparency and allow readers to evaluate the reliability of the 87% figure. revision: yes
Referee: Evaluation section: The paper states that both per-commit detections are qualitatively poor, yet provides no details on tool configurations, rule sets, severity thresholds, or false-positive handling for Semgrep and Bandit. Without these, the 13% and 27% rates cannot be independently verified or compared to other studies.

Authors: We acknowledge that the current manuscript lacks the necessary details on the SAST tool configurations to enable full reproducibility and comparison. We will revise the Evaluation section to include a comprehensive description of the setups: for Semgrep, the specific rules (e.g., from the python.security and python.lang.security categories), command options, and how alerts were filtered; for Bandit, the enabled tests, severity levels, and confidence thresholds. Additionally, we will explain the qualitative assessment criteria used to deem the detections 'poor' and how false positives were considered in the cumulative mode. These additions will be provided in the revised manuscript and accompanying code release to support independent verification. revision: yes
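
One way to make the promised configuration details checkable is to record the exact command lines used. The sketch below only builds the argument lists and does not run the tools. The `p/python` ruleset and the `medium` thresholds are placeholder assumptions, not the paper's settings, and the flags shown reflect the Semgrep and Bandit command lines as commonly documented; they should be checked against the installed versions.

```python
from typing import List

# Values accepted by Bandit's level options (assumed from its documented CLI).
BANDIT_LEVELS = ("all", "low", "medium", "high")


def semgrep_command(target: str, config: str = "p/python") -> List[str]:
    """Build a Semgrep invocation that emits JSON findings for `target`."""
    return ["semgrep", "scan", "--config", config, "--json", target]


def bandit_command(target: str, severity: str = "medium",
                   confidence: str = "medium") -> List[str]:
    """Build a recursive Bandit invocation with explicit reporting thresholds."""
    if severity not in BANDIT_LEVELS or confidence not in BANDIT_LEVELS:
        raise ValueError(f"levels must be one of {BANDIT_LEVELS}")
    return ["bandit", "-r", target, "-f", "json",
            "--severity-level", severity,
            "--confidence-level", confidence]


# Print the commands so they can be pasted into a methods section or a log.
print(" ".join(semgrep_command("repo/")))
print(" ".join(bandit_command("repo/")))
```

Logging these lists next to each result would let a reader rerun the same scan and see which thresholds produced the 13% and 27% figures.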

Circularity Check

1 step flagged

Curated selection of CVEs defined as per-commit invisible makes the 13% CCDR finding tautological by construction

specific steps

self-definitional [Abstract]
"We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis - but collectively critical. [...] We manually annotate each CVE with its contributing commit chain, a structured rationale for why each commit evades per-commit analysis [...] Our central finding: the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities - 87% of chains are invisible to per-commit SAST."

The benchmark is explicitly defined and selected on the basis that individual commits are benign (invisible) to per-commit SAST, with manual annotations supplying the evasion rationales. The reported 13% detection rate and 87% invisibility therefore follow by construction from the curation filter and annotations rather than constituting an independent empirical result about tool behavior on a broader or randomly sampled set of vulnerabilities.

full rationale

This is an empirical benchmark paper with no mathematical derivations, equations, or self-citations. However, the central quantitative claim reduces directly to the input selection criteria. The 15 CVEs were manually curated and annotated specifically because 'the exploitable condition was introduced across multiple commits - each individually benign to per-commit static analysis' with 'structured rationale for why each commit evades per-commit analysis.' Running Semgrep and Bandit then produces the expected low detection rates (2/15 per-commit), which the paper presents as the 'central finding' of 13% CCDR and 87% invisible. This matches the self-definitional pattern: the reported invisibility rate is guaranteed by how the sample was assembled rather than emerging independently from an unselected population. The paper provides useful annotations and qualitative discussion of the two detections, but the headline statistic is not an independent measurement. Score reflects partial circularity in the core empirical claim while acknowledging the dataset's descriptive value.
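
The selection effect described here can be made concrete with a small calculation. The sketch below is a simplified model, not the paper's procedure: it assumes a curator keeps only chains judged invisible to per-commit tools, and that this judgment matches actual tool behavior with a fixed, symmetric accuracy. Both parameters and all the numbers fed in are hypothetical.

```python
def curated_detection_rate(population_rate: float, curator_accuracy: float) -> float:
    """P(tool detects chain | curator kept chain), by Bayes' rule.

    population_rate: hypothetical share of all chains the tool would detect.
    curator_accuracy: hypothetical probability that the curator's
        "looks invisible" judgment matches what the tool actually does.
    """
    for value in (population_rate, curator_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Detected chains slip through only when the curator misjudges them.
    kept_and_detected = population_rate * (1.0 - curator_accuracy)
    # Undetected chains are kept when the curator judges them correctly.
    kept_and_missed = (1.0 - population_rate) * curator_accuracy
    kept = kept_and_detected + kept_and_missed
    if kept == 0.0:
        raise ValueError("the filter keeps no chains under these inputs")
    return kept_and_detected / kept


# Very different underlying rates all shrink sharply after curation.
for rate in (0.2, 0.5, 0.8):
    print(rate, round(curated_detection_rate(rate, 0.9), 3))
```

Under these assumptions, a tool that detects half of all chains in the wild would still show only about a 10% detection rate on the curated sample. That is why the headline figure says little about the unselected population.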

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is based on curation of existing CVEs and standard tool evaluations.

pith-pipeline@v0.9.0 · 5511 in / 1195 out tokens · 49831 ms · 2026-05-09T21:04:30.727584+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1] Guru Prasad Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). 30–39. https://doi.org/10.1145/3475960.3475985

[2] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 508–512. https://doi.org/10.1145/3379597.3387501

[3] GitHub Inc. 2024. CodeQL: Semantic Code Analysis Engine. https://codeql.github.com

[4] Arunabh Majumdar. 2026. POSTURA: Graph-Based Cross-Commit Security Analysis. https://pypi.org/project/postura/ (PyPI) and https://github.com/motornomad/postura (source)

[5] NSA Center for Assured Software. 2017. Juliet Test Suite v1.3 for C/C++ and Java. Technical Report. National Security Agency

[6] PyCQA. 2024. Bandit: A Tool Designed to Find Common Security Issues in Python Code. https://github.com/PyCQA/bandit

[7] Semgrep Team. 2024. Semgrep: Fast, Open-Source, Static Analysis for Finding Bugs and Enforcing Code Standards. https://semgrep.dev

[8] Dave Wichers and Jim Manico. 2021. OWASP Benchmark: A Free and Open Test Suite for Evaluating the Accuracy of Software Vulnerability Detection Tools. https://owasp.org/www-project-benchmark/

[9] Yunhui Zheng, Saurabh Pujar, Benjamin Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, and Zhong Su. 2021. D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis. In Proceedings of the IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE... doi:10.1109/icse-seip52600.2021.00018

[10] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 32
