{"paper":{"title":"Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Code language models acquire no transferable security understanding from vulnerability-fixing code changes alone.","cross_cats":["cs.CR","cs.LG"],"primary_cat":"cs.SE","authors_text":"Felix Mächtle, Joseph Bienhüls, Kristoffer Hempel, Nils Loose, Thomas Eisenbarth","submitted_at":"2026-05-13T08:05:14Z","abstract_excerpt":"Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of code language model based VFC detection through a unified framework consolidating over 20 fragmented datasets spanning more than 180000 commits. Across over 180 experiments with fine-tuned models from 125 M to 14 B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. Whe"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We find no evidence that models acquire transferable security-relevant code understanding from code changes alone. 
When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The consolidated datasets from prior sources contain accurate, unbiased labels for vulnerability-fixing commits and that the chosen evaluation splits (random, group-stratified, temporal) reflect realistic deployment conditions without unmeasured distributional shifts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Code language models acquire no transferable security understanding from vulnerability-fixing code changes alone.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"25a1be96350817c713baecf030316abf00b64d86c8d8547d77c3207058b2a9b0"},"source":{"id":"2605.13138","kind":"arxiv","version":1},"verdict":{"id":"2516d9de-f028-4b7e-ab9a-2d88ca47ed50","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:35:52.994809Z","strongest_claim":"We find no evidence that models acquire transferable security-relevant code understanding from code changes alone. 
When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes.","one_line_summary":"Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The consolidated datasets from prior sources contain accurate, unbiased labels for vulnerability-fixing commits and that the chosen evaluation splits (random, group-stratified, temporal) reflect realistic deployment conditions without unmeasured distributional shifts.","pith_extraction_headline":"Code language models acquire no transferable security understanding from vulnerability-fixing code changes alone."},"references":{"count":69,"sample":[{"doi":"10.1145/3663533.3664036","year":2024,"title":"Jafar Akhoundali, Sajad Rahim Nouri, Kristian F. D. Rietveld, and Olga Gady- atskaya. 2024. MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery. InProceedin","work_id":"d795933f-251b-448c-b3cd-51621a1f7c8a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Dos and Don’ts of Machine Learning in Computer Security","work_id":"03ad60b9-20e3-4f7e-8d3f-e984a2958cba","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3475960","year":2021,"title":"Guru Prasad Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. InPROMISE ’21: 17th International Conference on ","work_id":"ed6551bc-10bb-4663-9e7a-65eb97dc084f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Max Brunsfeld. [n.d.]. Tree-sitter. 
https://github.com/tree-sitter/tree-sitter","work_id":"e32d1502-15d0-4f40-af79-04cac2c105d7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Tianyu Chen, Lin Li, Taotao Qian, Jingyi Liu, Wei Yang, Ding Li, Guangtai Liang, Qianxiang Wang, and Tao Xie. 2024. CompVPD: Iteratively Identifying Vulnerability Patches Based on Human Validation Res","work_id":"d04f0819-a582-4763-907e-aff185777919","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":69,"snapshot_sha256":"ba7a3ee75425b8a23c392ca1b79f358ea73f07777683b77ea192f59b8b0d5c53","internal_anchors":3},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}