Mapping NVD Records to Their Vulnerability-fixing Commits: How Hard is It?
Pith reviewed 2026-05-22 00:54 UTC · model grok-4.3
The pith
Git references in NVD records lead to the vulnerability-fixing commit over 86% of the time, while non-Git references succeed less than 14% of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mapping NVD records to their vulnerability-fixing commits is feasible mainly when Git references are present, where the success rate exceeds 86%. An automated system combining NVD reference extraction, six external security databases, and GitHub mining maps 26,710 unique records (11.3% of 235,341), with per-source precision of 87% (NVD references), 88.4% (external databases), and 73% (GitHub). The remaining 88.7% of records lack sufficient links for mapping.
What carries the argument
The automated extraction pipeline that classifies NVD references as Git or non-Git and mines commits from Git repositories, supplemented by data from external security databases and GitHub searches.
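The reference classification step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the host list `GIT_HOSTS`, the pattern `COMMIT_PATH`, the function name `classify_reference`, and the example URLs are all assumptions made for this sketch.

```python
import re
from urllib.parse import urlparse

# Hypothetical host list; the paper's actual Git/non-Git rules are not given here.
GIT_HOSTS = {"github.com", "gitlab.com", "bitbucket.org"}

# Assumed URL shape: .../commit/<sha>, .../commits/<sha>, or .../-/commit/<sha>,
# where <sha> is 7 to 40 hexadecimal characters.
COMMIT_PATH = re.compile(r"/(?:-/)?commits?/([0-9a-fA-F]{7,40})(?:[/?#.]|$)")


def classify_reference(url: str) -> dict:
    """Label a reference URL as Git or non-Git and pull out a commit hash if present."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    is_git = host in GIT_HOSTS
    match = COMMIT_PATH.search(parsed.path) if is_git else None
    return {
        "url": url,
        "kind": "git" if is_git else "non-git",
        "commit": match.group(1).lower() if match else None,
    }


if __name__ == "__main__":
    # Made-up example URLs, for illustration only.
    examples = [
        "https://github.com/example-org/example-repo/commit/0123456789abcdef0123456789abcdef01234567",
        "https://github.com/example-org/example-repo/issues/42",
        "https://bugzilla.redhat.com/show_bug.cgi?id=1",
    ]
    for ref in examples:
        print(classify_reference(ref))
```

A Git reference without a commit hash (an issue or release page, for example) still needs a further mining step to find the fix, which this sketch does not attempt.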
If this is right
- Over 26,000 NVD records can be mapped to fixing commits using the combined sources.
- Git references provide the highest success rate for automated identification.
- External databases contribute mappings with 88.4% precision.
- GitHub mining adds unique mappings, but at a lower precision of 73%.
- 88.7% of NVD records still cannot be mapped due to missing Git links.
Where Pith is reading between the lines
- Encouraging vulnerability reporters to include Git commit links in NVD entries could substantially increase future mapping coverage.
- The mapped dataset could support training machine learning models for automatic vulnerability fix detection.
- Projects without Git references in NVD might benefit from manual review or advanced code search techniques.
- Expanding the method to other vulnerability databases could create a more comprehensive security dataset.
Load-bearing premise
The manual analysis performed on a sample of NVD references accurately represents the characteristics and success rates for the entire collection of 235,341 records.
What would settle it
A full manual audit of the 26,710 mapped records revealing that fewer than 80% actually link to the correct fixing commits would disprove the reported precision.
Original abstract
Mapping National Vulnerability Database (NVD) records to vulnerability-fixing commits (VFCs) is crucial for vulnerability analysis but challenging due to sparse explicit links in NVD references. This study explores this mapping's feasibility through an empirical approach. Manual analysis of NVD references showed Git references enable over 86% success, while non-Git references achieve under 14%. Using these findings, we built an automated pipeline extracting 31,942 VFCs from 20,360 NVD records (8.7% of 235,341) with 87% precision, mainly from Git references. To fill gaps, we mined six external security databases, yielding 29,254 VFCs for 18,985 records (8.1%) at 88.4% precision, and GitHub repositories, adding 3,686 VFCs for 2,795 records (1.2%) at 73% precision. Combining these, we mapped 26,710 unique records (11.3% coverage) from 7,634 projects, with overlap between NVD and external databases, plus unique GitHub contributions. Despite success with Git references, 88.7% of records remain unmapped, highlighting the difficulty without Git links. This study offers insights for enhancing vulnerability datasets and guiding future automated security research.
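The coverage percentages in the abstract follow directly from the reported record counts. The short check below recomputes them; the counts are taken from the abstract, while the variable and function names are illustrative choices.

```python
TOTAL_RECORDS = 235_341  # NVD records considered, as stated in the abstract

# Mapped record counts per source, as stated in the abstract.
mapped_records = {
    "NVD references": 20_360,
    "external databases": 18_985,
    "GitHub mining": 2_795,
    "combined (unique)": 26_710,
}


def coverage_percent(mapped: int, total: int = TOTAL_RECORDS) -> float:
    """Share of all NVD records that were mapped, in percent, rounded to one decimal."""
    return round(100 * mapped / total, 1)


coverage = {source: coverage_percent(n) for source, n in mapped_records.items()}
unmapped = round(
    100 * (TOTAL_RECORDS - mapped_records["combined (unique)"]) / TOTAL_RECORDS, 1
)

for source, pct in coverage.items():
    print(f"{source:<20} {pct:>5.1f}%")
print(f"{'unmapped':<20} {unmapped:>5.1f}%")
```

The recomputed values agree with the 8.7%, 8.1%, 1.2%, 11.3%, and 88.7% figures quoted in the abstract.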
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical study mapping NVD vulnerability records to vulnerability-fixing commits (VFCs). Manual analysis indicates Git references succeed in over 86% of cases versus under 14% for non-Git. An automated pipeline from NVD extracts VFCs for 8.7% of records at 87% precision; external databases add 8.1% at 88.4%; GitHub contributes 1.2% at 73%. Combined, 11.3% coverage is achieved, leaving 88.7% unmapped and emphasizing the difficulty without Git links.
Significance. If the empirical measurements are reliable, this study quantifies the mapping challenge in vulnerability data, providing benchmarks (e.g., 11.3% coverage, differential rates by reference type) that are useful for security analytics and dataset improvement in software engineering. The multi-source approach and concrete precision figures offer a foundation for future automated methods.
major comments (2)
- The reported >86% success rate for Git references and <14% for non-Git is based on manual inspection, but details on sample size, selection process (random vs. stratified), and inter-rater reliability are not provided. This is critical because the automated pipeline's 87% precision relies on this as ground truth, and any bias in sampling could affect the overall coverage claim of 11.3%.
- In the description of the automated pipeline and its precision evaluation, the 87% precision for NVD extraction and 88.4% for external databases are stated without elaboration on how the validation samples were chosen or how false positives were identified. This makes it difficult to assess whether the precision figures and the combined 26,710 unique records are robust to annotation artifacts.
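To illustrate why the validation sample size matters for the concerns above, the sketch below computes a Wilson score interval for a sampled precision. The sample size of 100 and the count of 87 correct mappings are hypothetical values chosen for illustration; the manuscript's actual sample sizes are not reported here.

```python
import math


def wilson_interval(correct: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% confidence)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = correct / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical validation samples with the same 87% observed precision.
    for correct, n in [(87, 100), (870, 1000)]:
        low, high = wilson_interval(correct, n)
        print(f"n={n}: observed {correct / n:.1%}, interval [{low:.1%}, {high:.1%}]")
```

Under these assumed numbers, the smaller sample yields a noticeably wider interval than the larger one, which is why the referee's request for sample sizes bears directly on how much weight the precision figures can carry.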
minor comments (2)
- The abstract states extraction of 31,942 VFCs from 20,360 records but reports 26,710 unique records in the combined result; a brief clarification on overlaps between sources would improve readability.
- Consider adding a summary table in the results section comparing coverage, precision, and unique contributions from NVD, external databases, and GitHub.
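Both minor comments can be illustrated with a small script that tabulates the per-source figures from the abstract and shows how far the per-source record counts exceed the unique total. The numbers come from the abstract; the table layout and names such as `render_table` are illustrative assumptions, not the authors' presentation.

```python
# (source, mapped records, VFCs, precision in %), as reported in the abstract.
sources = [
    ("NVD references", 20_360, 31_942, 87.0),
    ("External databases", 18_985, 29_254, 88.4),
    ("GitHub mining", 2_795, 3_686, 73.0),
]
UNIQUE_RECORDS = 26_710  # combined unique records, from the abstract

record_sum = sum(records for _, records, _, _ in sources)
vfc_sum = sum(vfcs for _, _, vfcs, _ in sources)
# Surplus: how many times a record is counted again by an additional source.
record_surplus = record_sum - UNIQUE_RECORDS


def render_table(rows) -> str:
    """Render the per-source figures as a plain-text table."""
    lines = [f"{'Source':<20}{'Records':>10}{'VFCs':>10}{'Precision':>11}"]
    for name, records, vfcs, precision in rows:
        lines.append(f"{name:<20}{records:>10,}{vfcs:>10,}{precision:>10.1f}%")
    return "\n".join(lines)


print(render_table(sources))
print(f"Sum of per-source records: {record_sum:,}")
print(f"Unique records:            {UNIQUE_RECORDS:,}")
print(f"Surplus from overlap:      {record_surplus:,}")
```

The surplus counts a record once for each additional source that also maps it, so it is an upper bound on the number of records found by more than one source rather than an exact overlap count.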
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity of our empirical study. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
- Referee: The reported >86% success rate for Git references and <14% for non-Git is based on manual inspection, but details on sample size, selection process (random vs. stratified), and inter-rater reliability are not provided. This is critical because the automated pipeline's 87% precision relies on this as ground truth, and any bias in sampling could affect the overall coverage claim of 11.3%.
  Authors: We agree that the manuscript should provide these details to allow proper evaluation of the manual analysis and its use as ground truth. The current version omits them, which is an oversight. In the revision we will expand the relevant methodology subsection to report the sample size, the selection process used, and inter-rater reliability measures. This addition will directly support the validity of the 87% precision figure and the 11.3% coverage claim.
  Revision: yes
- Referee: In the description of the automated pipeline and its precision evaluation, the 87% precision for NVD extraction and 88.4% for external databases are stated without elaboration on how the validation samples were chosen or how false positives were identified. This makes it difficult to assess whether the precision figures and the combined 26,710 unique records are robust to annotation artifacts.
  Authors: We concur that additional information on the validation procedure is required. The manuscript currently lacks explicit description of sample selection and false-positive identification. We will revise the evaluation section to detail how the validation samples were drawn and the manual process used to classify false positives. These clarifications will strengthen confidence in the reported precision values and the combined mapping results.
  Revision: yes
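The inter-rater reliability measure promised in the first response is not named. One common choice for two annotators is Cohen's kappa, sketched below as an assumption about what such a report might contain; the function name `cohens_kappa` and the toy labels are illustrative only and do not come from the manuscript.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels: "vfc" means the annotator judged the commit to be the fix.
    a = ["vfc", "vfc", "not", "not", "vfc", "not"]
    b = ["vfc", "not", "not", "not", "vfc", "not"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple percentage when one label dominates the sample.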
Circularity Check
No circularity: purely empirical measurement study
This paper is an empirical measurement study that performs manual analysis of NVD references to establish success rates for Git vs. non-Git links, then applies those observations to construct and evaluate an automated extraction pipeline, supplemented by independent mining of external databases and GitHub. No equations, fitted parameters, derivations, or self-referential definitions appear; the reported figures (86%+ success, 87% precision, 11.3% coverage) are direct counts and precision measurements against sampled ground truth rather than outputs that reduce to inputs by construction. External sources provide non-overlapping contributions, and the work contains no load-bearing self-citations or imported uniqueness theorems. The derivation chain is therefore self-contained against its own empirical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: NVD references contain parsable information sufficient to locate fixing commits when they are Git-based
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Manual analysis of NVD references showed Git references enable over 86% success, while non-Git references achieve under 14%. ... automated pipeline extracting 31,942 VFCs ... 87% precision"
- IndisputableMonolith/Cost/FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We compiled a dataset of 37,441 VFCs across 26,710 NVD records, achieving an average precision of 86.1%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.