A Ground-Truth-Based Evaluation of Vulnerability Detection Across Multiple Ecosystems
Pith reviewed 2026-05-09 23:20 UTC · model grok-4.3
The pith
A curated ground-truth dataset reveals systematic differences in how vulnerability detection tools perform across software ecosystems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a snapshot dataset explicitly mapping vulnerabilities to concrete package versions supports direct comparisons of detection tools across ecosystems, exposing systematic variations in their results and underscoring the necessity of transparent dataset construction for reproducible security studies.
What carries the argument
The ground-truth dataset that resolves vulnerabilities to specific package versions and permits controlled tool comparisons.
If this is right
- Different detection systems produce varying results even when tested on the same version-mapped vulnerabilities.
- Transparent construction methods are required to make empirical security evaluations reproducible.
- An open-source reconstruction tool allows the dataset to be regenerated from updated database contents.
- Version-specific mappings make cross-ecosystem comparisons of detection performance feasible.
Where Pith is reading between the lines
- Similar curation methods could be applied to evaluate other classes of software analysis tools beyond vulnerability detectors.
- Clearer version specification practices might reduce inconsistencies when multiple tools draw from the same underlying data sources.
- The reconstruction approach could support longitudinal studies that track how detection performance changes as new vulnerabilities are recorded.
Load-bearing premise
The assumption that the curated dataset accurately represents ground truth without introducing new ambiguities in version mappings or identifier schemes.
What would settle it
Re-running the full tool comparison after changing how version ranges are interpreted in the dataset and checking whether the observed systematic differences remain or vanish.
Figures
read the original abstract
Automated vulnerability detection tools are widely used to identify security vulnerabilities in software dependencies. However, the evaluation of such tools remains challenging due to the heterogeneous structure of vulnerability data sources, inconsistent identifier schemes, and ambiguities in version range specifications. In this paper, we present an empirical evaluation of vulnerability detection across multiple software ecosystems using a curated ground-truth dataset derived from the Open Source Vulnerabilities (OSV) database. The dataset explicitly maps vulnerabilities to concrete package versions and enables a systematic comparison of detection results across different tools and services. Since vulnerability databases such as OSV are continuously updated, the dataset used in this study represents a snapshot of the vulnerability landscape at the time of the evaluation. To support reproducibility and future studies, we provide an open-source tool that automatically reconstructs the dataset from the current OSV database using the methodology described in this paper. Our evaluation highlights systematic differences between vulnerability detection systems and demonstrates the importance of transparent dataset construction for reproducible empirical security research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical evaluation of vulnerability detection tools across multiple software ecosystems. It constructs a ground-truth dataset by deriving a snapshot from the Open Source Vulnerabilities (OSV) database, explicitly mapping vulnerabilities to concrete package versions to resolve ambiguities in version ranges and identifier schemes, then uses this dataset to compare detection results across tools and services. The authors supply an open-source reconstruction tool to regenerate the dataset from the current OSV database and conclude that systematic differences exist between detection systems while underscoring the value of transparent dataset construction for reproducible research.
Significance. If the version-mapping methodology produces reliable ground truth, the work would strengthen empirical security research by supplying a reproducible, auditable benchmark dataset and tool that future studies can extend. The provision of the reconstruction tool is a clear strength for reproducibility and falsifiability of the evaluation.
major comments (2)
- [Dataset construction and mapping methodology] The description of dataset curation (around the OSV mapping process): the paper acknowledges ambiguities in version range specifications yet provides no exhaustive enumeration or validation of the resolution rules (e.g., handling of open-ended ranges, pre-release tags, ecosystem-specific comparators, or identifier normalization). Without an independent audit, error analysis, or comparison against how the evaluated tools themselves parse the same ranges, observed detection discrepancies could be artifacts of the authors' curation choices rather than genuine tool behavior. This mapping is load-bearing for the central ground-truth claim.
- [Evaluation results] Results and evaluation sections: the manuscript reports no quantitative metrics (precision, recall, or disagreement rates per ecosystem or tool), no error analysis of the mapping, and no sensitivity checks on how alternative resolution rules would alter the detected differences. These omissions prevent assessment of the magnitude or robustness of the claimed systematic differences.
minor comments (2)
- [Abstract] The abstract states that the dataset 'represents a snapshot' but does not give the exact date or list the precise ecosystems and package managers covered.
- [Notation and definitions] Notation for version identifiers and range operators is introduced without a dedicated table or glossary, which could hinder readers attempting to replicate the mapping.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Dataset construction and mapping methodology] The description of dataset curation (around the OSV mapping process): the paper acknowledges ambiguities in version range specifications yet provides no exhaustive enumeration or validation of the resolution rules (e.g., handling of open-ended ranges, pre-release tags, ecosystem-specific comparators, or identifier normalization). Without an independent audit, error analysis, or comparison against how the evaluated tools themselves parse the same ranges, observed detection discrepancies could be artifacts of the authors' curation choices rather than genuine tool behavior. This mapping is load-bearing for the central ground-truth claim.
Authors: We agree that the mapping methodology requires more explicit documentation to support the ground-truth claim. In the revised manuscript we will add a dedicated subsection that enumerates all resolution rules for version ranges, open-ended ranges, pre-release tags, ecosystem-specific comparators, and identifier normalization, together with concrete examples. We will also include a validation subsection reporting the results of a manual audit on a random sample of mappings and a comparison of our rules against the parsing behavior of the evaluated tools. The open-source reconstruction tool already permits independent reproduction and auditing of the dataset; we will emphasize this point and invite external verification. revision: yes
-
Referee: [Evaluation results] Results and evaluation sections: the manuscript reports no quantitative metrics (precision, recall, or disagreement rates per ecosystem or tool), no error analysis of the mapping, and no sensitivity checks on how alternative resolution rules would alter the detected differences. These omissions prevent assessment of the magnitude or robustness of the claimed systematic differences.
Authors: We accept that the current results section lacks the quantitative detail needed to evaluate the scale and robustness of the observed differences. We will expand the evaluation section to report per-ecosystem and per-tool detection counts, disagreement rates, and derived metrics that quantify systematic differences. We will add an error analysis of the mapping process and a sensitivity analysis that re-runs the comparison under plausible alternative resolution rules. These additions will allow readers to assess both the magnitude of the differences and their sensitivity to curation choices. revision: yes
Circularity Check
Empirical evaluation against external OSV database exhibits no circularity
full rationale
The paper performs an empirical comparison of vulnerability detection tools using a curated snapshot dataset explicitly derived from the public OSV database. No mathematical derivations, parameter fitting, predictions of fitted quantities, or load-bearing self-citations appear in the provided text. The methodology for mapping vulnerabilities to concrete versions is described and accompanied by a reconstruction tool, rendering the evaluation reproducible against an independent external benchmark. This satisfies the criteria for a self-contained empirical study with no reduction of claims to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption OSV database provides accurate and unambiguous ground-truth mappings of vulnerabilities to concrete package versions
Reference graph
Works this paper leans on
-
[1]
Common Vulnerabilities and Exposures (CVE),
MITRE Corporation, “Common Vulnerabilities and Exposures (CVE),” https://www.cve.org, accessed: 2026-01
work page 2026
-
[2]
National Vulnerability Database (NVD),
National Institute of Standards and Technology, “National Vulnerability Database (NVD),” https://nvd.nist.gov, accessed: 2026-01
work page 2026
-
[3]
“National Institute of Standards and Technology: Official Common Plat- form Enumeration (CPE) Dictionary,” https://nvd.nist.gov/products/cpe, accessed: 2026-01
work page 2026
-
[4]
GitHub, Inc., “GitHub Advisory Database,” https : / / github.com / advisories, accessed: 2026-01
work page 2026
-
[5]
Open Source Vulnerabilities (OSV),
Google Open Source Security Team, “Open Source Vulnerabilities (OSV),” https://osv.dev, accessed: 2026-01
work page 2026
-
[6]
Black Duck Software Composition Analysis,
Synopsys, Inc., “Black Duck Software Composition Analysis,” https: / / www.synopsys.com / software - integrity / security - testing / software - composition-analysis.html, accessed: 2026-01
work page 2026
-
[7]
GitHub, Inc., “GitHub Dependabot,” https://docs.github.com/en/code- security/dependabot, accessed: 2026-01
work page 2026
-
[8]
OW ASP Foundation, “OW ASP Dependency-Track,” https : //dependencytrack.org, accessed: 2026-01
work page 2026
-
[9]
FOSSA Open Source Risk Management,
FOSSA, Inc., “FOSSA Open Source Risk Management,” https : / / fossa.com, accessed: 2026-01
work page 2026
-
[10]
Grype: A Vulnerability Scanner for Container Images and Filesystems,
Anchore, Inc., “Grype: A Vulnerability Scanner for Container Images and Filesystems,” https://github.com/anchore/grype, accessed: 2026-01
work page 2026
-
[11]
Mend Open Source Security (formerly WhiteSource),
Mend.io, “Mend Open Source Security (formerly WhiteSource),” https: //www.mend.io, accessed: 2026-01
work page 2026
-
[12]
Sonatype, Inc., “Sonatype OSS Index,” https://ossindex.sonatype.org, accessed: 2026-01
work page 2026
-
[13]
Google, “Osv-scanner,” https://google.github.io/osv-scanner/, accessed: 2026-03-16
work page 2026
-
[14]
Snyk Ltd., “Snyk Vulnerability Database,” https://security.snyk.io, ac- cessed: 2026-01
work page 2026
-
[15]
Aqua Security, “Trivy vulnerability scanner,” 2024. [Online]. Available: https://github.com/aquasecurity/trivy
work page 2024
-
[16]
Vulnerable open source dependencies: Counting those that matter,
I. Pashchenko, H. Plate, S. E. Ponta, A. Sabetta, and F. Massacci, “Vulnerable open source dependencies: Counting those that matter,” inProceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2018, pp. 42:1–42:10
work page 2018
-
[17]
Detection, assessment and mitigation of vulnerabilities in open source dependencies,
S. E. Ponta, H. Plate, and A. Sabetta, “Detection, assessment and mitigation of vulnerabilities in open source dependencies,”Empirical Software Engineering, 2020
work page 2020
-
[18]
On the impact of security vulnerabilities in the npm and rubygems dependency networks,
A. Zerouali, T. Mens, A. Decan, and C. De Roover, “On the impact of security vulnerabilities in the npm and rubygems dependency networks,” Empirical Software Engineering, vol. 27, no. 5, 2022
work page 2022
-
[19]
On the effect of transitivity and granularity on vulnerability propagation in the maven ecosystem,
A. M. Mir, M. Keshani, and S. Proksch, “On the effect of transitivity and granularity on vulnerability propagation in the maven ecosystem,” inSANER, 2023
work page 2023
-
[20]
A comparative study of vulnera- bility reporting by software composition analysis tools,
N. Imtiaz, S. Thorn, and L. Williams, “A comparative study of vulnera- bility reporting by software composition analysis tools,” inProceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, pp. 1–11
work page 2021
-
[21]
Software composition analysis for vulnerability detection: An empirical study on java projects,
L. Zhao, S. Chen, Z. Xu, C. Liu, L. Zhang, J. Wu, J. Sun, and Y . Liu, “Software composition analysis for vulnerability detection: An empirical study on java projects,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023, pp. 960–972
work page 2023
-
[22]
Tool or toy: Are sca tools ready for challenging scenarios?
C. Shuet al., “Tool or toy: Are sca tools ready for challenging scenarios?”Computers & Security, 2025
work page 2025
-
[23]
Identifying affected libraries and their ecosystems for open source software vulnerabilities,
S. Wuet al., “Identifying affected libraries and their ecosystems for open source software vulnerabilities,” inICSE, 2024
work page 2024
-
[24]
On the security blind spots of software composition analysis,
J. Dietrich, S. Rasheed, A. Jordan, and T. White, “On the security blind spots of software composition analysis,” inSCORED, 2024
work page 2024
-
[25]
Empirical analysis of security vulnerabilities in python packages,
M. Alfadel, D. E. Costa, and E. Shihab, “Empirical analysis of security vulnerabilities in python packages,”Empirical Software Engineering, 2023
work page 2023
-
[26]
Automated vulnerability validation and verification: A large language model approach,
A. Lotfi, C. Katsis, and E. Bertino, “Automated vulnerability validation and verification: A large language model approach,”arXiv preprint arXiv:2509.24037, 2025
-
[27]
Ai-assisted dependency vulnerability resolution in large- scale enterprise systems,
S. R. Kathi, “Ai-assisted dependency vulnerability resolution in large- scale enterprise systems,”International Research Journal of Advanced Engineering and Technology, 2025. APPENDIXA TEMPORALCOMPARISON OFTWOGROUND-TRUTH SNAPSHOTS ANDTHEIRTOOLEVALUATIONS To examine the temporal stability of the underlying vul- nerability data and its impact on tool asses...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.