A Ground-Truth-Based Evaluation of Vulnerability Detection Across Multiple Ecosystems

Martin Häusl; Maximilian Auch; Paul Mandl; Peter Mandl

arxiv: 2604.21111 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.CR


Pith reviewed 2026-05-09 23:20 UTC · model grok-4.3

classification 💻 cs.SE cs.CR

keywords vulnerability detection, ground-truth dataset, empirical evaluation, software ecosystems, dataset construction, reproducibility, detection tools comparison, security research


The pith

A curated ground-truth dataset reveals systematic differences in how vulnerability detection tools perform across software ecosystems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset that directly links vulnerabilities to specific package versions, drawn from the Open Source Vulnerabilities (OSV) database. It applies this dataset to multiple automated detection tools and services, allowing side-by-side comparison of their outputs on identical inputs. The evaluation finds systematic differences in the vulnerabilities each tool identifies. The authors also release software that rebuilds the dataset from the current database contents, enabling others to repeat or extend the work. The paper argues that clear, version-explicit dataset construction matters for trustworthy empirical comparisons in security research.

Core claim

The paper establishes that a snapshot dataset explicitly mapping vulnerabilities to concrete package versions supports direct comparisons of detection tools across ecosystems, exposing systematic variations in their results and underscoring the necessity of transparent dataset construction for reproducible security studies.

What carries the argument

The ground-truth dataset that resolves vulnerabilities to specific package versions and permits controlled tool comparisons.
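
As an illustration of what resolving a vulnerability to concrete versions can involve, here is a minimal sketch that walks OSV-style range events (`introduced`, `fixed`, `last_affected`) over a list of known releases. It is not the authors' mapping code. The function names, the sample data, and the naive dotted-integer comparator are assumptions made for this example; real ecosystems need their own version ordering, and the events are assumed to be sorted already.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Naive dotted-integer parser (assumption: no pre-release tags)."""
    return tuple(int(part) for part in version.split("."))


def affected_versions(events: List[Dict[str, str]], known: List[str]) -> List[str]:
    """Return the known versions that fall inside the ranges described by events.

    Assumes events are already sorted in ascending version order.
    """
    affected = []
    for version in sorted(known, key=parse_version):
        key = parse_version(version)
        vulnerable = False
        for event in events:
            if "introduced" in event:
                start = event["introduced"]
                # "0" means "from the first release onward".
                if start == "0" or key >= parse_version(start):
                    vulnerable = True
            elif "fixed" in event:
                # The fixed version itself is no longer affected.
                if key >= parse_version(event["fixed"]):
                    vulnerable = False
            elif "last_affected" in event:
                # The last_affected version itself is still affected.
                if key > parse_version(event["last_affected"]):
                    vulnerable = False
        if vulnerable:
            affected.append(version)
    return affected


# Hypothetical record: vulnerable before 1.2.0, and again from 2.0.0 until 2.1.1.
events = [{"introduced": "0"}, {"fixed": "1.2.0"},
          {"introduced": "2.0.0"}, {"fixed": "2.1.1"}]
known = ["1.0.0", "1.1.5", "1.2.0", "2.0.0", "2.1.0", "2.1.1"]
print(affected_versions(events, known))
```

Each choice in a routine like this (how `"0"` is read, whether a boundary version counts as affected, how versions are ordered) is a curation decision that ends up in the ground truth.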

If this is right

Different detection systems produce varying results even when tested on the same version-mapped vulnerabilities.
Transparent construction methods are required to make empirical security evaluations reproducible.
An open-source reconstruction tool allows the dataset to be regenerated from updated database contents.
Version-specific mappings make cross-ecosystem comparisons of detection performance feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar curation methods could be applied to evaluate other classes of software analysis tools beyond vulnerability detectors.
Clearer version specification practices might reduce inconsistencies when multiple tools draw from the same underlying data sources.
The reconstruction approach could support longitudinal studies that track how detection performance changes as new vulnerabilities are recorded.

Load-bearing premise

The assumption that the curated dataset accurately represents ground truth without introducing new ambiguities in version mappings or identifier schemes.

What would settle it

Re-running the full tool comparison after changing how version ranges are interpreted in the dataset and checking whether the observed systematic differences remain or vanish.
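
A sketch of such a check, under stated assumptions: two ground-truth sets are built from the same records with different range-interpretation rules, and each tool's recall is computed against both. The tool names, findings, and helper functions below are hypothetical and only show the shape of the comparison.

```python
from typing import Dict, List, Set, Tuple

Finding = Tuple[str, str, str]  # (package, version, vulnerability id)


def recall(truth: Set[Finding], reported: Set[Finding]) -> float:
    """Fraction of ground-truth findings that the tool reported."""
    return len(truth & reported) / len(truth) if truth else 0.0


def recall_shift(truth_a: Set[Finding], truth_b: Set[Finding],
                 reports: Dict[str, Set[Finding]]) -> Dict[str, float]:
    """Change in each tool's recall when moving from rule set A to rule set B."""
    return {tool: recall(truth_b, found) - recall(truth_a, found)
            for tool, found in reports.items()}


def ranking(truth: Set[Finding], reports: Dict[str, Set[Finding]]) -> List[str]:
    """Tools ordered from highest to lowest recall."""
    return sorted(reports, key=lambda tool: recall(truth, reports[tool]), reverse=True)


# Hypothetical data: the looser rule also counts 1.2.0 as affected.
strict = {("pkg", "1.0.0", "V-1"), ("pkg", "1.1.0", "V-1")}
loose = strict | {("pkg", "1.2.0", "V-1")}
reports = {
    "tool_a": {("pkg", "1.0.0", "V-1"), ("pkg", "1.1.0", "V-1"), ("pkg", "1.2.0", "V-1")},
    "tool_b": {("pkg", "1.0.0", "V-1")},
}

shift = recall_shift(strict, loose, reports)
same_order = ranking(strict, reports) == ranking(loose, reports)
print(shift, same_order)
```

If the ordering of tools and the size of the gaps survive the rule change, the differences are less likely to be curation artifacts.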

Figures

Figures reproduced from arXiv: 2604.21111 by Martin H\"ausl, Maximilian Auch, Paul Mandl, Peter Mandl.

**Figure 2.** Overview of the evaluation methodology. The input set ...

**Figure 3.** Relationship between the ground-truth set ...

**Figure 4.** Simplified architecture of vulnerability detection ecosystems. Vulner...

**Figure 5.** Simplified conceptual data flow between vulnerability identifier ...

**Figure 6.** Conceptual view of how the evaluated tools access and organize ...

**Figure 7.** Comparison of the evaluated tools with respect to mean recall and ...

**Figure 8.** Significance matrix for pairwise recall comparisons. Each cell ...

Original abstract

Automated vulnerability detection tools are widely used to identify security vulnerabilities in software dependencies. However, the evaluation of such tools remains challenging due to the heterogeneous structure of vulnerability data sources, inconsistent identifier schemes, and ambiguities in version range specifications. In this paper, we present an empirical evaluation of vulnerability detection across multiple software ecosystems using a curated ground-truth dataset derived from the Open Source Vulnerabilities (OSV) database. The dataset explicitly maps vulnerabilities to concrete package versions and enables a systematic comparison of detection results across different tools and services. Since vulnerability databases such as OSV are continuously updated, the dataset used in this study represents a snapshot of the vulnerability landscape at the time of the evaluation. To support reproducibility and future studies, we provide an open-source tool that automatically reconstructs the dataset from the current OSV database using the methodology described in this paper. Our evaluation highlights systematic differences between vulnerability detection systems and demonstrates the importance of transparent dataset construction for reproducible empirical security research.
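
The abstract names inconsistent identifier schemes as one obstacle to comparing tools. One way to handle this, sketched here as an assumption rather than the paper's method, is to map every identifier a tool reports onto a canonical record ID using the `aliases` field that OSV records carry. The record and the identifiers below are made up for illustration.

```python
from typing import Dict, Iterable, Set


def build_alias_index(records: Iterable[dict]) -> Dict[str, str]:
    """Map each record ID and each of its aliases (upper-cased) to the record ID."""
    index: Dict[str, str] = {}
    for record in records:
        canonical = record["id"]
        index[canonical.upper()] = canonical
        for alias in record.get("aliases", []):
            # Keep the first mapping if an alias appears in several records.
            index.setdefault(alias.upper(), canonical)
    return index


def normalize(reported_ids: Iterable[str], index: Dict[str, str]) -> Set[str]:
    """Replace known aliases with canonical IDs; leave unknown IDs untouched."""
    return {index.get(identifier.upper(), identifier) for identifier in reported_ids}


# Hypothetical record and tool output (identifiers are not real advisories).
records = [{"id": "GHSA-aaaa-bbbb-cccc", "aliases": ["CVE-2099-0001"]}]
index = build_alias_index(records)
reported = ["cve-2099-0001", "GHSA-aaaa-bbbb-cccc", "CVE-2099-9999"]
print(normalize(reported, index))
```

With this normalization, two tools that report the same issue under a CVE and a GHSA identifier count as agreeing, while unmatched identifiers stay visible instead of being silently dropped.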

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical evaluation of vulnerability detection tools across multiple software ecosystems. It constructs a ground-truth dataset by deriving a snapshot from the Open Source Vulnerabilities (OSV) database, explicitly mapping vulnerabilities to concrete package versions to resolve ambiguities in version ranges and identifier schemes, then uses this dataset to compare detection results across tools and services. The authors supply an open-source reconstruction tool to regenerate the dataset from the current OSV database and conclude that systematic differences exist between detection systems while underscoring the value of transparent dataset construction for reproducible research.

Significance. If the version-mapping methodology produces reliable ground truth, the work would strengthen empirical security research by supplying a reproducible, auditable benchmark dataset and tool that future studies can extend. The provision of the reconstruction tool is a clear strength for reproducibility and falsifiability of the evaluation.

major comments (2)

[Dataset construction and mapping methodology] The description of dataset curation (around the OSV mapping process): the paper acknowledges ambiguities in version range specifications yet provides no exhaustive enumeration or validation of the resolution rules (e.g., handling of open-ended ranges, pre-release tags, ecosystem-specific comparators, or identifier normalization). Without an independent audit, error analysis, or comparison against how the evaluated tools themselves parse the same ranges, observed detection discrepancies could be artifacts of the authors' curation choices rather than genuine tool behavior. This mapping is load-bearing for the central ground-truth claim.
[Evaluation results] Results and evaluation sections: the manuscript reports no quantitative metrics (precision, recall, or disagreement rates per ecosystem or tool), no error analysis of the mapping, and no sensitivity checks on how alternative resolution rules would alter the detected differences. These omissions prevent assessment of the magnitude or robustness of the claimed systematic differences.
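
For reference, the per-ecosystem metrics requested in the second comment could be computed along the following lines. This is a minimal sketch: the finding tuples, ecosystem labels, and function names are assumptions for illustration, not values or code from the manuscript.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Finding = Tuple[str, str, str, str]  # (ecosystem, package, version, vulnerability id)


def precision_recall(truth: Set[Finding], reported: Set[Finding]) -> Tuple[float, float]:
    """Precision and recall of a tool's findings against the ground truth."""
    true_positives = len(truth & reported)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def by_ecosystem(findings: Set[Finding]) -> Dict[str, Set[Finding]]:
    """Group findings by their ecosystem label."""
    groups: Dict[str, Set[Finding]] = defaultdict(set)
    for finding in findings:
        groups[finding[0]].add(finding)
    return groups


def per_ecosystem_scores(truth: Set[Finding],
                         reported: Set[Finding]) -> Dict[str, Tuple[float, float]]:
    """Precision and recall for every ecosystem present in the ground truth."""
    truth_groups, reported_groups = by_ecosystem(truth), by_ecosystem(reported)
    return {eco: precision_recall(truth_groups[eco], reported_groups.get(eco, set()))
            for eco in truth_groups}


# Hypothetical ground truth and tool output.
truth = {("npm", "a", "1.0.0", "V-1"), ("npm", "a", "1.0.1", "V-1"),
         ("PyPI", "b", "2.0", "V-2")}
reported = {("npm", "a", "1.0.0", "V-1"), ("npm", "a", "2.0.0", "V-9")}
scores = per_ecosystem_scores(truth, reported)
print(scores)
```

Reporting these numbers per ecosystem, rather than pooled, shows whether a tool's weakness is concentrated in one package manager.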

minor comments (2)

[Abstract] The abstract states that the dataset 'represents a snapshot' but does not give the exact date or list the precise ecosystems and package managers covered.
[Notation and definitions] Notation for version identifiers and range operators is introduced without a dedicated table or glossary, which could hinder readers attempting to replicate the mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.


Referee: [Dataset construction and mapping methodology] The description of dataset curation (around the OSV mapping process): the paper acknowledges ambiguities in version range specifications yet provides no exhaustive enumeration or validation of the resolution rules (e.g., handling of open-ended ranges, pre-release tags, ecosystem-specific comparators, or identifier normalization). Without an independent audit, error analysis, or comparison against how the evaluated tools themselves parse the same ranges, observed detection discrepancies could be artifacts of the authors' curation choices rather than genuine tool behavior. This mapping is load-bearing for the central ground-truth claim.

Authors: We agree that the mapping methodology requires more explicit documentation to support the ground-truth claim. In the revised manuscript we will add a dedicated subsection that enumerates all resolution rules for version ranges, open-ended ranges, pre-release tags, ecosystem-specific comparators, and identifier normalization, together with concrete examples. We will also include a validation subsection reporting the results of a manual audit on a random sample of mappings and a comparison of our rules against the parsing behavior of the evaluated tools. The open-source reconstruction tool already permits independent reproduction and auditing of the dataset; we will emphasize this point and invite external verification.

revision: yes

Referee: [Evaluation results] Results and evaluation sections: the manuscript reports no quantitative metrics (precision, recall, or disagreement rates per ecosystem or tool), no error analysis of the mapping, and no sensitivity checks on how alternative resolution rules would alter the detected differences. These omissions prevent assessment of the magnitude or robustness of the claimed systematic differences.

Authors: We accept that the current results section lacks the quantitative detail needed to evaluate the scale and robustness of the observed differences. We will expand the evaluation section to report per-ecosystem and per-tool detection counts, disagreement rates, and derived metrics that quantify systematic differences. We will add an error analysis of the mapping process and a sensitivity analysis that re-runs the comparison under plausible alternative resolution rules. These additions will allow readers to assess both the magnitude of the differences and their sensitivity to curation choices.

revision: yes

Circularity Check

0 steps flagged

Empirical evaluation against external OSV database exhibits no circularity


The paper performs an empirical comparison of vulnerability detection tools using a curated snapshot dataset explicitly derived from the public OSV database. No mathematical derivations, parameter fitting, predictions of fitted quantities, or load-bearing self-citations appear in the provided text. The methodology for mapping vulnerabilities to concrete versions is described and accompanied by a reconstruction tool, rendering the evaluation reproducible against an independent external benchmark. This satisfies the criteria for a self-contained empirical study with no reduction of claims to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating OSV as accurate ground truth and on the correctness of the authors' mapping methodology for versions and identifiers, neither of which is detailed or independently validated in the abstract.

axioms (1)

domain assumption: the OSV database provides accurate and unambiguous ground-truth mappings of vulnerabilities to concrete package versions.
Invoked as the basis for the curated dataset and all subsequent comparisons.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1] MITRE Corporation, “Common Vulnerabilities and Exposures (CVE),” https://www.cve.org, accessed: 2026-01

[2] National Institute of Standards and Technology, “National Vulnerability Database (NVD),” https://nvd.nist.gov, accessed: 2026-01

[3] “National Institute of Standards and Technology: Official Common Platform Enumeration (CPE) Dictionary,” https://nvd.nist.gov/products/cpe, accessed: 2026-01

[4] GitHub, Inc., “GitHub Advisory Database,” https://github.com/advisories, accessed: 2026-01

[5] Google Open Source Security Team, “Open Source Vulnerabilities (OSV),” https://osv.dev, accessed: 2026-01

[6] Synopsys, Inc., “Black Duck Software Composition Analysis,” https://www.synopsys.com/software-integrity/security-testing/software-composition-analysis.html, accessed: 2026-01

[7] GitHub, Inc., “GitHub Dependabot,” https://docs.github.com/en/code-security/dependabot, accessed: 2026-01

[8] OWASP Foundation, “OWASP Dependency-Track,” https://dependencytrack.org, accessed: 2026-01

[9] FOSSA, Inc., “FOSSA Open Source Risk Management,” https://fossa.com, accessed: 2026-01

[10] Anchore, Inc., “Grype: A Vulnerability Scanner for Container Images and Filesystems,” https://github.com/anchore/grype, accessed: 2026-01

[11] Mend.io, “Mend Open Source Security (formerly WhiteSource),” https://www.mend.io, accessed: 2026-01

[12] Sonatype, Inc., “Sonatype OSS Index,” https://ossindex.sonatype.org, accessed: 2026-01

[13] Google, “Osv-scanner,” https://google.github.io/osv-scanner/, accessed: 2026-03-16

[14] Snyk Ltd., “Snyk Vulnerability Database,” https://security.snyk.io, accessed: 2026-01

[15] Aqua Security, “Trivy vulnerability scanner,” 2024. [Online]. Available: https://github.com/aquasecurity/trivy

[16] I. Pashchenko, H. Plate, S. E. Ponta, A. Sabetta, and F. Massacci, “Vulnerable open source dependencies: Counting those that matter,” in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2018, pp. 42:1–42:10

[17] S. E. Ponta, H. Plate, and A. Sabetta, “Detection, assessment and mitigation of vulnerabilities in open source dependencies,” Empirical Software Engineering, 2020

[18] A. Zerouali, T. Mens, A. Decan, and C. De Roover, “On the impact of security vulnerabilities in the npm and RubyGems dependency networks,” Empirical Software Engineering, vol. 27, no. 5, 2022

[19] A. M. Mir, M. Keshani, and S. Proksch, “On the effect of transitivity and granularity on vulnerability propagation in the Maven ecosystem,” in SANER, 2023

[20] N. Imtiaz, S. Thorn, and L. Williams, “A comparative study of vulnerability reporting by software composition analysis tools,” in Proceedings of the ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2021, pp. 1–11

[21] L. Zhao, S. Chen, Z. Xu, C. Liu, L. Zhang, J. Wu, J. Sun, and Y. Liu, “Software composition analysis for vulnerability detection: An empirical study on Java projects,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023, pp. 960–972

[22] C. Shu et al., “Tool or toy: Are SCA tools ready for challenging scenarios?” Computers & Security, 2025

[23] S. Wu et al., “Identifying affected libraries and their ecosystems for open source software vulnerabilities,” in ICSE, 2024

[24] J. Dietrich, S. Rasheed, A. Jordan, and T. White, “On the security blind spots of software composition analysis,” in SCORED, 2024

[25] M. Alfadel, D. E. Costa, and E. Shihab, “Empirical analysis of security vulnerabilities in Python packages,” Empirical Software Engineering, 2023

[26] A. Lotfi, C. Katsis, and E. Bertino, “Automated vulnerability validation and verification: A large language model approach,” arXiv preprint arXiv:2509.24037, 2025

[27] S. R. Kathi, “AI-assisted dependency vulnerability resolution in large-scale enterprise systems,” International Research Journal of Advanced Engineering and Technology, 2025.

Appendix A: Temporal Comparison of Two Ground-Truth Snapshots and Their Tool Evaluations

To examine the temporal stability of the underlying vulnerability data and its impact on tool asses...
