AXE: Grey-Box Exploitability Confirmation for Localized Vulnerability Reports

Amirali Sajadi; Kostadin Damevski; Preetha Chatterjee; Tu Nguyen

arxiv: 2602.14345 · v2 · pith:LGSZREDKnew · submitted 2026-02-15 · 💻 cs.CR · cs.AI



keywords vulnerability · exploitation · metadata · detection · grey-box · reports · code · cve-bench


Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof-of-concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE's generalizability through a case study on a recent real-world vulnerability not included in CVE-Bench.
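The grey-box setting described in the abstract (a CWE classification plus a vulnerable code location, turned into an exploit through planning, code exploration, and execution feedback) can be pictured with a minimal sketch. This is not AXE's implementation: the names `VulnReport`, `Attempt`, `confirm_exploitability`, the field names, the `max_rounds` parameter, and the three stub callables are all hypothetical and exist only to illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class VulnReport:
    """Minimal detection metadata: a CWE class and a code location.

    Field names are assumptions made for this sketch.
    """
    cwe_id: str      # e.g. "CWE-89"
    file_path: str   # file flagged by the detector
    line: int        # line flagged by the detector


@dataclass
class Attempt:
    plan: str        # exploit plan tried in this round
    succeeded: bool  # whether execution confirmed the exploit
    feedback: str    # execution feedback passed to the next round


def confirm_exploitability(
    report: VulnReport,
    plan: Callable[[VulnReport, str, str], str],
    explore: Callable[[VulnReport], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Run explore -> plan -> execute, feeding feedback into the next plan."""
    attempts: List[Attempt] = []
    feedback = ""
    for _ in range(max_rounds):
        context = explore(report)               # code exploration
        step = plan(report, context, feedback)  # planning, informed by feedback
        ok, feedback = execute(step)            # dynamic execution
        attempts.append(Attempt(step, ok, feedback))
        if ok:
            break
    return attempts


# Stub components standing in for the real agents (purely illustrative).
def stub_explore(report: VulnReport) -> str:
    return f"{report.file_path}:{report.line}"


def stub_plan(report: VulnReport, context: str, feedback: str) -> str:
    suffix = " (revised)" if feedback else ""
    return f"target {report.cwe_id} at {context}{suffix}"


def stub_execute(step: str) -> Tuple[bool, str]:
    if step.endswith("(revised)"):
        return True, "payload accepted"
    return False, "precondition unmet: not authenticated"
```

Keeping the three roles as separate callables mirrors the decoupling named in the abstract. The stubbed failure message echoes one of the failure categories the abstract reports (unmet execution preconditions), but its wording is invented for this example.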


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
cs.CR 2026-05 unverdicted novelty 5.0

Dual-mode benchmarks reveal frontier LLMs have high false positives and low vulnerability coverage in cybersecurity tasks while domain-specialized models reach over 50% per-family detection and 0.904 precision, indica...