Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

Bhavik Shangari; Chandra Khatri; Sunny Nehra; Vipul Dholariya; Vivek Dahiya

arxiv: 2605.23243 · v1 · pith:GM5KJ4KJnew · submitted 2026-05-22 · 💻 cs.CR · cs.AI

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

Pith reviewed 2026-05-25 04:26 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords vulnerability detection · large language models · cybersecurity benchmarks · false positives · penetration testing · vertical foundation models · web application security · domain specialization

The pith

Frontier LLMs produce 10-50% false positives in white-box vulnerability detection and cover only 4-8% of ground-truth issues in black-box web app testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates six frontier models on a dual-mode benchmark consisting of function-level vulnerability detection across languages and black-box security testing of five production-style web applications containing 118 ground-truth vulnerabilities. It shows that general models systematically over-predict issues and miss most real ones even when given external tools, while models and agents built with domain-specific methodology reach substantially higher precision and per-family detection rates. The authors trace the gap to missing end-to-end security traces and attack chains in training data and conclude that purpose-built vertical models are required rather than further scaling of general models.

Core claim

Every frontier model produces 10-50% false positive rates in white-box detection and only 4-8% ground-truth coverage in black-box testing; structured penetration-testing methodology in domain-specialized agents raises per-family detection above 50%, and a domain-specialized defense model reaches 0.904 precision and 9.7% false positive rate.
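The three quantities in this claim can be written down explicitly. The sketch below assumes the standard definitions (precision = TP / (TP + FP), false positive rate = FP / (FP + TN), coverage = found ground-truth items / all ground-truth items); the paper's exact definitions are not visible here and may differ. The names `ConfusionCounts`, `precision`, `false_positive_rate`, and `ground_truth_coverage` are hypothetical, and the example counts are made up for illustration, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary detection outcomes over a set of functions (hypothetical container)."""
    tp: int  # vulnerable and flagged
    fp: int  # safe but flagged
    tn: int  # safe and not flagged
    fn: int  # vulnerable but not flagged


def precision(c: ConfusionCounts) -> float:
    """Share of flagged items that are truly vulnerable: TP / (TP + FP)."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of safe items that were wrongly flagged: FP / (FP + TN)."""
    safe = c.fp + c.tn
    return c.fp / safe if safe else 0.0


def ground_truth_coverage(found_ids: set[str], truth_ids: set[str]) -> float:
    """Share of labeled vulnerabilities that were found at least once."""
    return len(found_ids & truth_ids) / len(truth_ids) if truth_ids else 0.0


# Illustrative numbers only (assumed, not reported in the paper).
white_box = ConfusionCounts(tp=90, fp=10, tn=90, fn=10)
truth = {f"V{i}" for i in range(118)}   # 118 labeled vulnerabilities
found = {f"V{i}" for i in range(7)}     # a run that confirms 7 of them

print(precision(white_box), false_positive_rate(white_box))
print(ground_truth_coverage(found, truth))
```

Keeping coverage separate from precision matters for reading the headline numbers: a model can flag many issues (hurting precision and FPR) and still cover few of the labeled vulnerabilities.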

What carries the argument

Dual-mode benchmark pairing white-box function-level detection with black-box testing of production web applications and 118 labeled vulnerabilities across 20+ CWE families.

If this is right

Methodology encoded in agents improves detection more than model scale alone.
Domain-specialized models can deliver higher precision than frontier models on the same tasks.
Lack of structured security traces and multi-step attack chains in training data is the main bottleneck.
Self-play security testing can generate the required failure-heavy, end-to-end traces for future training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gap persists across additional application domains, general-purpose scaling laws may not apply to security tasks.
Organizations relying on frontier models for automated security review would need substantial human oversight to handle the observed false-positive load.
Vertical models trained on self-generated attack data could reduce dependence on scarce human-labeled security datasets.

Load-bearing premise

The five chosen web applications and their 118 vulnerabilities are representative of real-world cybersecurity problems and the four testing paradigms measure model capability rather than implementation artifacts.

What would settle it

A new set of production web applications with independently verified vulnerabilities on which a frontier model without domain specialization reaches above 50% ground-truth coverage or below 10% false positive rate.

Figures

Figures reproduced from arXiv: 2605.23243 by Bhavik Shangari, Chandra Khatri, Sunny Nehra, Vipul Dholariya, Vivek Dahiya.

Original abstract

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
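The abstract reports both an overall ground-truth coverage and a per-family detection rate, and the two can diverge. A minimal sketch of the per-family computation follows, assuming each labeled vulnerability carries a single CWE family and that a family's rate is the fraction of its labeled items that were found. How the paper actually aggregates families is not stated here; the function name `per_family_detection` and the vulnerability ids are hypothetical.

```python
from collections import defaultdict


def per_family_detection(truth: dict[str, str], found_ids: set[str]) -> dict[str, float]:
    """Map each CWE family to the fraction of its labeled vulnerabilities found.

    `truth` maps a (hypothetical) vulnerability id to its CWE family label.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for vuln_id, family in truth.items():
        totals[family] += 1
        if vuln_id in found_ids:
            hits[family] += 1
    return {family: hits[family] / totals[family] for family in totals}


# Toy ground truth (assumed for illustration): five items in three families.
truth = {
    "V1": "CWE-89",   # SQL injection
    "V2": "CWE-89",
    "V3": "CWE-79",   # cross-site scripting
    "V4": "CWE-79",
    "V5": "CWE-639",  # authorization bypass through user-controlled key
}
found = {"V1", "V2", "V3"}

rates = per_family_detection(truth, found)
overall = len(found & truth.keys()) / len(truth)
print(rates, overall)
```

In this toy case one family is fully detected, one half detected, and one missed entirely, so a per-family figure above 50% does not by itself say what the overall coverage is.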

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Frontier LLMs show low coverage on this new dual-mode benchmark while specialized agents do better, but the five apps and 118 vulnerabilities need checking for how typical they are.

The main takeaway is that this paper gives concrete numbers: six frontier models achieve 4-8% ground-truth coverage in black-box web app testing and produce 10-50% false positives in white-box detection, while domain-specialized agents exceed 50% per-family detection and one defense model reaches 0.904 precision. That is the new empirical content: a dual-mode setup across C/Java/Python functions plus five production-style apps, plus the claim that methodology beats scale. The authors also plan to release the benchmark, which is useful if the data holds up. The work is straightforward about the training data gap it sees in structured traces and failure cases.

The soft spot is exactly the one in the stress-test note. The black-box results rest on five unspecified apps and 118 vulnerabilities across 20+ CWE families, with no selection criteria or validation procedure stated in the abstract. If those apps are small or skewed toward easy cases, the 4-8% coverage figure does not generalize to real targets. The white-box numbers and the agent comparisons inherit the same issue. Prompt details, model versions, and leakage controls are also thin in what is visible.

The paper is for researchers working on LLM security tooling or vertical models. It deserves a serious referee because it ships new measurements and an open benchmark plan, even if the representativeness question needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces a dual-mode benchmark (VulnLLM-R for white-box function-level vulnerability detection across C/Java/Python and black-box testing on five production-style web applications containing 118 ground-truth vulnerabilities across 20+ CWE families) to assess six frontier LLMs (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, Gemini 3 Flash) and two domain-specialized models under four testing paradigms. It reports 10-50% false-positive rates for frontier models in white-box detection, 4-8% ground-truth coverage (rising to 10-19% with tools) in black-box settings, >50% per-family detection when structured penetration-testing methodology is encoded in domain-specialized agents, and top precision (0.904) plus lowest FPR (9.7%) for a domain-specialized defense model. The authors identify missing structured security traces as a training-data bottleneck, propose self-play security testing for data generation, and conclude that methodology and vertical specialization, rather than scale, are the primary levers, making the case for purpose-built cybersecurity foundation models. The benchmark is slated for open-sourcing.

Significance. If the benchmark construction proves representative and the quantitative results are reproducible, the work supplies concrete empirical measurements showing that frontier LLMs underperform on realistic vulnerability discovery tasks and that domain-specific agents and models can materially outperform them. The explicit comparison of four testing paradigms, the identification of training-data gaps, and the constructive proposal of self-play data generation constitute a useful contribution to the emerging literature on vertical AI for security. Open-sourcing the 118-vulnerability corpus would further strengthen the paper's impact by enabling follow-on replication and extension.

major comments (2)

[§3] §3 (Benchmark Construction) and the abstract: the five production-style applications and the 118 ground-truth vulnerabilities are presented without stated selection criteria, application sizes, technology stacks, or vulnerability-discovery/validation procedure. Because the headline claims (4-8% coverage, 10-19% with tools, generalization that “frontier LLMs are not ready”) rest on these numbers being representative of real-world targets, the absence of this information is load-bearing and prevents verification that the low coverage figures are not artifacts of the chosen sample.
[§4] §4 (Experimental Setup) and results tables: the reported performance numbers (10-50% FPR, 0.904 precision, 4-8% coverage) are given without accompanying details on prompt templates, exact model versions and temperatures, statistical significance testing, inter-rater reliability for ground-truth labeling, or controls for data leakage. These omissions make the quantitative claims impossible to reproduce or compare across studies and directly affect the soundness of the conclusion that methodology, not scale, is the dominant factor.

minor comments (2)

[Abstract] Notation for model versions (e.g., “Codex~5.3”, “Gemini~3.1~Pro”) is inconsistent and should be standardized with precise release identifiers or dates.
The paper states the benchmark will be open-sourced but does not specify the repository, license, or expected release timeline; adding this information would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point-by-point below and commit to revisions that will strengthen the manuscript's transparency and reproducibility.

Referee: [§3] §3 (Benchmark Construction) and the abstract: the five production-style applications and the 118 ground-truth vulnerabilities are presented without stated selection criteria, application sizes, technology stacks, or vulnerability-discovery/validation procedure. Because the headline claims (4-8% coverage, 10-19% with tools, generalization that “frontier LLMs are not ready”) rest on these numbers being representative of real-world targets, the absence of this information is load-bearing and prevents verification that the low coverage figures are not artifacts of the chosen sample.

Authors: We agree that explicit documentation of selection criteria, application characteristics, and validation procedures is necessary to support claims of representativeness. In the revised manuscript we will add a dedicated subsection to §3 that specifies: (1) selection criteria emphasizing diversity across technology stacks (Node.js, Python/Django, Java/Spring, PHP, Ruby on Rails) and real-world production deployment; (2) application sizes (ranging from 4.2k to 48k LOC) and exact technology stacks; (3) the vulnerability discovery and validation workflow (static analysis plus manual penetration testing by two certified security engineers, followed by independent review); and (4) the decision to open-source the full 118-vulnerability corpus with accompanying metadata. These additions will allow readers to assess whether the reported 4-8% coverage figures generalize beyond the chosen sample.

revision: yes

Referee: [§4] §4 (Experimental Setup) and results tables: the reported performance numbers (10-50% FPR, 0.904 precision, 4-8% coverage) are given without accompanying details on prompt templates, exact model versions and temperatures, statistical significance testing, inter-rater reliability for ground-truth labeling, or controls for data leakage. These omissions make the quantitative claims impossible to reproduce or compare across studies and directly affect the soundness of the conclusion that methodology, not scale, is the dominant factor.

Authors: We concur that the current experimental description lacks the detail required for reproducibility and for rigorously supporting the methodology-versus-scale conclusion. The revised §4 and a new appendix will include: (1) verbatim prompt templates for all four testing paradigms; (2) exact model version strings and sampling parameters (temperature, top-p, max tokens); (3) statistical significance tests (McNemar’s test on per-family detection rates and paired t-tests on precision/FPR); (4) inter-rater reliability statistics (Cohen’s κ) for the ground-truth labeling process; and (5) explicit data-leakage controls (comparison of model training cutoffs against application release dates and vulnerability disclosure timelines). These additions will make the quantitative results verifiable and will strengthen the comparative claims between frontier and domain-specialized models.

revision: yes
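Two of the statistics named in this response are simple enough to sketch with the standard library. The block below is an illustration under stated assumptions, not the authors' analysis code: `cohens_kappa` assumes two raters labeling the same items, and `mcnemar_exact` assumes a paired design in which two systems are scored on the same vulnerabilities and only the discordant counts (found by one system but not the other) enter the exact two-sided binomial test. Both function names and all example data are hypothetical.

```python
from math import comb


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels (assumed): 1 = "real vulnerability", 0 = "not a vulnerability".
engineer_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
engineer_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(engineer_1, engineer_2))

# Toy paired comparison (assumed): 1 item found only by system A, 9 only by B.
print(mcnemar_exact(only_a=1, only_b=9))
```

With the toy labels the raters agree on 8 of 10 items against a chance agreement of 0.5, giving κ of about 0.6, and the 1-versus-9 discordant split gives p of about 0.021. With per-family sample sizes that are necessarily small for 118 vulnerabilities spread over 20+ families, an exact test of this kind is usually a safer default than the chi-square approximation.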

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential reductions

The paper reports direct empirical measurements of model performance on newly constructed white-box and black-box benchmarks (VulnLLM-R and five web apps with 118 ground-truth vulnerabilities). No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text; claims about false-positive rates, coverage percentages, and comparisons to domain-specialized agents are presented as observed outcomes rather than derived quantities. Central findings rest on explicit test protocols and data that are stated to be open-sourced, making the evaluation self-contained against external benchmarks without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted parameters, or new postulated entities; the claims rest on the design and execution of the two benchmarks and the assumption that the tested models and applications are representative.

pith-pipeline@v0.9.0 · 5851 in / 1360 out tokens · 22550 ms · 2026-05-25T04:26:24.645910+00:00 · methodology

