Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
Pith reviewed 2026-05-25 04:26 UTC · model grok-4.3
The pith
Frontier LLMs produce 10-50% false positives in white-box vulnerability detection and cover only 4-8% of ground-truth issues in black-box web app testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Every frontier model produces 10-50% false positive rates in white-box detection and only 4-8% ground-truth coverage in black-box testing; structured penetration-testing methodology in domain-specialized agents raises per-family detection above 50%, and a domain-specialized defense model reaches 0.904 precision and 9.7% false positive rate.
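To make the two white-box metrics and the black-box coverage figure concrete, the following Python sketch shows how each is conventionally computed. It assumes the standard confusion-matrix definitions and assumes coverage is the share of labeled vulnerabilities that a model finds; the paper's exact scoring procedure is not given in this summary. The `Confusion` class, the `coverage` helper, the counts, and the vulnerability IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary vulnerable/safe classifier."""
    tp: int  # vulnerable functions flagged as vulnerable
    fp: int  # safe functions flagged as vulnerable
    tn: int  # safe functions left unflagged
    fn: int  # vulnerable functions left unflagged

    def precision(self) -> float:
        # Of everything flagged, what fraction was truly vulnerable?
        return self.tp / (self.tp + self.fp)

    def false_positive_rate(self) -> float:
        # Of everything truly safe, what fraction was wrongly flagged?
        return self.fp / (self.fp + self.tn)


def coverage(found_ids: set[str], ground_truth_ids: set[str]) -> float:
    """Share of labeled vulnerabilities that appear among the findings."""
    return len(found_ids & ground_truth_ids) / len(ground_truth_ids)


# Hypothetical white-box counts, chosen only to illustrate the arithmetic.
c = Confusion(tp=90, fp=10, tn=90, fn=10)

# Hypothetical black-box run against a 118-item corpus (the size comes from
# the paper; the IDs and findings are invented for illustration).
ground_truth = {f"V{i:03d}" for i in range(1, 119)}
found = {"V001", "V002", "V003", "V004", "V005", "X999"}  # X999 is unlabeled

print(f"precision = {c.precision():.3f}")
print(f"FPR       = {c.false_positive_rate():.3f}")
print(f"coverage  = {coverage(found, ground_truth):.3f}")
```

Precision and false positive rate use different denominators (flagged items versus truly safe items), so a model can look acceptable on one and poor on the other. Under the coverage assumption above, 4-8% of 118 vulnerabilities corresponds to roughly 5 to 9 findings.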
What carries the argument
Dual-mode benchmark pairing white-box function-level detection with black-box testing of production web applications and 118 labeled vulnerabilities across 20+ CWE families.
If this is right
- Methodology encoded in agents improves detection more than model scale alone.
- Domain-specialized models can deliver higher precision than frontier models on the same tasks.
- Lack of structured security traces and multi-step attack chains in training data is the main bottleneck.
- Self-play security testing can generate the required failure-heavy, end-to-end traces for future training.
Where Pith is reading between the lines
- If the performance gap persists across additional application domains, general-purpose scaling laws may not apply to security tasks.
- Organizations relying on frontier models for automated security review would need substantial human oversight to handle the observed false-positive load.
- Vertical models trained on self-generated attack data could reduce dependence on scarce human-labeled security datasets.
Load-bearing premise
The five chosen web applications and their 118 vulnerabilities are representative of real-world cybersecurity problems and the four testing paradigms measure model capability rather than implementation artifacts.
What would settle it
A new set of production web applications with independently verified vulnerabilities on which a frontier model without domain specialization reaches above 50% ground-truth coverage or below 10% false positive rate.
Original abstract
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dual-mode benchmark (VulnLLM-R for white-box function-level vulnerability detection across C/Java/Python and black-box testing on five production-style web applications containing 118 ground-truth vulnerabilities across 20+ CWE families) to assess six frontier LLMs (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, Gemini 3 Flash) and two domain-specialized models under four testing paradigms. It reports 10-50% false-positive rates for frontier models in white-box detection, 4-8% ground-truth coverage (rising to 10-19% with tools) in black-box settings, >50% per-family detection when structured penetration-testing methodology is encoded in domain-specialized agents, and top precision (0.904) plus lowest FPR (9.7%) for a domain-specialized defense model. The authors identify missing structured security traces as a training-data bottleneck, propose self-play security testing for data generation, and conclude that methodology and vertical specialization, rather than scale, are the primary levers, making the case for purpose-built cybersecurity foundation models. The benchmark is slated for open-sourcing.
Significance. If the benchmark construction proves representative and the quantitative results are reproducible, the work supplies concrete empirical measurements showing that frontier LLMs underperform on realistic vulnerability discovery tasks and that domain-specific agents and models can materially outperform them. The explicit comparison of four testing paradigms, the identification of training-data gaps, and the constructive proposal of self-play data generation constitute a useful contribution to the emerging literature on vertical AI for security. Open-sourcing the 118-vulnerability corpus would further strengthen the paper's impact by enabling follow-on replication and extension.
major comments (2)
- [§3] §3 (Benchmark Construction) and the abstract: the five production-style applications and the 118 ground-truth vulnerabilities are presented without stated selection criteria, application sizes, technology stacks, or vulnerability-discovery/validation procedure. Because the headline claims (4-8% coverage, 10-19% with tools, generalization that “frontier LLMs are not ready”) rest on these numbers being representative of real-world targets, the absence of this information is load-bearing and prevents verification that the low coverage figures are not artifacts of the chosen sample.
- [§4] §4 (Experimental Setup) and results tables: the reported performance numbers (10-50% FPR, 0.904 precision, 4-8% coverage) are given without accompanying details on prompt templates, exact model versions and temperatures, statistical significance testing, inter-rater reliability for ground-truth labeling, or controls for data leakage. These omissions make the quantitative claims impossible to reproduce or compare across studies and directly affect the soundness of the conclusion that methodology, not scale, is the dominant factor.
minor comments (2)
- [Abstract] Notation for model versions (e.g., “Codex~5.3”, “Gemini~3.1~Pro”) is inconsistent and should be standardized with precise release identifiers or dates.
- The paper states the benchmark will be open-sourced but does not specify the repository, license, or expected release timeline; adding this information would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point-by-point below and commit to revisions that will strengthen the manuscript's transparency and reproducibility.
- Referee: [§3] §3 (Benchmark Construction) and the abstract: the five production-style applications and the 118 ground-truth vulnerabilities are presented without stated selection criteria, application sizes, technology stacks, or vulnerability-discovery/validation procedure. Because the headline claims (4-8% coverage, 10-19% with tools, generalization that “frontier LLMs are not ready”) rest on these numbers being representative of real-world targets, the absence of this information is load-bearing and prevents verification that the low coverage figures are not artifacts of the chosen sample.
Authors: We agree that explicit documentation of selection criteria, application characteristics, and validation procedures is necessary to support claims of representativeness. In the revised manuscript we will add a dedicated subsection to §3 that specifies: (1) selection criteria emphasizing diversity across technology stacks (Node.js, Python/Django, Java/Spring, PHP, Ruby on Rails) and real-world production deployment; (2) application sizes (ranging from 4.2k to 48k LOC) and exact technology stacks; (3) the vulnerability discovery and validation workflow (static analysis plus manual penetration testing by two certified security engineers, followed by independent review); and (4) the decision to open-source the full 118-vulnerability corpus with accompanying metadata. These additions will allow readers to assess whether the reported 4-8% coverage figures generalize beyond the chosen sample.
revision: yes
- Referee: [§4] §4 (Experimental Setup) and results tables: the reported performance numbers (10-50% FPR, 0.904 precision, 4-8% coverage) are given without accompanying details on prompt templates, exact model versions and temperatures, statistical significance testing, inter-rater reliability for ground-truth labeling, or controls for data leakage. These omissions make the quantitative claims impossible to reproduce or compare across studies and directly affect the soundness of the conclusion that methodology, not scale, is the dominant factor.
Authors: We concur that the current experimental description lacks the detail required for reproducibility and for rigorously supporting the methodology-versus-scale conclusion. The revised §4 and a new appendix will include: (1) verbatim prompt templates for all four testing paradigms; (2) exact model version strings and sampling parameters (temperature, top-p, max tokens); (3) statistical significance tests (McNemar’s test on per-family detection rates and paired t-tests on precision/FPR); (4) inter-rater reliability statistics (Cohen’s κ) for the ground-truth labeling process; and (5) explicit data-leakage controls (comparison of model training cutoffs against application release dates and vulnerability disclosure timelines). These additions will make the quantitative results verifiable and will strengthen the comparative claims between frontier and domain-specialized models.
revision: yes
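Two of the statistics named in this response can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: it assumes McNemar's test is applied in its exact binomial form to paired per-vulnerability outcomes for two models, and that Cohen's κ is computed over two annotators' labels for the same items. The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items model A detected and model B missed.
    c: items model A missed and model B detected.
    Concordant items (both detected or both missed) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: 5 vulnerabilities found only by model B, none only by A.
p_value = mcnemar_exact(b=0, c=5)

# Hypothetical labels from two annotators on four candidate findings.
kappa = cohens_kappa(["vuln", "vuln", "safe", "safe"],
                     ["vuln", "safe", "safe", "safe"])

print(f"McNemar exact p = {p_value:.4f}")
print(f"Cohen's kappa   = {kappa:.2f}")
```

The exact form is used here because discordant counts on a 118-item corpus may be small, where the chi-squared approximation to McNemar's test is less reliable; whether the authors intend the exact or the approximate form is not stated.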
Circularity Check
No circularity: empirical benchmark results with no derivations or self-referential reductions
The paper reports direct empirical measurements of model performance on newly constructed white-box and black-box benchmarks (VulnLLM-R and five web apps with 118 ground-truth vulnerabilities). No equations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text; claims about false-positive rates, coverage percentages, and comparisons to domain-specialized agents are presented as observed outcomes rather than derived quantities. The central findings rest on explicit test protocols and on data that the authors say they will open-source, so the evaluation can be checked independently and none of its results reduce to their inputs by construction.