Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Hakan T. Otal; Joseph M. Escobar; Michael H. Conaway; Tyler H. Merves; Unal Tatar

arXiv: 2604.17159 · v1 · submitted 2026-04-18 · cs.CR · cs.AI · cs.CL

Pith reviewed 2026-05-10 05:58 UTC · model grok-4.3

keywords: LLM agents, offensive cybersecurity, CTF benchmarking, prompt engineering, model evaluation, Kali Linux, cybersecurity tools, agent frameworks

The pith

Environment tooling and model selection drive LLM performance on offensive cyber tasks more than prompt engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks ten frontier large language models across all two hundred challenges in the NYU CTF Bench using an extended multi-agent framework. It runs a controlled study that varies the operating environment, prompt strategies, and model assignments while keeping other factors fixed. Results show a Kali Linux setup with over one hundred pre-installed tools raises solve rates by 9.5 percentage points over a basic Ubuntu environment, and that prompt interventions often reduce performance once tooling is adequate. Top models reach 59 percent and 52 percent solve rates, with cost efficiency varying sharply. These patterns matter because they point to concrete levers for building more capable LLM agents in cybersecurity.

Core claim

Through systematic testing of ten models on two hundred challenges, the work finds that a custom Kali Linux environment yields a 9.5 percentage-point gain over Ubuntu, Claude 4.5 Opus reaches the highest solve rate at 59 percent, and prompt engineering methods show diminishing or negative returns in well-equipped settings. Same-model planner-executor pairings outperform asymmetric or mixed-tier configurations, while overall results reflect both model reasoning and compatibility with agent tooling and APIs.

What carries the argument

The extended D-CIPHER multi-agent framework with multi-provider support, a custom Kali Linux environment containing over 100 penetration testing tools, and runtime tool-discovery agents, evaluated via a factorial study on the NYU CTF Bench.
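A factorial study of this kind can be pictured as the cross product of its factors. The sketch below enumerates such a grid under stated assumptions: the level names (`ubuntu`, `kali`, `baseline`, `auto_prompt`, `category_tips`) and the helper `factorial_conditions` are illustrative and are not taken from the paper, which also varies model assignments.

```python
from itertools import product

# Hypothetical factor levels; the paper's exact condition names are not given here.
ENVIRONMENTS = ["ubuntu", "kali"]
PROMPT_STRATEGIES = ["baseline", "auto_prompt", "category_tips"]


def factorial_conditions(environments, prompt_strategies):
    """Return one dict per experimental cell: every environment paired with every prompt strategy."""
    return [
        {"environment": env, "prompt": prompt}
        for env, prompt in product(environments, prompt_strategies)
    ]


conditions = factorial_conditions(ENVIRONMENTS, PROMPT_STRATEGIES)
for cond in conditions:
    print(cond)
```

Holding the model fixed while walking this grid is what lets an environment effect be read separately from a prompting effect.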

If this is right

Equipping agents with a Kali Linux environment containing over 100 tools raises solve rates by 9.5 percentage points.
Claude 4.5 Opus achieves the highest solve rate at 59 percent, followed by Gemini 3 Pro at 52 percent.
Prompt engineering interventions such as auto-prompting degrade performance in already well-equipped environments.
Coherent same-model configurations for planning and execution outperform asymmetric planner-executor assignments.
Gemini 3 Flash delivers the best cost efficiency at roughly $0.05 per solve.
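The cost-efficiency figure can be related to the axes of Figure 4 (average cost per challenge and solve rate). The sketch below assumes cost per solve is simply average cost per challenge divided by solve rate; the paper's exact accounting is not stated here, and the function name and input numbers are illustrative, not results from the paper.

```python
def cost_per_solve(avg_cost_per_challenge_usd, solve_rate):
    """Convert average cost per attempted challenge into cost per solved challenge."""
    if not 0.0 < solve_rate <= 1.0:
        raise ValueError("solve_rate must be in (0, 1]")
    return avg_cost_per_challenge_usd / solve_rate


# Illustrative numbers only, not taken from the paper.
example = cost_per_solve(avg_cost_per_challenge_usd=0.30, solve_rate=0.50)
print(f"${example:.2f} per solve")
```

Under this definition a cheap model with a moderate solve rate can beat an expensive model with a high one, which is the tradeoff the cost figures describe.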

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Builders of LLM cyber agents should allocate resources first to environment compatibility and model selection rather than broad prompt optimization.
Performance gaps between models likely include differences in API integration quality as well as raw reasoning strength.
Testing the same setups on live or non-CTF networks could show whether the observed environment gains transfer outside controlled benchmarks.

Load-bearing premise

The NYU CTF Bench challenges and extended D-CIPHER framework give a fair, unbiased measure of real-world offensive cyber capability without strong influence from API quirks or tool compatibility differences across models.

What would settle it

Re-running the full factorial study on a new collection of two hundred challenges or with a different tool-rich environment that produces no 9.5-point gain from the Kali setup would indicate the environment advantage is not robust.
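One way to probe whether a paired gain like this is robust, sketched here under assumptions, is an exact McNemar test on the challenges where the two environments disagree. The counts below are hypothetical: if the 9.5-point gain were measured on a single pass over all 200 challenges it would correspond to a net difference of 19 solves, but the split into Kali-only and Ubuntu-only solves is not given in this summary. The function name `mcnemar_exact` is illustrative.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: challenges solved only under condition A.
    c: challenges solved only under condition B.
    Under the null hypothesis each discordant challenge is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts with a net difference of 19 challenges.
kali_only, ubuntu_only = 30, 11
p_value = mcnemar_exact(kali_only, ubuntu_only)
print(f"exact McNemar p-value: {p_value:.4f}")
```

The test ignores challenges that both or neither environment solved, so it asks only whether the disagreements lean one way more than chance would explain.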

Figures

Figures reproduced from arXiv: 2604.17159 by Hakan T. Otal, Joseph M. Escobar, Michael H. Conaway, Tyler H. Merves, Unal Tatar.

**Figure 2.** Solve rates by configuration and challenge category (RQ1+RQ2).

**Figure 3.** Per-category solve rates across all ten models. Darker cells indicate...

**Figure 4.** Cost–performance tradeoff: average cost per challenge vs. solve rate.

Original abstract

We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. Building on the D-CIPHER multi-agent framework, we extend it with multi-provider backend support, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents. Through a controlled factorial study, we find that the Kali Linux environment yields a +9.5 percentage-point improvement over Ubuntu, while auto-prompting and category-specific tips often degrade performance in well-equipped environments. Among models, Claude 4.5 Opus achieves the highest solve rate (59%), followed by Gemini 3 Pro (52%), with Gemini 3 Flash offering the best cost-efficiency at $0.05 per solve. Asymmetric planner/executor model assignments provide no meaningful benefit while coherent same-model configurations consistently outperform mixed-tier pairings. Our results indicate that environment tooling and model selection emerge as the strongest drivers of performance, whereas prompt engineering interventions show diminishing or negative returns in well-equipped environments. Reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper delivers the largest public cross-model benchmark on frontier LLMs for the full NYU CTF offensive challenges, with clear numbers on environment and model effects, but the driver claims rest on unseparated compatibility factors.

This paper's main value is running ten frontier models across all 200 NYU CTF challenges in a controlled factorial setup. The authors extended the D-CIPHER framework with multi-provider backends, a Kali Linux image packed with over 100 tools, and runtime tool-discovery agents. The results give concrete solve rates (Claude 4.5 Opus at 59 percent, Gemini 3 Pro at 52 percent) and show the Kali environment adding 9.5 percentage points over Ubuntu, while extra prompting often reduces performance in well-equipped setups. Same-model configurations beat mixed-tier ones, and the paper includes practical cost figures such as Gemini 3 Flash at $0.05 per solve. The authors note upfront that the numbers reflect both reasoning and tool/API fit, which keeps the claims grounded in what was actually measured.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks 10 frontier LLMs from 7 providers on all 200 challenges of the NYU CTF Bench using an extended D-CIPHER multi-agent framework. It adds multi-provider support, a Kali Linux environment with >100 pre-installed tools, and runtime tool-discovery agents. A controlled factorial design reports that the Kali environment yields a +9.5 percentage-point solve-rate improvement over Ubuntu, auto-prompting and category tips often degrade performance in equipped settings, Claude 4.5 Opus reaches the highest solve rate (59%), Gemini 3 Pro is second (52%), and Gemini 3 Flash is most cost-efficient. Same-model configurations outperform mixed-tier pairings; the authors conclude that environment tooling and model selection are the dominant drivers while prompt-engineering interventions show diminishing or negative returns. Performance is stated to reflect both reasoning ability and compatibility with tooling/API integration.

Significance. If the attribution of performance drivers holds after addressing measurement gaps, the work supplies the largest-scale controlled empirical data yet on LLM agents for offensive cyber tasks. Strengths include exhaustive use of the 200-challenge benchmark, a factorial design isolating environment and prompting factors, and explicit multi-provider tooling extensions. These elements enable direct comparisons across models and setups that prior smaller-scale studies lack.

major comments (2)

[Abstract / Results] Abstract and Results sections: the reported +9.5 pp Kali improvement and model solve rates (59%, 52%) are presented without error bars, confidence intervals, or any statistical significance tests. Because the central claim attributes performance differences to environment and model selection rather than noise or compatibility artifacts, the absence of these quantifications leaves the magnitude and reliability of the deltas unassessable.
[Methods / Results] Methods and Results: the paper states that 'reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration' yet provides no per-model metrics for tool-call parsing success rate, environment setup failures, or API error frequency. Without these, the factorial comparisons (Kali vs. Ubuntu, same-model vs. mixed-tier) cannot isolate intrinsic capability from integration artifacts, directly undermining the claim that environment tooling is the strongest driver and that prompt engineering shows diminishing returns.
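The first comment can be made concrete with a small sketch of a 95% Wilson score interval for a solve rate. It assumes each of the 200 challenges was attempted once per model, so that 59% corresponds to 118 solves and 52% to 104; these counts are an inference from the reported percentages, not figures the paper states, and the function name `wilson_interval` is illustrative.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumed counts: 59% and 52% of 200 challenges, one attempt each.
for label, solved in [("59% solve rate", 118), ("52% solve rate", 104)]:
    low, high = wilson_interval(solved, 200)
    print(f"{label}: [{low:.3f}, {high:.3f}]")
```

At 200 trials an interval of this kind spans roughly seven points on either side of the estimate, which gives a sense of scale for the differences under discussion.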

minor comments (2)

[Methods] The scoring procedure for 'solve rate' (binary success per challenge, partial credit, or multi-run averaging) is not detailed; a brief methods paragraph would clarify reproducibility.
[Figures / Tables] Figure captions and tables should explicitly state the number of independent runs per condition and whether the same random seeds were used across models.
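The scoring question in the first minor comment matters because different conventions give different numbers from the same runs. The sketch below contrasts two common conventions, solved-in-any-run and mean-across-runs, on made-up data. The function names and the `runs` structure are illustrative, and which convention the paper uses is exactly what the referee says is unspecified.

```python
def solve_rate_any(runs):
    """Fraction of challenges solved in at least one run."""
    return sum(any(outcomes) for outcomes in runs.values()) / len(runs)


def solve_rate_mean(runs):
    """Average per-challenge success frequency across runs."""
    return sum(sum(outcomes) / len(outcomes) for outcomes in runs.values()) / len(runs)


# Made-up outcomes: challenge id -> success flag for each independent run.
runs = {
    "c1": [True, False],
    "c2": [False, False],
    "c3": [True, True],
    "c4": [False, True],
}
print(solve_rate_any(runs), solve_rate_mean(runs))
```

On this toy data the any-run rate is 0.75 and the mean rate is 0.5, so a reported percentage is only comparable across papers once the convention and run count are known.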

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract / Results] Abstract and Results sections: the reported +9.5 pp Kali improvement and model solve rates (59%, 52%) are presented without error bars, confidence intervals, or any statistical significance tests. Because the central claim attributes performance differences to environment and model selection rather than noise or compatibility artifacts, the absence of these quantifications leaves the magnitude and reliability of the deltas unassessable.

Authors: We agree that the absence of error bars and statistical tests limits the ability to assess the reliability of the reported differences. In the revised manuscript we will add 95% binomial confidence intervals for all reported solve rates and apply appropriate statistical tests (chi-squared tests for proportions and McNemar’s test for paired comparisons) to the factorial results. These additions will appear in the Results section and be referenced in the abstract.

Revision: yes

Referee: [Methods / Results] Methods and Results: the paper states that 'reported performance reflects both model reasoning ability and compatibility with agent tooling and API integration' yet provides no per-model metrics for tool-call parsing success rate, environment setup failures, or API error frequency. Without these, the factorial comparisons (Kali vs. Ubuntu, same-model vs. mixed-tier) cannot isolate intrinsic capability from integration artifacts, directly undermining the claim that environment tooling is the strongest driver and that prompt engineering shows diminishing returns.

Authors: This criticism is valid. Our current logs do not contain the granular per-model breakdowns of tool-call parsing success, setup failures, or API error rates needed for full isolation. We will add a new subsection in Methods describing the aggregate API and tool-usage statistics that are available from our runs, and we will expand the Limitations section to explicitly discuss how integration compatibility may contribute to observed differences. This will qualify our claims about environment and model selection without overstating the isolation achieved.

Revision: partial
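The per-model breakdown the referee asks for could be produced from very simple logs. The sketch below is hypothetical throughout: the event format, the outcome labels (`ok`, `parse_error`, `api_error`, `setup_failure`), the model names, and the function `integration_breakdown` are assumptions for illustration, and the rebuttal itself says the current logs do not contain this detail.

```python
from collections import Counter, defaultdict


def integration_breakdown(events):
    """Turn (model, outcome) log events into per-model outcome fractions."""
    counts = defaultdict(Counter)
    for model, outcome in events:
        counts[model][outcome] += 1
    return {
        model: {outcome: n / sum(tally.values()) for outcome, n in tally.items()}
        for model, tally in counts.items()
    }


# Hypothetical log events; one entry per tool call or API request.
events = [
    ("model_a", "ok"),
    ("model_a", "parse_error"),
    ("model_b", "ok"),
    ("model_b", "ok"),
]
for model, fractions in integration_breakdown(events).items():
    print(model, fractions)
```

Reporting fractions like these next to solve rates would let a reader see how much of a gap between two models coincides with parsing or API failures rather than with reasoning.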

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

This paper reports direct empirical measurements of LLM solve rates on the external NYU CTF Bench using factorial comparisons across environments, models, and prompting strategies. No mathematical derivations, equations, fitted parameters, or predictions appear in the text. The central claims (environment and model selection as strongest drivers) follow from observed percentage-point differences and rankings, not from any self-referential reduction or self-citation that defines the outcomes by construction. The D-CIPHER framework citation supplies the experimental setup but does not bear the load of the performance results, which are independently measured against the benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that the NYU CTF Bench and D-CIPHER framework are valid proxies for offensive cyber tasks and that API integration differences do not systematically favor certain models.

axioms (2)

domain assumption: NYU CTF Bench challenges constitute a representative sample of offensive cybersecurity tasks.
Invoked by using the full 200 challenges as the evaluation set without further justification in the abstract.
domain assumption: Solve rate is an unbiased metric of agent capability when using the extended D-CIPHER framework.
Central to interpreting all reported percentages and comparisons.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] M. A. Ferrag, F. Alwahedi, A. Battah, B. Cherif, A. Mechri, N. Tihanyi, T. Bisztray, and M. Debbah, “Generative AI in cybersecurity: A comprehensive review of LLM applications and vulnerabilities,” Internet of Things and Cyber-Physical Systems, 2025.

[2] R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang, “LLM agents can autonomously hack websites,” arXiv preprint arXiv:2402.06664, 2024.

[3] G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzón, and S. Rass, “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in USENIX Security Symposium, 2024.

[4] M. Kouremetis, M. Dotter, A. Byrne, D. Martin, E. Michalak, G. Russo, M. Threet, and G. Zarrella, “OCCULT: Evaluating large language models for offensive cyber operation capabilities,” 2025, arXiv:2502.15797. [Online]. Available: http://arxiv.org/abs/2502.15797

[5] M. Shao, S. Jancheska, M. Udeshi, B. Dolan-Gavitt, H. Xi, K. Milner, B. Chen, M. Yin, S. Garg, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique, “NYU CTF Bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security,” 2025, arXiv:2406.05590. [Online]. Available: http://arxiv.org/abs/2406.05590

[6] T. Abramovich, M. Udeshi, M. Shao, K. Lieret, H. Xi, K. Milner, S. Jancheska, J. Yang, C. E. Jimenez, F. Khorrami, P. Krishnamurthy, B. Dolan-Gavitt, M. Shafique, K. Narasimhan, R. Karri, and O. Press, “EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities,” 2025, arXiv:2409.16165. [Online]. Available: http://arxiv.o...

[7] M. Udeshi, M. Shao, H. Xi, N. Rani, K. Milner, V. S. C. Putrevu, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique, “D-CIPHER: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security,” 2025, arXiv:2502.10931. [Online]. Available: http://arxiv.org/abs/2502.10931

[8] M. Shao, H. Xi, N. Rani, M. Udeshi, V. S. C. Putrevu, K. Milner, B. Dolan-Gavitt, S. K. Shukla, P. Krishnamurthy, F. Khorrami, R. Karri, and M. Shafique, “CRAKEN: Cybersecurity LLM agent with knowledge-based execution,” 2025, arXiv:2505.17107. [Online]. Available: http://arxiv.org/abs/2505.17107

[9] M. Shao, B. Chen, S. Jancheska, B. Dolan-Gavitt, S. Garg, R. Karri, and M. Shafique, “An empirical evaluation of LLMs for solving offensive security challenges,” 2024, arXiv:2402.11814. [Online]. Available: http://arxiv.org/abs/2402.11814

[10] A. K. Zhang, N. Perry, R. Dulepet, J. Ji, C. Menders, J. W. Lin, E. Jones, G. Hussein, S. Liu, D. Jasper et al., “Cybench: A framework for evaluating cybersecurity capabilities and risks of language models,” 2025, arXiv:2408.08926. [Online]. Available: http://arxiv.org/abs/2408.08926

[11] Anonymous, “CTFusion: A CTF-based benchmark for LLM agent evaluation,” Under review at ICLR, 2026.

[12] A. K. Zhang, J. Ji, C. Menders, R. Dulepet, T. Qin, R. Y. Wang, J. Wu, K. Liao, J. Li, J. Hu et al., “BountyBench: Dollar impact of AI agent attackers and defenders on real-world cybersecurity systems,” 2025, arXiv:2505.15216. [Online]. Available: http://arxiv.org/abs/2505.15216
