The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Brian Tse; Geng Hong; Jiaqi Luo; Jiarun Dai; Jia Xu; Min Yang; Weibing Wang; Xudong Pan; Yawen Duan; Yuan Zhang

arxiv: 2606.13079 · v2 · pith:44UQYU3Bnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI

Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, Min Yang

Pith reviewed 2026-06-30 11:24 UTC · model grok-4.3

keywords autonomous penetration · LLM agents · cybersecurity evaluation · AI red lines · penetration testing · model capability scaling

The pith

LLM-powered agents penetrate servers at success rates from 10.7% to 69.3% using only general tools and no target-specific knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a new evaluation framework to measure autonomous penetration by LLM agents more realistically than prior work. It deploys 300 target servers in two tiers, each with one vulnerable service plus either one secure service (Tier-1) or three secure services (Tier-2) that have no known vulnerabilities. A general-purpose agent scaffolding provides standard cybersecurity tools but no prior information about any specific target. Across the 19 models tested, penetration success ranges from 10.7% to 69.3%. The same experiments show that success rises in step with gains in the underlying model's overall capability.

Core claim

Current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

What carries the argument

Tier-1 and Tier-2 target environments (one or three secure services without known vulnerabilities alongside a vulnerable service) paired with general-purpose agent scaffolding that uses only standard cybersecurity tools and no target-specific prior knowledge.
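
To make the tier definition concrete, here is a minimal sketch of how one target could be described in code. Only the tier rule (one vulnerable service plus one or three secure services) comes from the abstract; the class names, field names, and the `build_target` helper are hypothetical and are not taken from the paper's tooling.

```python
from dataclasses import dataclass
from typing import List

# Secure (no-known-vulnerability) services per tier, as stated in the abstract.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}


@dataclass(frozen=True)
class Service:
    name: str         # hypothetical label for illustration
    vulnerable: bool  # True only for the single exploitable service


@dataclass(frozen=True)
class TargetServer:
    tier: int
    services: List[Service]


def build_target(tier: int, vulnerable_name: str, secure_names: List[str]) -> TargetServer:
    """Assemble one target: exactly one vulnerable service plus the tier's secure services."""
    if tier not in SECURE_SERVICES_PER_TIER:
        raise ValueError(f"unknown tier: {tier}")
    expected = SECURE_SERVICES_PER_TIER[tier]
    if len(secure_names) != expected:
        raise ValueError(f"tier {tier} needs {expected} secure service(s), got {len(secure_names)}")
    services = [Service(vulnerable_name, True)]
    services += [Service(name, False) for name in secure_names]
    return TargetServer(tier, services)
```

Under this sketch a Tier-1 target exposes two services and a Tier-2 target exposes four, so the agent has to work out which one is exploitable without being told.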

If this is right

Autonomous penetration is feasible for current models without human intervention or target-specific guidance.
Penetration success increases as base model capability advances.
Red-line concerns for high-impact cyberattacks remain relevant even under the constraints of realistic, knowledge-limited agent setups.
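
One way to check the second point, assuming a per-model capability score and a per-model success rate are available, is a rank correlation. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman and both function names are assumptions made for illustration; the abstract does not say how the authors quantify the relationship.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

A coefficient near 1 across the 19 models would support the claim that success rises with capability; a value near 0 would not.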

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the tiered environments capture real difficulty levels, then further model scaling will push autonomous cyber capabilities past current safety thresholds.
General capability scaling laws may transfer directly to offense-oriented tasks such as penetration.
Standardized benchmarks built on this tiered design could become routine for tracking whether models cross red lines in cyber domains.

Load-bearing premise

The constructed Tier-1 and Tier-2 environments plus the general-purpose agent scaffolding without target-specific prior knowledge accurately reflect the difficulty of real-world autonomous penetration tasks.

What would settle it

A test in which the same models, given only the same general tools and no target-specific knowledge, achieve near-zero success rates against live production servers that match the tiered structure.

Original abstract

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper measures 10.7%–69.3% penetration success across 19 models on 300 tiered servers and shows success rising with model capability, but setups with only one or three secure services likely understate real target complexity.

The main point is that current LLMs reach autonomous penetration rates from 10.7% to 69.3% in this new setup, with stronger models doing better. They built 300 servers split into Tier-1 (one secure service plus the vulnerable one) and Tier-2 (three secure services), then ran a general agent with standard tools and no target-specific hints.

The tiered design and the scale are the clearest additions. Adding secure services forces the agent to handle some noise instead of just hitting an obvious vulnerable target, and 300 instances give more data than many earlier tests. The general scaffolding without priors also moves away from the spoon-feeding the abstract criticizes in prior work.

The environments still look too clean. Real servers usually have more services, logging that flags odd tool use, and discovery steps that these tiers skip. The abstract gives the headline numbers but leaves out error bars, per-tier breakdowns, and how they confirmed actual access. If the vulnerable services were easy to fingerprint or the servers were generated with any selection bias, the rates won't travel well to the red-line scenarios the paper invokes.

This is aimed at groups tracking AI cyber red lines and safety evaluations. The measurements are direct enough to be worth checking against other benchmarks. It deserves a serious referee to examine the server generation process, success verification, and whether the scaling claim holds once the full methods are visible.

Referee Report

2 major / 0 minor

Summary. The paper introduces a new evaluation framework for autonomous penetration capabilities of LLM-powered AI systems. It constructs Tier-1 (one secure service) and Tier-2 (three secure services) target environments alongside a vulnerable service, generating 300 servers total. A general-purpose agent scaffolding equipped with cybersecurity tools and no target-specific prior knowledge is used to evaluate 19 open-weight and proprietary LLMs. The central empirical findings are penetration success rates ranging from 10.7% to 69.3%, with the additional observation that autonomous penetration capability scales with overall model capability.

Significance. If the measured rates hold under the stated conditions, the work supplies one of the larger-scale controlled empirical datasets on a key AI red-line capability, using general-purpose tooling rather than task-specific guidance. This could inform safety benchmarks and scaling forecasts. The explicit construction of two environment tiers and the scale (300 servers, 19 models) are strengths for reproducibility of the measurement protocol.

major comments (2)

[Target server construction] (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7%–69.3%) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.
[Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.
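
The per-model, per-tier breakdown requested above could be produced from raw run records along the following lines. This is a sketch under assumptions: the `(model, tier, verified)` record layout and the function name are hypothetical, and `verified` stands for whatever success check the authors actually apply, which the abstract does not describe.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (model name, tier, whether penetration was verified).
Run = Tuple[str, int, bool]


def success_by_model_and_tier(runs: Iterable[Run]) -> Dict[Tuple[str, int], float]:
    """Return the fraction of verified penetrations for every (model, tier) pair."""
    wins: Dict[Tuple[str, int], int] = defaultdict(int)
    total: Dict[Tuple[str, int], int] = defaultdict(int)
    for model, tier, verified in runs:
        key = (model, tier)
        total[key] += 1
        wins[key] += int(verified)
    return {key: wins[key] / total[key] for key in total}
```

Reporting these cells separately would show whether the extra secure services in Tier-2 actually lower success, something a single pooled range cannot reveal.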

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our evaluation framework and empirical results. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

Referee: [Target server construction] (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7%–69.3%) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.

Authors: We agree that additional methodological detail is required to substantiate the claim that the environments are representative. In the revised manuscript we will expand the target-server construction subsection to specify the exact services and versions chosen for the secure tiers, the process used to confirm absence of known vulnerabilities (CVE database queries plus manual verification), and configuration steps taken to limit fingerprinting. These additions will directly support the headline success rates. revision: yes
Referee: [Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.

Authors: We concur that the current results section lacks sufficient granularity and transparency. The revised version will include a per-model table, separate Tier-1 and Tier-2 success rates, error bars derived from the 300-server sample, and an explicit description of the verification procedure (observable indicators of unauthorized access or control). These changes will allow readers to evaluate both the scaling observation and the numerical reliability of the reported rates. revision: yes
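
As an illustration of what error bars on these rates might look like, the sketch below computes a Wilson score interval for a binomial proportion. It assumes each headline rate is a simple fraction of the 300 targets with one attempt per target (32/300 ≈ 10.7%, 208/300 ≈ 69.3%), which the abstract does not state. The choice of the Wilson method is also an assumption, not the authors' stated procedure.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Assumed counts matching the reported extremes on 300 targets.
low_model = wilson_interval(32, 300)
high_model = wilson_interval(208, 300)
```

With 300 trials the interval spans a few percentage points on either side of each rate, so the gap between the weakest and strongest models is far larger than the sampling noise under these assumptions.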

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of observed success rates

The paper constructs Tier-1/Tier-2 server environments and a general-purpose agent scaffolding, then runs 19 LLMs to record penetration success rates (10.7%–69.3%). These rates are measured outcomes from agent executions on the constructed targets, not quantities obtained by fitting parameters, solving self-referential equations, or reducing via self-citation chains. No load-bearing step equates a claimed result to its own inputs by construction; the central claims are falsifiable experimental observations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters are fitted to data, no mathematical axioms are invoked beyond standard testing assumptions, and no new physical or computational entities are postulated.

