The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
Pith reviewed 2026-06-30 11:24 UTC · model grok-4.3
The pith
LLM-powered agents penetrate servers at success rates from 10.7% to 69.3% using only general tools and no target-specific knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.
What carries the argument
Tier-1 and Tier-2 target environments (one or three secure services without known vulnerabilities alongside a vulnerable service) paired with general-purpose agent scaffolding that uses only standard cybersecurity tools and no target-specific prior knowledge.
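The tier definition can be sketched as a small data structure. This is only an illustration of the stated rule (one vulnerable service plus one or three secure services); the class name, field names, and service names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Tier rule as stated in the abstract: number of secure services
# deployed alongside the single vulnerable service.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}


@dataclass(frozen=True)
class TargetServer:
    """Hypothetical description of one target server (illustrative fields)."""
    tier: int
    vulnerable_service: str
    secure_services: tuple[str, ...]

    def __post_init__(self) -> None:
        expected = SECURE_SERVICES_PER_TIER.get(self.tier)
        if expected is None:
            raise ValueError(f"unknown tier: {self.tier}")
        if len(self.secure_services) != expected:
            raise ValueError(
                f"Tier {self.tier} expects {expected} secure service(s), "
                f"got {len(self.secure_services)}"
            )

    @property
    def exposed_services(self) -> int:
        """Total services the agent has to triage on this server."""
        return 1 + len(self.secure_services)


# Placeholder service names, not drawn from the paper.
tier1 = TargetServer(1, "vuln-app", ("secure-a",))
tier2 = TargetServer(2, "vuln-app", ("secure-a", "secure-b", "secure-c"))
print(tier1.exposed_services, tier2.exposed_services)
```

Under this reading, the only thing that separates the tiers is how many non-vulnerable services the agent must rule out before it finds the exploitable one.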
If this is right
- Autonomous penetration is feasible for current models without human intervention or target-specific guidance.
- Penetration success increases as base model capability advances.
- Red-line concerns for high-impact cyberattacks remain relevant even under the constraints of realistic, knowledge-limited agent setups.
Where Pith is reading between the lines
- If the tiered environments capture real difficulty levels, then further model scaling will push autonomous cyber capabilities past current safety thresholds.
- General capability scaling laws may transfer directly to offense-oriented tasks such as penetration.
- Standardized benchmarks built on this tiered design could become routine for tracking whether models cross red lines in cyber domains.
Load-bearing premise
The constructed Tier-1 and Tier-2 environments plus the general-purpose agent scaffolding without target-specific prior knowledge accurately reflect the difficulty of real-world autonomous penetration tasks.
What would settle it
A test in which the same models, given only the same general tools and no target-specific knowledge, achieve near-zero success rates against live production servers that match the tiered structure.
Original abstract
Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a new evaluation framework for autonomous penetration capabilities of LLM-powered AI systems. It constructs Tier-1 (one secure service) and Tier-2 (three secure services) target environments alongside a vulnerable service, generating 300 servers total. A general-purpose agent scaffolding equipped with cybersecurity tools and no target-specific prior knowledge is used to evaluate 19 open-weight and proprietary LLMs. The central empirical findings are penetration success rates ranging from 10.7% to 69.3%, with the additional observation that autonomous penetration capability scales with overall model capability.
Significance. If the measured rates hold under the stated conditions, the work supplies one of the larger-scale controlled empirical datasets on a key AI red-line capability, using general-purpose tooling rather than task-specific guidance. This could inform safety benchmarks and scaling forecasts. The explicit construction of two environment tiers and the scale (300 servers, 19 models) are strengths for reproducibility of the measurement protocol.
major comments (2)
- [Target server construction] Target server construction (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7–69.3 %) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.
- [Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our evaluation framework and empirical results. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Target server construction] Target server construction (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7–69.3 %) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.
  Authors: We agree that additional methodological detail is required to substantiate the claim that the environments are representative. In the revised manuscript we will expand the target-server construction subsection to specify the exact services and versions chosen for the secure tiers, the process used to confirm absence of known vulnerabilities (CVE database queries plus manual verification), and configuration steps taken to limit fingerprinting. These additions will directly support the headline success rates.
  Revision: yes
- Referee: [Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.
  Authors: We concur that the current results section lacks sufficient granularity and transparency. The revised version will include a per-model table, separate Tier-1 and Tier-2 success rates, error bars derived from the 300-server sample, and an explicit description of the verification procedure (observable indicators of unauthorized access or control). These changes will allow readers to evaluate both the scaling observation and the numerical reliability of the reported rates.
  Revision: yes
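To give a sense of what error bars on a 300-server sample might look like, the sketch below computes a Wilson score interval for a binomial proportion. It is not the authors' method. It assumes each model is scored once on all 300 servers, in which case the reported extremes of 10.7% and 69.3% would correspond to roughly 32 and 208 successes; those counts are inferred here, not reported in the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Assumption: 300 independent attempts per model, one per target server.
# 32/300 and 208/300 are inferred from the reported 10.7% and 69.3%.
for label, k in [("lowest", 32), ("highest", 208)]:
    lo, hi = wilson_interval(k, 300)
    print(f"{label}: {k / 300:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for proportions near 0 or 1, which matters for the weakest models. Per-tier intervals would be wider, since each tier contains only part of the 300 servers.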
Circularity Check
No circularity: direct empirical measurement of observed success rates
The paper constructs Tier-1/Tier-2 server environments and a general-purpose agent scaffolding, then runs 19 LLMs to record penetration success rates (10.7–69.3 %). These rates are measured outcomes from agent executions on the constructed targets, not quantities obtained by fitting parameters, solving self-referential equations, or reducing via self-citation chains. No load-bearing step equates a claimed result to its own inputs by construction; the central claims are falsifiable experimental observations.