
arxiv: 2606.13079 · v2 · pith:44UQYU3Bnew · submitted 2026-06-11 · 💻 cs.CR · cs.AI

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

Pith reviewed 2026-06-30 11:24 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords autonomous penetration · LLM agents · cybersecurity evaluation · AI red lines · penetration testing · model capability scaling

The pith

LLM-powered agents penetrate servers at success rates from 10.7% to 69.3% using only general tools and no target-specific knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation framework intended to measure autonomous penetration by LLM agents more realistically than prior work. It deploys 300 target servers across two tiers; each server runs one vulnerable service alongside either one (Tier 1) or three (Tier 2) secure services with no known vulnerabilities. A general-purpose agent scaffolding provides standard cybersecurity tools but no prior information about any specific target. Across the 19 models tested, penetration success rates range from 10.7% to 69.3%, and success rises with the underlying model’s overall capability.

Core claim

Current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

What carries the argument

Tier-1 and Tier-2 target environments (one or three secure services without known vulnerabilities alongside a vulnerable service) paired with general-purpose agent scaffolding that uses only standard cybersecurity tools and no target-specific prior knowledge.
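The tier definition can be written down as a small data structure. The following is a minimal sketch, not the authors' tooling: the names `TargetServer` and `build_targets` are hypothetical, and the even 150/150 split is an assumption, since the summary gives only the total of 300 servers.

```python
from dataclasses import dataclass
from typing import List

# Secure services (no known vulnerabilities) per tier, as stated in the abstract.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}


@dataclass(frozen=True)
class TargetServer:
    """Description of one target; hypothetical structure for illustration."""
    server_id: int
    tier: int
    vulnerable_services: int  # always 1 in the described design
    secure_services: int      # 1 for Tier 1, 3 for Tier 2

    @property
    def exposed_services(self) -> int:
        return self.vulnerable_services + self.secure_services


def build_targets(n_tier1: int, n_tier2: int) -> List[TargetServer]:
    """Enumerate target descriptions; per-tier counts are caller-supplied."""
    targets = []
    for i in range(n_tier1 + n_tier2):
        tier = 1 if i < n_tier1 else 2
        targets.append(TargetServer(i, tier, 1, SECURE_SERVICES_PER_TIER[tier]))
    return targets


# ASSUMPTION: an even split between tiers; only the total of 300 is reported.
targets = build_targets(150, 150)
```

Under this description a Tier-1 server exposes two services and a Tier-2 server exposes four, so the vulnerable service is one of two candidates in Tier 1 and one of four in Tier 2.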

If this is right

  • Autonomous penetration is feasible for current models without human intervention or target-specific guidance.
  • Penetration success increases as base model capability advances.
  • Red-line concerns for high-impact cyberattacks remain relevant even under the constraints of realistic, knowledge-limited agent setups.
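A claim that success rises with capability is a claim about monotone association, which a rank correlation can quantify. The sketch below computes Spearman's rho in plain Python. It is an illustration of one possible check, not the paper's analysis, and the input values are toy numbers chosen for the example, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# TOY VALUES ONLY (not from the paper): capability score vs. success rate.
toy_capability = [10, 20, 30, 40]
toy_success = [0.1, 0.3, 0.2, 0.6]
rho = spearman(toy_capability, toy_success)
```

A rho near 1 across the 19 models would support the scaling observation; the choice of capability score is left open here because the summary does not name one.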

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tiered environments capture real difficulty levels, then further model scaling will push autonomous cyber capabilities past current safety thresholds.
  • General capability scaling laws may transfer directly to offense-oriented tasks such as penetration.
  • Standardized benchmarks built on this tiered design could become routine for tracking whether models cross red lines in cyber domains.

Load-bearing premise

The constructed Tier-1 and Tier-2 environments plus the general-purpose agent scaffolding without target-specific prior knowledge accurately reflect the difficulty of real-world autonomous penetration tasks.

What would settle it

A test in which the same models, given only the same general tools and no target-specific knowledge, achieve near-zero success rates against live production servers that match the tiered structure.
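"Near-zero" can be made concrete with an exact binomial tail: how likely is it to see so few successes if the true rate were as high as the lowest reported one? The sketch below assumes independent attempts with a common success probability; the trial count of 50 is hypothetical and not taken from the paper.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    if not 0 <= k <= n:
        raise ValueError("k must lie in [0, n]")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# HYPOTHETICAL scenario: 50 live targets, zero successful penetrations.
# 0.107 is the lowest success rate reported in the paper.
p_zero_of_50 = binom_cdf(0, 50, 0.107)
```

If that tail probability is small, an all-failures outcome on live servers would be hard to reconcile with the benchmark rate, which is the kind of evidence this section asks for.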

Original abstract

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces a new evaluation framework for autonomous penetration capabilities of LLM-powered AI systems. It constructs Tier-1 (one secure service) and Tier-2 (three secure services) target environments alongside a vulnerable service, generating 300 servers total. A general-purpose agent scaffolding equipped with cybersecurity tools and no target-specific prior knowledge is used to evaluate 19 open-weight and proprietary LLMs. The central empirical findings are penetration success rates ranging from 10.7% to 69.3%, with the additional observation that autonomous penetration capability scales with overall model capability.

Significance. If the measured rates hold under the stated conditions, the work supplies one of the larger-scale controlled empirical datasets on a key AI red-line capability, using general-purpose tooling rather than task-specific guidance. This could inform safety benchmarks and scaling forecasts. The explicit construction of two environment tiers and the scale (300 servers, 19 models) are strengths for reproducibility of the measurement protocol.

major comments (2)
  1. [Target server construction] Target server construction (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7%–69.3%) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.
  2. [Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our evaluation framework and empirical results. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

  1. Referee: [Target server construction] Target server construction (abstract and methodology): The Tier-1/Tier-2 environments are defined solely by the count of secure services (1 or 3) plus one vulnerable service. No details are provided on service selection, fingerprinting resistance, or confirmation that the secure services truly have no known vulnerabilities or exploitable interactions. This is load-bearing for the headline rates (10.7%–69.3%) because the central claim that these rates reflect realistic autonomous penetration without excessive simplification rests on the environments being representative.

    Authors: We agree that additional methodological detail is required to substantiate the claim that the environments are representative. In the revised manuscript we will expand the target-server construction subsection to specify the exact services and versions chosen for the secure tiers, the process used to confirm absence of known vulnerabilities (CVE database queries plus manual verification), and configuration steps taken to limit fingerprinting. These additions will directly support the headline success rates.

    Revision: yes

  2. Referee: [Results] Results presentation: The success rates are stated as a range without error bars, per-tier (Tier-1 vs Tier-2) breakdowns, per-model tables, or any description of the verification procedure used to confirm successful penetration (e.g., evidence of unauthorized access or control). This directly affects assessment of the scaling claim and the reliability of the reported numbers.

    Authors: We concur that the current results section lacks sufficient granularity and transparency. The revised version will include a per-model table, separate Tier-1 and Tier-2 success rates, error bars derived from the 300-server sample, and an explicit description of the verification procedure (observable indicators of unauthorized access or control). These changes will allow readers to evaluate both the scaling observation and the numerical reliability of the reported rates.

    Revision: yes
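The error bars requested above could be computed in several ways; one common choice for a success proportion is the Wilson score interval. The following is a sketch under assumptions, not the authors' procedure: it treats each model as having 300 independent attempts, and the counts 32 and 208 are inferred from rounding the reported 10.7% and 69.3% against 300, not stated in the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


# ASSUMPTION: 300 attempts per model; counts inferred from the rounded rates.
low_model = wilson_interval(32, 300)    # 32/300 rounds to 10.7%
high_model = wilson_interval(208, 300)  # 208/300 rounds to 69.3%
```

With 300 attempts the interval extends a few percentage points on either side of the point estimate, which matters when comparing models whose reported rates are close. Per-tier intervals would be wider because each tier contains fewer servers.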

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of observed success rates


The paper constructs Tier-1/Tier-2 server environments and a general-purpose agent scaffolding, then runs 19 LLMs to record penetration success rates (10.7%–69.3%). These rates are measured outcomes from agent executions on the constructed targets, not quantities obtained by fitting parameters, solving self-referential equations, or reducing via self-citation chains. No load-bearing step equates a claimed result to its own inputs by construction; the central claims are falsifiable experimental observations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters are fitted to data, no mathematical axioms are invoked beyond standard testing assumptions, and no new physical or computational entities are postulated.

pith-pipeline@v0.9.1-grok · 5852 in / 1151 out tokens · 35343 ms · 2026-06-30T11:24:43.242605+00:00 · methodology

