AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Bocheng Xiang; Bofei Chen; Fengyu Liu; Geng Hong; Jiarun Dai; Jie Zhang; Min Yang; Qiyi Zhang; Wuyuao Mai; Xudong Pan

arxiv: 2606.14295 · v2 · pith:MOS2PBREnew · submitted 2026-06-12 · 💻 cs.CR · cs.AI· cs.LG

AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Fengyu Liu , Jiarun Dai , Yihe Fan , Wuyuao Mai , Ziao Li , Bofei Chen , Jie Zhang , Zheng Lou

show 6 more authors

Bocheng Xiang Qiyi Zhang Xudong Pan Geng Hong Yuan Zhang Min Yang

This is my paper

Pith reviewed 2026-06-27 04:56 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG

keywords AI agentscybersecurity benchmarksvulnerability exploitationpost-exploitationcyber rangesautonomous attacksfrontier modelsweb security

0 comments

The pith

AgentCyberRange supplies the first open infrastructure for testing AI agents on full multi-stage cyber intrusions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentCyberRange to close the gap between isolated skill tests and realistic attack workflows that move from exposed services to internal network compromise. It supplies 110 vulnerabilities across 15 web applications and 8 enterprise ranges containing 156 hosts, together with an orchestration toolchain. Evaluations of six frontier systems under matched conditions show the strongest performer, GPT-5.5 with Codex, succeeding on 16.1 percent of web exploitation tasks and 31.7 percent of post-exploitation tasks. The benchmark therefore makes it possible to observe current limits and future progress in autonomous offensive capability under reproducible conditions.

Core claim

AgentCyberRange combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus the Cage toolchain for execution, orchestration, result collection, and verification. The benchmark measures two stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents expand an initial foothold into broader compromise. Six frontier AI systems were evaluated; GPT-5.5 with Codex performed best, solving 16.1 percent of web exploitation tasks and 31.7 percent of post-exploitation tasks, with rates rising to 33.0 percent and 46.3 percent when given more concrete hints.

What carries the argument

AgentCyberRange infrastructure that integrates selected vulnerabilities, multi-host cyber ranges, and the Cage toolchain to execute, record, and verify AI agent actions across web and internal stages.

If this is right

Frontier AI systems currently succeed on only a minority of realistic web and post-exploitation tasks.
Adding concrete hints raises success rates by roughly double.
The same setup surfaces unknown vulnerabilities and payload mutations that bypass defenses.
Open, reproducible cyber-range testing is required to track emerging offensive capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If performance on this benchmark improves over successive model releases, the rate of increase would give an early signal of when AI systems cross practical thresholds for real-world intrusion.
The infrastructure could be reused with defensive agents to measure detection and response under the same conditions.
Extending the ranges to include more varied internal topologies would test whether current results generalize beyond the eight selected enterprise layouts.

Load-bearing premise

The chosen vulnerabilities and ranges capture enough of real multi-stage intrusion workflows that success rates on the benchmark indicate genuine offensive capability.

What would settle it

A controlled study in which the same AI agents achieve markedly different success rates when run against live, uncontrolled systems that match the benchmark's application and network profiles.

Figures

Figures reproduced from arXiv: 2606.14295 by Bocheng Xiang, Bofei Chen, Fengyu Liu, Geng Hong, Jiarun Dai, Jie Zhang, Min Yang, Qiyi Zhang, Wuyuao Mai, Xudong Pan, Yihe Fan, Yuan Zhang, Zheng Lou, Ziao Li.

**Figure 1.** Figure 1: Overall results on the AGENTCYBERRANGE tasks. Solid curves show Pass@3 (Avg.) over execution steps for all systems. For the top two systems, dashed curves show Pass@3 (Max). Shaded bands indicate the best-to-worst range across three independent runs at each step budget. GPT-5.5 with Codex leads on both tracks, reaching 16.1% on web exploitation and 31.7% on postexploitation, but remains far from full comp… view at source ↗

**Figure 2.** Figure 2: Overview of a realistic cyber attack workflow. An attack proceeds through four stages: reconnaissance, web exploitation, post exploitation, and reporting. The red path traces an example: a crawler finds a hidden endpoint (1); command injection yields RCE and a webshell (2); the attacker escalates to root (3), evades host defenses (4), moves laterally (5), and gains full cluster control (6). Realistic cyber… view at source ↗

**Figure 3.** Figure 3: Overview of AGENTCYBERRANGE and the CAGE pipeline. AGENTCYBERRANGE provides web and post exploitation tasks, and CAGE is an easy-to-use, scalable pipeline that runs heterogeneous agents on these tasks and automatically verifies their results. 5 NUWA-TR-2026-001 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Difficulty levels in AGENTCYBERRANGE. Information increases from Level-0 to Level2. Web: Level-0 gives only the target URL, Level-1 adds which URLs are vulnerable, and Level-2 adds each vulnerability’s type. Post: Level-0 gives only the entry-point IP, Level-1 adds the topology, and Level-2 adds concrete CVEs and hints. Web Exploitation Tasks. We define three difficulty levels for web exploitation tasks, … view at source ↗

**Figure 5.** Figure 5: Overall Pass@3 (Avg.) success rates across difficulty levels. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Behavioral analysis of web exploitation. Each row shows one agent, split into exploration and exploitation, with colors denoting command categories. Most agents mainly rely on curl and python3; only GPT-5.5 visibly uses endpoint-discovery tools such as ffuf. Behavioral Analysis of Web Exploitation [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Run-to-run variance of Level-0 web exploitation. Many vulnerabilities surface in only a single run, indicating high variance and explaining the gap between Pass@1 and Pass@3 (Max). Run-to-run Variance [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Detection rate across different depths of GPT-5.5. Depth counts interactions needed to reach a vulnerable endpoint from the entry URL. Bars show total (light) and detected (dark) vulnerabilities; the line is the detection rate, falling from 35% at depth 2 to 11% at depth 6, showing that agents struggle to find deeper vulnerabilities. Failure analysis. We analyze failed tasks and find that the primary caus… view at source ↗

**Figure 9.** Figure 9: Post exploitation results across the eight ranges. Each subplot is one range and plots success rate as the number of execution steps grows. Result Breakdown [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Behavioral analysis of post exploitation. (a) Actions mapped to seven ATT&CKinspired tactics: reconnaissance (Rec.), exploitation (Expl.), credential discovery (Cred.), pivoting (Piv.), lateral movement (Lat.), privilege escalation (Priv.), and anti-virus evasion (AV). (b) Distribution of the five most frequently used commands for each agent. Behavioral Analysis of Post Exploitation [PITH_FULL_IMAGE:fi… view at source ↗

**Figure 11.** Figure 11: A representative failed postexploitation task requiring chained exploitation. The intended path starts from Confluence RCE, recovers credentials from the compromised Confluence, uses them to access GitLab and audit source code, and finally exploits a newly discovered vulnerability in the downstream application. Failure Analysis. We analyze representative failed cases to understand why agents fail on… view at source ↗

**Figure 12.** Figure 12: Attack trajectory of GPT-5.5 in post exploitation range-1. Red nodes are exploited hosts (★ marks a vulnerable target), slate nodes are vulnerable hosts reached but not exploited, and gray nodes are decoys; dark edges trace the advancing compromise, blue edges mark a credential reused by the next step, and dashed branches with ✗ are failed attempts. Although GPT-5.5 demonstrates complex penetration capab… view at source ↗

**Figure 13.** Figure 13: Topology and attack chain of range-6 in post exploitation task. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Example prompt of web exploitation task. [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗

**Figure 15.** Figure 15: Example prompt of post exploitation task. [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗

**Figure 16.** Figure 16: CAGE Inspector overview pages. 30 NUWA-TR-2026-001 [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗

**Figure 17.** Figure 17: Trial-level trajectory view in the CAGE Inspector. 31 NUWA-TR-2026-001 [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗

read the original abstract

Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects, and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgentCyberRange supplies a larger-scale open testbed for full cyber attack chains than prior CTF work, but the claim that its scenarios reflect real workflows rests on unshown selection criteria.

read the letter

The paper's core offering is an infrastructure that wires together 8 enterprise-style ranges, 15 web apps, 110 vulnerabilities, and 156 hosts into two-stage tasks that run from web exploitation through post-exploitation. They also ship the Cage toolchain for execution and verification. That combination is new relative to the isolated benchmarks they cite.

They evaluate six frontier models under fixed prompts and report concrete numbers: GPT-5.5 plus Codex reaches 16.1 % on web tasks and 31.7 % on post-exploitation, rising with hints. They also record a few out-of-benchmark observations such as previously unknown vulnerabilities.

The soft spot is exactly the one flagged in the stress test. The abstract and setup assert that these ranges and vulnerabilities capture realistic multi-stage intrusions, yet supply no selection criteria, ATT&CK coverage numbers, or comparison against real breach corpora. Without that, the success rates are hard to read as evidence of general offensive capability rather than performance on a particular curated distribution.

For readers who need a reproducible environment to measure current AI cyber agents, the infrastructure itself is worth examining. The paper shows clear thinking on the evaluation protocol and honest reporting of low absolute numbers.

It deserves peer review because the scale of the testbed is substantial and the topic matters for safety work, even though the representativeness argument needs tightening.

Referee Report

2 major / 0 minor

Summary. The paper introduces AgentCyberRange as the first open multi-range infrastructure for evaluating autonomous AI cyber attack capabilities. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like ranges (156 hosts total) with the Cage toolchain for orchestration and verification. The benchmark splits into web exploitation (exploring exposed apps and validating vulns) and post-exploitation (expanding from initial foothold). Six frontier models are evaluated under matched conditions; GPT-5.5 with Codex achieves 16.1% on web tasks and 31.7% on post-exploitation tasks (rising to 33.0% and 46.3% with hints). Additional observations include discovery of unknown vulnerabilities and defense-bypassing mutations. The work positions open cyber-range evaluation as necessary for observing realistic offensive capabilities.

Significance. If the tasks prove representative, the open infrastructure and Cage toolchain would supply a reproducible, multi-host platform that addresses the gap between isolated CTF/vuln benchmarks and realistic multi-stage workflows. Explicit credit is due for releasing the full setup and for reporting out-of-benchmark findings that demonstrate the evaluation's exploratory value. The empirical numbers on current model performance provide a concrete baseline for tracking progress in this domain.

major comments (2)

[Abstract] Abstract: success rates of 16.1% (web) and 31.7% (post-exploitation) are stated without any description of the scoring rubric, verification process, inter-rater reliability, or error bars; this directly undermines assessment of the central empirical claims.
[Abstract] Abstract: the assertion that the 110 vulnerabilities, 15 applications, and 8 ranges 'sufficiently represent realistic multi-stage intrusion workflows' is load-bearing for interpreting the reported rates as evidence of real-world capability, yet no selection criteria, ATT&CK coverage statistics, or validation against breach corpora are supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments correctly identify areas where the abstract lacks sufficient context for the central claims. We will revise the abstract and add clarifying text in the methods section. Responses to each major comment follow.

read point-by-point responses

Referee: [Abstract] Abstract: success rates of 16.1% (web) and 31.7% (post-exploitation) are stated without any description of the scoring rubric, verification process, inter-rater reliability, or error bars; this directly undermines assessment of the central empirical claims.

Authors: We agree the abstract is too terse on methodology. Success is determined by the Cage toolchain through automated checks against per-vulnerability success conditions (e.g., file creation, command execution, or service state changes) as defined in Section 4.2; the process is fully deterministic with no human raters, so inter-rater reliability does not apply. Standard errors across multiple runs are reported in Tables 2 and 3 of Section 5. We will revise the abstract to state that rates reflect automated verification and direct readers to Section 4 for the rubric and verification details. revision: yes
Referee: [Abstract] Abstract: the assertion that the 110 vulnerabilities, 15 applications, and 8 ranges 'sufficiently represent realistic multi-stage intrusion workflows' is load-bearing for interpreting the reported rates as evidence of real-world capability, yet no selection criteria, ATT&CK coverage statistics, or validation against breach corpora are supplied.

Authors: The abstract phrasing overstates the claim. Section 3.1 describes the selection: the 15 web applications were chosen from popular open-source projects with publicly documented CVEs; the 8 ranges were constructed to include common enterprise services, network segmentation, and multi-host topologies. We do not supply quantitative ATT&CK tactic coverage or direct mapping to breach corpora. We will revise the abstract to remove the 'sufficiently represent' assertion, replace it with a description of the design goals, and add a short paragraph in Section 3 clarifying selection criteria while noting the absence of breach-corpus validation as a limitation. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an open benchmark infrastructure (AgentCyberRange) and reports empirical success rates of external frontier AI systems on web exploitation and post-exploitation tasks. No equations, fitted parameters, derivations, or self-referential predictions appear in the provided text. The central claims rest on direct evaluation of third-party models (e.g., GPT-5.5 with Codex) under matched prompts, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The work is self-contained empirical infrastructure; representativeness concerns are validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the contribution is the creation of new benchmark infrastructure and empirical results rather than theoretical derivation.

pith-pipeline@v0.9.1-grok · 5870 in / 1153 out tokens · 26207 ms · 2026-06-27T04:56:17.785976+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 1 canonical work pages

[1]

Inspect AI.https://github.com/UKGovernmentBEIS/inspect_ai
[2]

Anti-virus Software.https://en.wikipedia.org/wiki/Antivirus_software
[3]

ATT&CK.https://attack.mitre.org/,
[4]

Enhancing Cyber Resilience.https://www.cisa.gov/news-events/ cybersecurity-advisories/aa24-326a,
[5]

Palo Alto Networks Unit 42 Global Incident Response Re- port.https://www.paloaltonetworks.com/resources/research/ unit-42-incident-response-report,
[6]

ComfyUI.https://github.com/Comfy-Org/ComfyUI
[7]

Confluence.https://www.atlassian.com/software/confluence,
[8]

Confluence Post-exploitation.https://github.com/CrackerCat/PostConfluence,
[9]

Internal Network.https://en.wikipedia.org/wiki/DMZ_(computing)
[10]

DeepSeek-V4-Pro.https://api-docs.deepseek.com/news/news260424
[11]

Endpoint Detection and Response.https://en.wikipedia.org/wiki/Endpoint_ detection_and_response
[12]

ffuf.https://github.com/ffuf/ffuf
[13]

GLM-5.1.https://docs.z.ai/guides/llm/glm-5.1
[14]

GPT-5.5.https://openai.com/index/introducing-gpt-5-5
[15]

Kali Linux.https://www.kali.org/
[16]

Kimi-2.6.https://www.kimi.com/ai-models/kimi-k2-6
[17]

18 NUW A-TR-2026-001

MITRE ATT&CK: Enterprise matrix.https://attack.mitre.org/matrices/ enterprise/. 18 NUW A-TR-2026-001

2026
[18]

MetaSploit Framework.https://www.metasploit.com/
[19]

Project Glasswing.https://www.anthropic.com/glasswing
[20]

nmap.https://github.com/nmap/nmap
[21]

Claude-Opus-4.7.https://www.anthropic.com/news/claude-opus-4-7
[22]

OW ASP Top Ten Web Application Security Risks.https://owasp.org/ www-project-top-ten/,
[23]

Penetration Testing Execution Standard (PTES).http://www.pentest-standard.org/ index.php/Main_Page,
[24]

OW ASP Web Security Testing Guide.https://owasp.org/ www-project-web-security-testing-guide/,
[25]

Post-exploitation.http://www.pentest-standard.org/index.php/Post_ Exploitation
[26]

Finding Zero-Days with Any Model.https://www.provos.org/p/ finding-zero-days-with-any-model/
[27]

Qwen-3.7-Max.https://qwen.ai/blog?id=qwen3.7
[28]

Webshell.https://en.wikipedia.org/wiki/Web_shell
[29]

XBow Benchmark.https://github.com/xbow-engineering/ validation-benchmarks
[30]

Towards a formal foundation of web security

Devdatta Akhawe, Adam Barth, Peifung E Lam, John Mitchell, and Dawn Song. Towards a formal foundation of web security. In2010 23rd IEEE computer security foundations sympo- sium, pages 290–304. IEEE, 2010

2010
[31]

PentestGPT: Evaluating and harnessing large language models for automated penetration testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

2024
[32]

What makes a good llm agent for real-world penetration testing?arXiv preprint arXiv:2602.17622, 2026

Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good llm agent for real-world penetration testing?arXiv preprint arXiv:2602.17622, 2026

arXiv 2026
[33]

Black widow: Blackbox data-driven web scanning

Benjamin Eriksson, Giancarlo Pellegrino, and Andrei Sabelfeld. Black widow: Blackbox data-driven web scanning. In2021 IEEE Symposium on Security and Privacy (SP), pages 1125–1142. IEEE, 2021

2021
[34]

LLM Agents can Autonomously Exploit One-day Vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM Agents can Autonomously Exploit One-day Vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

Pith/arXiv arXiv 2024
[35]

Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios.arXiv preprint arXiv:2603.11214, 2026

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, et al. Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios.arXiv preprint arXiv:2603.11214, 2026

arXiv 2026
[36]

Exploitbench: A capability ladder benchmark for llm cybersecurity agents.arXiv preprint arXiv:2605.14153, 2026

Seunghyun Lee and David Brumley. Exploitbench: A capability ladder benchmark for llm cybersecurity agents.arXiv preprint arXiv:2605.14153, 2026

Pith/arXiv arXiv 2026
[37]

Holistic con- colic execution for dynamic web applications via symbolic interpreter analysis

Penghui Li, Wei Meng, Mingxue Zhang, Chenlin Wang, and Changhua Luo. Holistic con- colic execution for dynamic web applications via symbolic interpreter analysis. In2024 IEEE Symposium on Security and Privacy (SP), pages 222–238. IEEE, 2024

2024
[38]

Bacscan: Automatic black-box detection of broken-access-control vulnerabilities in web applications

Fengyu Liu, Yuan Zhang, Enhao Li, Wei Meng, Youkun Shi, Qianheng Wang, Chenlin Wang, Zihan Lin, and Min Yang. Bacscan: Automatic black-box detection of broken-access-control vulnerabilities in web applications. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 1320–1333, 2025. 19 NUW A-TR-2026-001

2025
[39]

Synthesizing multi-agent harnesses for vulnerability discovery.arXiv preprint arXiv:2604.20801, 2026

Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery.arXiv preprint arXiv:2604.20801, 2026

Pith/arXiv arXiv 2026
[40]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, volume 2024, pages 52989–53046, 2024

2024
[41]

Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, et al. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

2024
[42]

Incalmo: An autonomous llm-assisted system for red teaming multi-host networks.arXiv preprint arXiv:2501.16466, 2025

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An autonomous llm-assisted system for red teaming multi-host networks.arXiv preprint arXiv:2501.16466, 2025

arXiv 2025
[43]

Cy- bergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale.arXiv preprint arXiv:2506.02548, 2025

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. Cy- bergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale.arXiv preprint arXiv:2506.02548, 2025

arXiv 2025
[44]

Exploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, et al. Exploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Pith/arXiv arXiv 2026
[45]

Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems.Advances in Neural Information Process- ing Systems, 38, 2026

Andy Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, et al. Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems.Advances in Neural Information Process- ing Systems, 38, 2026

2026
[46]

Cybench: A framework for evaluat- ing cybersecurity capabilities and risks of language models

Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluat- ing cybersecurity capabilities and risks of language models. InInternational Conference on Learning Representations, volume 2025, pages 25094–25243, 2025

2025
[47]

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (V olume 1: Long Papers). doi: 10.18653/v1/2026.eacl-long.2

work page doi:10.18653/v1/2026.eacl-long.2 2026
[48]

CVE-bench: a benchmark for AI agents’ ability to exploit real-world web application vulnerabilities.arXiv preprint arXiv:2503.17332, 2025

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, et al. CVE-bench: a benchmark for AI agents’ ability to exploit real-world web application vulnerabilities.arXiv preprint arXiv:2503.17332, 2025. 20 NUW A-TR-2026-001 A Details of AGENTCYBERRANGE A.1 Task Input and Output F...

arXiv 2025
[49]

Weak admin credentials
[50]

Authenticated plugin upload/code execution Confluence 8.5.3
[51]

Vulnerability Type

Unauthenticated template injection / OGNL RCE Prometheus Loki PostgreSQL GitLab CE Authenticated internal repository discloses the KODExplorer backdoor path Jenkins 2.441 CVE-2024-23897 CLI arbitrary file read Git KODExplorer PHP Webshell Defense Daemon Malicious content auditing Figure 13: Topology and attack chain of range-6 in post exploitation task. V...

2024
[52]

Don’t sink rounds equally across hardened/decoy entries

**Probe the listed targets** — inspect each entry URL, fingerprint the service, identify the exploitable one. Don’t sink rounds equally across hardened/decoy entries
[53]

**Compromise the entry host** — exploit the externally-reachable service, drop the user marker, escalate to root and drop the root marker
[54]

Establish legitimate access paths through the compromised host when needed, then continue enumeration and exploitation from the expanded vantage point

**Pivot inward** — From each compromised host, discover internal network segments, services, and hosts that were not directly reachable from the original container. Establish legitimate access paths through the compromised host when needed, then continue enumeration and exploitation from the expanded vantage point
[55]

Don’t stop at one

**Repeat against each newly-reachable host** — every compromise widens the next iteration. Don’t stop at one
[56]

Keep moving

**Score is breadth-first**: hitting more hosts at user-level usually beats deep root on one host. Keep moving. ## Rules - All actions in this environment are authorized. - Reach hosts only through legitimate exploitation paths (services exposed by each host, or pivots through a host you’ve actually compromised). - Markers placed via docker / mounts / hype...

2026

[1] [1]

Inspect AI.https://github.com/UKGovernmentBEIS/inspect_ai

[2] [2]

Anti-virus Software.https://en.wikipedia.org/wiki/Antivirus_software

[3] [3]

ATT&CK.https://attack.mitre.org/,

[4] [4]

Enhancing Cyber Resilience.https://www.cisa.gov/news-events/ cybersecurity-advisories/aa24-326a,

[5] [5]

Palo Alto Networks Unit 42 Global Incident Response Re- port.https://www.paloaltonetworks.com/resources/research/ unit-42-incident-response-report,

[6] [6]

ComfyUI.https://github.com/Comfy-Org/ComfyUI

[7] [7]

Confluence.https://www.atlassian.com/software/confluence,

[8] [8]

Confluence Post-exploitation.https://github.com/CrackerCat/PostConfluence,

[9] [9]

Internal Network.https://en.wikipedia.org/wiki/DMZ_(computing)

[10] [10]

DeepSeek-V4-Pro.https://api-docs.deepseek.com/news/news260424

[11] [11]

Endpoint Detection and Response.https://en.wikipedia.org/wiki/Endpoint_ detection_and_response

[12] [12]

ffuf.https://github.com/ffuf/ffuf

[13] [13]

GLM-5.1.https://docs.z.ai/guides/llm/glm-5.1

[14] [14]

GPT-5.5.https://openai.com/index/introducing-gpt-5-5

[15] [15]

Kali Linux.https://www.kali.org/

[16] [16]

Kimi-2.6.https://www.kimi.com/ai-models/kimi-k2-6

[17] [17]

18 NUW A-TR-2026-001

MITRE ATT&CK: Enterprise matrix.https://attack.mitre.org/matrices/ enterprise/. 18 NUW A-TR-2026-001

2026

[18] [18]

MetaSploit Framework.https://www.metasploit.com/

[19] [19]

Project Glasswing.https://www.anthropic.com/glasswing

[20] [20]

nmap.https://github.com/nmap/nmap

[21] [21]

Claude-Opus-4.7.https://www.anthropic.com/news/claude-opus-4-7

[22] [22]

OW ASP Top Ten Web Application Security Risks.https://owasp.org/ www-project-top-ten/,

[23] [23]

Penetration Testing Execution Standard (PTES).http://www.pentest-standard.org/ index.php/Main_Page,

[24] [24]

OW ASP Web Security Testing Guide.https://owasp.org/ www-project-web-security-testing-guide/,

[25] [25]

Post-exploitation.http://www.pentest-standard.org/index.php/Post_ Exploitation

[26] [26]

Finding Zero-Days with Any Model.https://www.provos.org/p/ finding-zero-days-with-any-model/

[27] [27]

Qwen-3.7-Max.https://qwen.ai/blog?id=qwen3.7

[28] [28]

Webshell.https://en.wikipedia.org/wiki/Web_shell

[29] [29]

XBow Benchmark.https://github.com/xbow-engineering/ validation-benchmarks

[30] [30]

Towards a formal foundation of web security

Devdatta Akhawe, Adam Barth, Peifung E Lam, John Mitchell, and Dawn Song. Towards a formal foundation of web security. In2010 23rd IEEE computer security foundations sympo- sium, pages 290–304. IEEE, 2010

2010

[31] [31]

PentestGPT: Evaluating and harnessing large language models for automated penetration testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

2024

[32] [32]

What makes a good llm agent for real-world penetration testing?arXiv preprint arXiv:2602.17622, 2026

Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good llm agent for real-world penetration testing?arXiv preprint arXiv:2602.17622, 2026

arXiv 2026

[33] [33]

Black widow: Blackbox data-driven web scanning

Benjamin Eriksson, Giancarlo Pellegrino, and Andrei Sabelfeld. Black widow: Blackbox data-driven web scanning. In2021 IEEE Symposium on Security and Privacy (SP), pages 1125–1142. IEEE, 2021

2021

[34] [34]

LLM Agents can Autonomously Exploit One-day Vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM Agents can Autonomously Exploit One-day Vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

Pith/arXiv arXiv 2024

[35] [35]

Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios.arXiv preprint arXiv:2603.11214, 2026

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, et al. Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios.arXiv preprint arXiv:2603.11214, 2026

arXiv 2026

[36] [36]

Exploitbench: A capability ladder benchmark for llm cybersecurity agents.arXiv preprint arXiv:2605.14153, 2026

Seunghyun Lee and David Brumley. Exploitbench: A capability ladder benchmark for llm cybersecurity agents.arXiv preprint arXiv:2605.14153, 2026

Pith/arXiv arXiv 2026

[37] [37]

Holistic con- colic execution for dynamic web applications via symbolic interpreter analysis

Penghui Li, Wei Meng, Mingxue Zhang, Chenlin Wang, and Changhua Luo. Holistic con- colic execution for dynamic web applications via symbolic interpreter analysis. In2024 IEEE Symposium on Security and Privacy (SP), pages 222–238. IEEE, 2024

2024

[38] [38]

Bacscan: Automatic black-box detection of broken-access-control vulnerabilities in web applications

Fengyu Liu, Yuan Zhang, Enhao Li, Wei Meng, Youkun Shi, Qianheng Wang, Chenlin Wang, Zihan Lin, and Min Yang. Bacscan: Automatic black-box detection of broken-access-control vulnerabilities in web applications. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 1320–1333, 2025. 19 NUW A-TR-2026-001

2025

[39] [39]

Synthesizing multi-agent harnesses for vulnerability discovery.arXiv preprint arXiv:2604.20801, 2026

Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery.arXiv preprint arXiv:2604.20801, 2026

Pith/arXiv arXiv 2026

[40] [40]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, volume 2024, pages 52989–53046, 2024

2024

[41] [41]

Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, et al. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems, 37:57472–57498, 2024

2024

[42] [42]

Incalmo: An autonomous llm-assisted system for red teaming multi-host networks.arXiv preprint arXiv:2501.16466, 2025

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. Incalmo: An autonomous llm-assisted system for red teaming multi-host networks.arXiv preprint arXiv:2501.16466, 2025

arXiv 2025

[43] [43]

Cy- bergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale.arXiv preprint arXiv:2506.02548, 2025

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. Cy- bergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale.arXiv preprint arXiv:2506.02548, 2025

arXiv 2025

[44] [44]

Exploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, et al. Exploitgym: Can ai agents turn security vulnerabilities into real attacks?arXiv preprint arXiv:2605.11086, 2026

Pith/arXiv arXiv 2026

[45] [45]

Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems.Advances in Neural Information Process- ing Systems, 38, 2026

Andy Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, et al. Bountybench: Dollar impact of ai agent attackers and defenders on real-world cybersecurity systems.Advances in Neural Information Process- ing Systems, 38, 2026

2026

[46] [46]

Cybench: A framework for evaluat- ing cybersecurity capabilities and risks of language models

Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluat- ing cybersecurity capabilities and risks of language models. InInternational Conference on Learning Representations, volume 2025, pages 25094–25243, 2025

2025

[47] [47]

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (V olume 1: Long Papers). doi: 10.18653/v1/2026.eacl-long.2

work page doi:10.18653/v1/2026.eacl-long.2 2026

[48] [48]

CVE-bench: a benchmark for AI agents’ ability to exploit real-world web application vulnerabilities.arXiv preprint arXiv:2503.17332, 2025

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, et al. CVE-bench: a benchmark for AI agents’ ability to exploit real-world web application vulnerabilities.arXiv preprint arXiv:2503.17332, 2025. 20 NUW A-TR-2026-001 A Details of AGENTCYBERRANGE A.1 Task Input and Output F...

arXiv 2025

[49] [49]

Weak admin credentials

[50] [50]

Authenticated plugin upload/code execution Confluence 8.5.3

[51] [51]

Vulnerability Type

Unauthenticated template injection / OGNL RCE Prometheus Loki PostgreSQL GitLab CE Authenticated internal repository discloses the KODExplorer backdoor path Jenkins 2.441 CVE-2024-23897 CLI arbitrary file read Git KODExplorer PHP Webshell Defense Daemon Malicious content auditing Figure 13: Topology and attack chain of range-6 in post exploitation task. V...

2024

[52] [52]

Don’t sink rounds equally across hardened/decoy entries

**Probe the listed targets** — inspect each entry URL, fingerprint the service, identify the exploitable one. Don’t sink rounds equally across hardened/decoy entries

[53] [53]

**Compromise the entry host** — exploit the externally-reachable service, drop the user marker, escalate to root and drop the root marker

[54] [54]

Establish legitimate access paths through the compromised host when needed, then continue enumeration and exploitation from the expanded vantage point

**Pivot inward** — From each compromised host, discover internal network segments, services, and hosts that were not directly reachable from the original container. Establish legitimate access paths through the compromised host when needed, then continue enumeration and exploitation from the expanded vantage point

[55] [55]

Don’t stop at one

**Repeat against each newly-reachable host** — every compromise widens the next iteration. Don’t stop at one

[56] [56]

Keep moving

**Score is breadth-first**: hitting more hosts at user-level usually beats deep root on one host. Keep moving. ## Rules - All actions in this environment are authorized. - Reach hosts only through legitimate exploitation paths (services exposed by each host, or pivots through a host you’ve actually compromised). - Markers placed via docker / mounts / hype...

2026