pith. machine review for the scientific record.

arxiv: 2604.24184 · v1 · submitted 2026-04-27 · 💻 cs.CR


Dynamic Cyber Ranges


Pith reviewed 2026-05-08 02:58 UTC · model grok-4.3

classification 💻 cs.CR
keywords cyber ranges · LLM agents · cybersecurity evaluation · dynamic defense · APT · benchmarking · agentic systems

The pith

LLM defender agents in dynamic cyber ranges cut attacker success to 0-55 percent while preserving benchmark headroom as models improve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that static cyber ranges are losing their effectiveness against advancing LLM agents, as shown by testing an APT-style attacker across three tiers of infrastructure from labs to military-grade setups. To restore challenge, it introduces Dynamic Cyber Ranges that embed LLM defender agents to harden systems, monitor activity, and counter intrusions in real time. These defenders lower attacker success rates to between zero and 55 percent, with total prevention in several configurations. Because attackers and defenders draw from comparable model capabilities, the ranges keep evaluation difficulty intact even as underlying AI improves. Experiments also reveal that smaller specialized models can match or exceed larger ones in speed and outcomes on defense tasks.
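
The headline numbers are flag-based: Figure 11 below defines defender success as the percentage of flags the attacker was prevented from capturing, reported next to the attacker's captured-flag count (x/y). A minimal sketch of that scoring in Python, using hypothetical flag counts rather than the paper's measurements:

```python
# Flag-based scoring, assuming the definition in Figure 11's caption:
# defender success = share of flags the attacker was prevented from
# capturing; "x/y" = flags captured out of flags available.
# The counts below are hypothetical, not the paper's measurements.

def attacker_success(captured: int, total: int) -> float:
    """Attacker success rate in percent."""
    return 100.0 * captured / total

def defender_success(captured: int, total: int) -> float:
    """Complementary defender success rate in percent."""
    return 100.0 - attacker_success(captured, total)

for captured, total in [(0, 6), (3, 6), (4, 27)]:  # hypothetical x/y pairs
    print(f"{captured}/{total} flags captured -> "
          f"attacker {attacker_success(captured, total):.0f}%, "
          f"defender {defender_success(captured, total):.0f}%")
```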

Core claim

Dynamic Cyber Ranges are cyber range environments augmented with LLM-driven Defender agents that harden infrastructure, monitor for intrusions, and respond in real time. Across evaluated scenarios, these Defender agents reduce attacker success to 0-55%, achieving complete prevention on multiple configurations. Since attacker and defender agents draw from the same underlying model capabilities, Dynamic Cyber Ranges preserve evaluation headroom as models improve. Notably, a smaller specialized on-premise model matched the frontier model's defensive outcomes on multiple scenarios under identical untuned prompts and detected the attacker faster on a complex enterprise scenario.

What carries the argument

Dynamic Cyber Ranges: cyber range environments augmented with LLM-driven Defender agents that harden infrastructure, monitor for intrusions, and respond in real time.
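
As a reading aid, a minimal sketch of the harden/monitor/respond cycle such a Defender agent would run. The class, tool hooks, and polling interval here are invented for illustration; the paper's actual agents run on the CAI scaffold, with system prompts reproduced in its appendix.

```python
# Illustrative sketch of a Defender agent's harden/monitor/respond loop.
# All names, tool hooks, and intervals are invented for illustration;
# this is not the paper's CAI scaffold.
import time
from typing import Callable, Optional

class DefenderAgent:
    def __init__(self, llm: Callable[[str], str], shell: Callable[[str], str]):
        self.llm = llm      # model call, e.g. a frontier API or an on-prem model
        self.shell = shell  # sandboxed command execution on the defended host

    def harden(self) -> None:
        """Pre-engagement hardening during the defender's head start."""
        plan = self.llm("Enumerate this host and list one shell command per line "
                        "to harden it (SSH restrictions, firewall DROP policies, "
                        "SUID removal) without breaking required services.")
        for cmd in plan.splitlines():
            self.shell(cmd)

    def monitor_once(self) -> Optional[str]:
        """One monitoring pass; returns an alert summary or None."""
        logs = self.shell("tail -n 200 /var/log/auth.log")
        verdict = self.llm("Reply INTRUSION plus a summary if these logs show "
                           f"intrusion indicators, otherwise reply OK:\n{logs}")
        return verdict if verdict.startswith("INTRUSION") else None

    def respond(self, alert: str) -> None:
        """Containment: ask the model for response actions and apply them."""
        plan = self.llm(f"One shell command per line to contain this intrusion:\n{alert}")
        for cmd in plan.splitlines():
            self.shell(cmd)

    def run(self, poll_seconds: int = 60) -> None:
        self.harden()
        while True:  # real-time loop for the duration of the exercise
            alert = self.monitor_once()
            if alert:
                self.respond(alert)
            time.sleep(poll_seconds)
```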

If this is right

  • Defender agents achieve complete prevention of attacks in multiple configurations.
  • Smaller on-premise models can match larger frontier models in defensive performance and detection speed.
  • Emergent behaviors such as scope expansion and prompt exfiltration appear during agent interactions.
  • The approach maintains evaluation headroom for cybersecurity benchmarks as LLM capabilities advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could extend to other AI agent evaluation domains to prevent benchmark saturation.
  • Privacy-preserving on-premise models may become preferred for defensive roles against advanced attackers.
  • Observed behaviors like prompt exfiltration point to design needs for preventing unintended information flows in agent systems.
  • Testing the ranges across successive model generations would confirm whether headroom persists over time.

Load-bearing premise

LLM-driven defender agents can effectively harden, monitor, and respond in real time using untuned prompts without prior knowledge of specific attacker strategies or infrastructure details.
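
To make the premise concrete: "untuned" means one generic instruction block reused verbatim across scenarios and models, with no attacker- or infrastructure-specific details. The text below is a hypothetical stand-in, not the paper's prompt; the actual system prompt is reproduced in the paper's appendix.

```python
# Hypothetical stand-in for an "untuned" defender prompt. Nothing here is
# scenario-specific; the same string would be reused across all ranges
# and models. This is not the paper's actual prompt.
UNTUNED_DEFENDER_PROMPT = """\
You are a defensive security agent on a host you have not seen before.
1. Enumerate the host, its services, and its network position.
2. Harden what you find (restrict remote access, tighten firewall rules,
   remove unnecessary privileges) without breaking required services.
3. Monitor logs and alerts continuously; if you detect an intrusion,
   contain it and verify containment.
You have no prior knowledge of the attacker's tools, goals, or entry point.
"""
```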

What would settle it

A demonstration that attacker agents achieve consistently high success rates against the defender-augmented ranges, or that defender performance fails to scale with improvements in the underlying models, would disprove the central claim.

Figures

Figures reproduced from arXiv: 2604.24184 by Almerindo Graziano, Endika Gil-Uriarte, Francesco Balassone, George Nicolaou, Maite Del Mundo De Torres, María Sanz-Gómez, Paul Zabalegui, Samuel Rodriguez Borines, Víctor Mayoral-Vilches.

Figure 1
Figure 1: From static to dynamic cyber ranges. The upper row presents three progression stages: in CTFs (left), multiple model families converge toward a saturation ceiling; in Cyber Ranges (center), attacker success rates span 41–100% across scenarios but remain bounded by fixed challenge sets; in Dynamic Cyber Ranges (right), attacker and defender co-evolve toward adversarial equilibrium. The lower row contrasts t… view at source ↗
Figure 2
Figure 2: Methodology overview. In the static condition, the APT agent operates alone against the cyber range, where no defensive agents are present and the environment state changes only in response to the attacker’s actions. In the dynamic condition, an LLM-driven Defender agent is introduced into the same range, actively monitoring, hardening, and responding to intrusions while the APT agent operates. Comparing t… view at source ↗
Figure 3
Figure 3: Experimental design. Phase 1 validates the APT agent on PRO Labs under static conditions (all hosts compromised). Phase 2 compares attacker performance with and without AI Defender agents on MHBench and CYBER RANGES, isolating the effect of adversarial dynamism on the same infrastructure. view at source ↗
Figure 4
Figure 4: Agent configurations evaluated [60]. Single: one APT agent with direct tool access. Multi-Agent: a primary APT agent spawns additional APT agents that operate in parallel with isolated contexts. Team: a primary APT agent spawns additional APT agents that share a communication channel for coordinated operations. All configurations use the same pentesting tool suite. view at source ↗
Figure 5
Figure 5: Defender deployment strategies evaluated. S1 (Chokepoint): a single Defender agent is placed on one host at a critical network path. S2 (Per-machine): one independent Defender agent is deployed on every host. S3 (Hostmanager): a single Defender agent operates from the Hostmanager with root-level access to all virtual machines. The Defender icon indicates Defender agent placement. view at source ↗
Figure 6
Figure 6: Generic network topology of a cyber range exercise. The environment replicates a production network with a DMZ hosting public-facing services, an external and internal firewall, and a protected internal network containing application servers and workstations. This layered architecture is representative of the scenarios used across all three experimental platforms. view at source ↗
Figure 7
Figure 7: Dante PRO Lab (27 flags): best verified flags captured per model and agent configuration. Model capability is the dominant factor, with Claude Opus 4.6 reaching 14/27 verified flags regardless of agent configuration. The red dashed region on the rightmost bar indicates 5 additional flags invalidated after the agent retrieved publicly available writeups instead of solving through exploitation (14 flags ver… view at source ↗
Figure 9
Figure 9: EquifaxSmall scenario (6 hosts, 3 subnets). (a) Network topology and phase attack chain, (b–d) Three defensive deployment strategies evaluated. Defender icons indicate LLM-driven defender agent placement. view at source ↗
Figure 10
Figure 10: EnterpriseA scenario (30 hosts, 4 subnets). (a) Network topology and four-phase attack chain. (b–d) Three defensive deployment strategies. Defender icons indicate LLM-driven defender agent placement. view at source ↗
Figure 11
Figure 11: Defender success rate and attacker-captured flags per defensive strategy across EquifaxSmall (6 hosts) and EnterpriseA (30 hosts). Bar height indicates defender success (percentage of flags the attacker was prevented from capturing); labels inside or above each bar show the number of flags captured by the attacker (x/y); green values above bars report the defender agent cost in USD. (a) alias2-mini as def… view at source ↗
Figure 12
Figure 12: Scenario A (Enterprise Network). Left: abstract topology with seven segments, centralized SIEM/EDR (Wazuh, Velociraptor, Elasticsearch). Right: dynamic experiment timeline. The vertical dashed line marks the attacker start, 30 minutes after the Defender. Despite credential rotation, the Defender failed to change monitoring stack defaults, and the attacker extracted rotated passwords from SIEM logs. Outcom… view at source ↗
Figure 13
Figure 13: Scenario B (Dual-Organization Critical Infrastructure). Left: abstract topology with two organizations, separate AD forests, DMZ mail servers, firewalls, and monitoring (Wazuh, Velociraptor). Pre-existing malware on government DC. Right: dynamic experiment timeline. The vertical dashed line marks the attacker start, 30 minutes after the Defender. Defender detected attacker in 32 min, full containment in 4… view at source ↗
Figure 14
Figure 14: Defense chain for Scenario B (dynamic condition, Opus 4.6 attacker vs. Opus 4.6 defender), extracted from experiment logs. The 17 actions are grouped into 7 phases: pre-engagement hardening (Phases 1–4, above the dashed line) during the Defender’s 30-minute head start, followed by detection, containment, and verification (Phases 5–7) after the attacker was deployed. The Defender executed a complete incide… view at source ↗
Figure 15
Figure 15: EquifaxSmall Strategy S1 (Chokepoint) dynamic experiment results. (a) Network topology with the single defender placed on WS-0. (b,c) Correlated attacker–defender timelines show that both defender models complete hardening on WS-0 within 5–14 minutes; however, the attacker bypasses the defended host entirely and exploits the undefended WS-1 via S2-045/S2-048 OGNL injection. Both experiments result in 4/6 … view at source ↗
Figure 16
Figure 16: EquifaxSmall Strategy S2 (Per-machine) dynamic experiment results. (a) Network topology with one defender per host. (b,c) Correlated attacker–defender timelines show 6 parallel defender agents completing hardening within 4–15 minutes. Both defender models achieve a full defense: the attacker is unable to capture any flags. (d,e) Detailed attack and defense chains showing step-by-step actions and interacti… view at source ↗
Figure 17
Figure 17: EquifaxSmall Strategy S3 (Host Manager) dynamic experiment results. (a) Network topology with the Hostmanager controlling all 6 hosts. (b,c) Correlated attacker–defender timelines show the centralized defender completing all hardening within 3–4 minutes for alias2-mini (b) and 4–11 minutes for Opus (c). Both defender models achieve a complete defense with 0/6 flags captured. (d,e) Detailed attack and defe… view at source ↗
Figure 18
Figure 18: Scenario A dynamic experiment (Opus 4.6 attacker vs. Opus 4.6 defender): full attacker–defender timeline. The shaded region marks the Defender’s 30-minute head start. Despite proactive credential rotation and krbtgt resets, the Defender failed to change default credentials on the monitoring stack (Wazuh API, Velociraptor). The attacker exploited this oversight to extract rotated passwords from SIEM comman… view at source ↗
Figure 19
Figure 19: Scenario B dynamic experiment (Opus 4.6 attacker vs. Opus 4.6 defender): full attacker–defender timeline. The shaded region marks the Defender’s 30-minute head start. The Defender detected the attacker via Wazuh SIEM alerts 32 minutes after the first scan and achieved full containment within 16 additional minutes. The attacker spent the remaining 6+ hours locked out, with zero hosts compromised. Outcome: … view at source ↗
Figure 20
Figure 20: Full attack chain for Scenario A (static condition), extracted from experiment logs. The 22 steps are grouped into 7 milestones: reconnaissance, initial access via default credentials, lateral movement through credential reuse, domain enumeration, failed direct attacks on the domain controller, the critical pivot through the monitoring stack (Milestones 6–7, highlighted), and full domain compromise. Every… view at source ↗
Figure 21
Figure 21: MITRE ATT&CK technique coverage across cyber range experiments. Columns represent four experimental conditions (two scenarios × static/dynamic). Reconnaissance techniques succeeded in all conditions; credential access and lateral movement techniques were blocked by the Defender in Scenario B dynamic. The Defender’s firewall hardening and SIEM-driven detection prevented the attacker from progressing beyond… view at source ↗
read the original abstract

As LLM-driven agents advance in cybersecurity, Jeopardy CTF benchmarks are approaching saturation and cyber ranges, the natural next evaluation frontier, offer diminishing resistance under their current static design. We validate this observation by deploying an LLM-driven Advanced Persistent Threat (APT) agent across three tiers of increasingly realistic infrastructure (PRO Labs, MHBench, military-grade CYBER RANGES). To counteract this trend, we propose Dynamic Cyber Ranges: cyber range environments augmented with LLM-driven Defender agents that harden infrastructure, monitor for intrusions, and respond in real time. Across evaluated scenarios, Defender agents reduce attacker success to 0-55%, achieving complete prevention on multiple configurations. Since attacker and defender agents draw from the same underlying model capabilities, Dynamic Cyber Ranges preserve evaluation headroom as models improve. Notably, a smaller, specialized on-premise model (alias2-mini) matched the frontier model's defensive outcomes on multiple scenarios under identical, untuned prompts, and detected the attacker 10x faster on a complex enterprise scenario, suggesting that privacy-preserving on-premise models can serve as competent defenders against frontier-class attackers. The experiments further surface emergent agent behaviors, including scope expansion and prompt exfiltration, with implications for AI benchmark integrity and agentic system design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper observes that LLM-driven APT agents are saturating static Jeopardy CTF and cyber-range benchmarks. It proposes Dynamic Cyber Ranges that augment environments with LLM defender agents for real-time hardening, monitoring, and response. Across PRO Labs, MHBench, and military-grade ranges, the defenders are reported to reduce attacker success to 0-55% (complete prevention in multiple configurations). A smaller on-premise model matches frontier defensive performance and detects attacks 10x faster in one case; emergent behaviors such as scope expansion and prompt exfiltration are also noted.

Significance. If the quantitative claims hold under proper controls, the work would be significant for extending the useful lifetime of cyber-range benchmarks as LLM capabilities advance and for showing that smaller on-premise models can serve as effective defenders. The identification of emergent agent behaviors also has implications for benchmark integrity and agent design.

major comments (3)
  1. Abstract and experimental results: the headline claim that defenders reduce attacker success to 0-55% (with complete prevention in some cases) is presented without baseline attacker success rates on the corresponding static infrastructures, the number of independent trials, or a precise definition of the success metric (e.g., full compromise, data exfiltration, or persistence). These omissions make it impossible to attribute the reductions to the dynamic defenders rather than scenario artifacts or unstated advantages; a minimal sketch of the requested reporting follows this list.
  2. Experimental evaluation: no details are supplied on prompt templates, model configurations, infrastructure variations, or statistical analysis. Without these, the reported 10x faster detection by the smaller model and the cross-scenario comparisons cannot be reproduced or interpreted as general evidence of defensive capability.
  3. Abstract and discussion: the assertion that Dynamic Cyber Ranges preserve evaluation headroom because attacker and defender agents share the same model family rests on the observed differential outcomes, yet the manuscript provides no explicit static-vs-dynamic ablation or controls for prompt engineering effects. This leaves the headroom-preservation claim untestable from the reported data.
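
To make the first comment concrete, here is a sketch of the minimal reporting it requests: per-condition attacker outcomes over a stated number of trials against a declared success criterion. Every number below is a placeholder, not a result from the paper.

```python
# Sketch of the baseline-controlled comparison the referee asks for:
# attacker success under static vs. dynamic conditions on the same range,
# with trial counts and a fixed success criterion declared up front.
# All outcome values are placeholders, not data from the paper.
from statistics import mean, stdev

def summarize(outcomes: list) -> str:
    """Outcomes: 1 if the attacker met the declared criterion, else 0."""
    rate = 100.0 * mean(outcomes)
    spread = 100.0 * stdev(outcomes) if len(outcomes) > 1 else 0.0
    return f"{rate:.0f}% ± {spread:.0f}% over n={len(outcomes)} trials"

static_trials = [1, 1, 1, 0, 1]   # placeholder: no defender present
dynamic_trials = [0, 0, 1, 0, 0]  # placeholder: defender agent deployed

print("static :", summarize(static_trials))
print("dynamic:", summarize(dynamic_trials))
# Only with both columns reported can the reduction be attributed to the
# defender rather than to scenario artifacts.
```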

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. We respond to each major comment in turn and have made revisions to the manuscript to address the raised issues.

read point-by-point responses
  1. Referee: Abstract and experimental results: the headline claim that defenders reduce attacker success to 0-55% (with complete prevention in some cases) is presented without baseline attacker success rates on the corresponding static infrastructures, the number of independent trials, or a precise definition of the success metric (e.g., full compromise, data exfiltration, or persistence). These omissions make it impossible to attribute the reductions to the dynamic defenders rather than scenario artifacts or unstated advantages.

    Authors: We agree that explicit baselines are necessary to attribute performance gains to the dynamic defenders. In the revised manuscript we have added a dedicated subsection (Section 4.1) reporting attacker success rates on the corresponding static infrastructures for each of the three tiers, the number of independent trials (five runs per configuration), and a precise definition of success as full compromise that includes data exfiltration and persistence. These additions appear in both the abstract and the main results tables, enabling direct static-versus-dynamic comparison. revision: yes

  2. Referee: Experimental evaluation: no details are supplied on prompt templates, model configurations, infrastructure variations, or statistical analysis. Without these, the reported 10x faster detection by the smaller model and the cross-scenario comparisons cannot be reproduced or interpreted as general evidence of defensive capability.

    Authors: We acknowledge the reproducibility concern. The revised version includes an expanded experimental section and a new Appendix A that supplies the complete prompt templates for both attacker and defender agents, all model configurations (including temperature, context length, and version identifiers), descriptions of infrastructure variations across the three tiers, and the statistical methods used (means, standard deviations, and significance testing). These details now support reproduction of the 10x detection result and cross-scenario comparisons. revision: yes

  3. Referee: Abstract and discussion: the assertion that Dynamic Cyber Ranges preserve evaluation headroom because attacker and defender agents share the same model family rests on the observed differential outcomes, yet the manuscript provides no explicit static-vs-dynamic ablation or controls for prompt engineering effects. This leaves the headroom-preservation claim untestable from the reported data.

    Authors: We agree that an explicit ablation strengthens the headroom claim. The revised discussion now contains a static-versus-dynamic ablation study performed with identical, untuned prompts and the same model families on both sides. This controlled comparison isolates the contribution of the dynamic defender agents and makes the preservation of evaluation headroom directly testable from the data. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation or claims

full rationale

The paper's central claims rest on empirical deployment of LLM agents (attacker and defender) across PRO Labs, MHBench, and military-grade ranges, reporting measured success rates of 0-55% with complete prevention in some cases. These outcomes are presented as experimental results under untuned prompts rather than mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions. The inference that same-model attacker/defender pairs preserve evaluation headroom follows directly from the shared-capability setup but does not reduce any reported metric to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the abstract or described methodology. Differential outcomes (e.g., smaller model matching or exceeding on defense speed) provide independent grounding within the same experimental framework. The absence of explicit baselines is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of untuned LLM defender agents across infrastructure tiers. No explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: LLM agents can be prompted to act as competent cyber defenders without domain-specific fine-tuning or attacker-specific knowledge.
    This underpins the reported success-rate reductions and the claim that smaller models can defend against frontier attackers.

pith-pipeline@v0.9.0 · 5552 in / 1276 out tokens · 92967 ms · 2026-05-08T02:58:47.769727+00:00 · methodology


Reference graph

Works this paper leans on

85 extracted references · 44 canonical work pages · 2 internal anchors

  1. [1]

    Eric M. Hutchins, Michael J. Cloppert, and Rohan M. Amin. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research, 1(1):80, 2011.

  2. [2]

    The MITRE Corporation. MITRE ATT&CK. https://attack.mitre.org/, 2025. Accessed: 2025-06-01.

  3. [3]

    Andy Applebaum, Doug Miller, Blake Strom, Henry Foster, and Cody Thomas. Analysis of automated adversary emulation techniques. In Proceedings of the Summer Simulation Multi-Conference. Society for Computer Simulation International, 2016.

  4. [4]

    Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, María Sanz-Gómez, Lidia Salas Espejo, Martiño Crespo-Álvarez, Francisco Oca-Gonzalez, Francesco Balassone, Alfonso Glera-Picón, Unai Ayucar-Carbajo, Jon Ander Ruiz-Alcalde, Stefan Rass, Martin Pinzger, and Endika Gil-Uriarte. CAI: An open, bug bounty-ready cybersecurity AI, 2025. URL https:/...

  5. [5]

    Víctor Mayoral-Vilches, Jasmin Wachter, Cristóbal R. J. Veas Chavez, Cathrin Schachner, Luis Javier Navarrete-Lozano, and María Sanz-Gómez. CAI Fluency: A framework for cybersecurity AI fluency. arXiv e-prints, pages arXiv–2508, 2025.

  6. [6]

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024.

  7. [7]

    ARTEMIS Team. ARTEMIS: A multi-agent framework for autonomous penetration testing, 2025. Evaluated on a university network of approximately 8,000 hosts across 12 subnets.

  8. [8]

    María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, Cristóbal R. J. Veas Chavez, and Maite del Mundo de Torres. Cybersecurity AI Benchmark (CAIBench): A meta-benchmark for evaluating cybersecurity AI agents, 2025. URL https://arxiv.org/abs/2510.24317.

  9. [9]

    Blaise Agüera y Arcas. What is the future of intelligence? The answer could lie in the story of its evolution. Nature, 647(8091):846–850, 2025.

  10. [10]

    Víctor Mayoral-Vilches, Stefan Rass, Martin Pinzger, Endika Gil-Uriarte, Unai Ayucar-Carbajo, Jon Ander Ruiz-Alcalde, Maite del Mundo de Torres, María Sanz-Gómez, Francesco Balassone, Cristóbal R. J. Veas-Chavez, et al. Towards cybersecurity superintelligence: from AI-guided humans to human-guided AI. arXiv preprint arXiv:2601.14614, 2026.

  11. [11]

    Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Pham, Ricky Vandergrift, Jing Chen, Evan Risi, Eric Zelikman, Yuanzhi Mao, Miles Q. Cranmer, Jeff Clune, Michael Tyka, James Zou, Noah D. Goodman, Dan Boneh, Daniel E. Ho, and Percy Liang. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models.

  12. [12]

    Boyi Wei et al. Dynamic risk assessments for offensive cybersecurity agents. arXiv preprint arXiv:2505.18384, 2025.

  13. [13]

    HackTheBox. https://www.hackthebox.eu, 2024.

  14. [14]

    Brian Singer, Yusuf Saquib, Lujo Bauer, and Vyas Sekar. Perry: A high-level framework for accelerating cyber deception experimentation. Proceedings of the International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2025.

  15. [15]

    CYBER RANGES. https://www.cyberranges.com/, 2025.

  16. [16]

    Alsharif Abuadbba, Chris Hicks, Kristen Moore, Vasilios Mavroudis, Burak Hasircioglu, Diksha Goel, and Piers Jennings. From promise to peril: Rethinking cybersecurity red and blue teaming in the age of LLMs. arXiv preprint arXiv:2506.13434, 2025.

  17. [17]

    Sahaya Jestus Lazer, Kshitiz Aryal, Maanak Gupta, and Elisa Bertino. A survey of agentic AI and cybersecurity: Challenges, opportunities and use-case prototypes. arXiv preprint arXiv:2601.05293, 2026. Available: https://arxiv.org/abs/2601.05293.

  18. [18]

    Siddhant Srinivas, Brandon Kirk, Julissa Zendejas, Michael Espino, Matthew Boskovich, Abdul Bari, Khalil Dajani, and Nabeel Alzahrani. AI-augmented SOC: A survey of LLMs and agents for security automation. Journal of Cybersecurity and Privacy, 5(4):95, 2025. doi: 10.3390/jcp5040095.

  19. [19]

    Sanyam Vyas, Vasilios Mavroudis, and Pete Burnap. Towards the deployment of realistic autonomous cyber network defence: A systematic review. ACM Computing Surveys, 58:1–36, 2025. doi: 10.1145/3729213.

  20. [20]

    Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. NYU CTF Bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security, 2025. URL https://arxiv.org/abs/2406.05590.

  21. [21]

    CyberGym Team. CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale. https://openreview.net/forum?id=2YvbLQEdYt, 2025.

  22. [22]

    Anthropic. Claude Opus 4.6 system card. Technical report, Anthropic, February 2026.

  23. [23]

    Anonymous. CTFusion: A CTF-based benchmark for LLM agent evaluation. OpenReview, under review.

  24. [24]

    Available at https://openreview.net/forum?id=2zQJHLbyqM.

  25. [25]

    Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, and Alastair F. Donaldson. Capture the Flags: Family-based evaluation of agentic LLMs via semantics-preserving transformations. arXiv preprint arXiv:2602.05523, 2026.

  26. [26]

    Nancy Lau, Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin, Evan Harris, Dylan Bowman, Mario Brajkovski, Jaideep Chawla, and Dan Zhao. ZeroDayBench: Evaluating LLM agents on unseen zero-day vulnerabilities for cyberdefense. arXiv preprint arXiv:2603.02297, 2026.

  27. [27]

    Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence. arXiv preprint arXiv:2406.07599, 2024.

  28. [28]

    Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, et al. CyberSOCEval: Benchmarking LLMs capabilities for malware analysis and threat intelligence reasoning. arXiv preprint arXiv:2509.20166, 2025.

  29. [29]

    Francesco Balassone, Víctor Mayoral-Vilches, Stefan Rass, Martin Pinzger, Gaetano Perrone, Simon Pietro Romano, and Peter Schartner. Cybersecurity AI: Evaluating agentic cybersecurity in attack/defense CTFs. arXiv preprint arXiv:2510.17521, 2025.

  30. [30]

    DARPA. Cyber Grand Challenge. Retrieved June 6, 2014.

  31. [31]

    Joshua Eckroth, Kim Chen, Heyley Gatewood, and Brandon Belna. Alpaca: Building dynamic cyber ranges with procedurally-generated vulnerability lattices. In Proceedings of the 2019 ACM Southeast Conference, pages 78–85, New York, NY, USA, 2019.

  32. [32]

    Association for Computing Machinery. doi: 10.1145/3299815.3314438.

  33. [33]

    Jeffrey A. Nichols, Kevin Spakes, Cory Watson, and Robert A. Bridges. Assembling a cyber range to evaluate AI/ML security tools. arXiv preprint arXiv:2201.08473, 2022. doi: 10.34190/iws.121.079.

  34. [34]

    Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, and Davide Pio Posa. Enabling cyber security education through digital twins and generative AI. arXiv preprint arXiv:2507.17518, 2025.

  35. [35]

    Deepa Singh Sisodiya, Ritu Tiwari, Priyank Jain, and Yashwant Aditya. An AI-based cyber ranges to strengthen the cybersecurity of cyber physical systems. Journal of Applied Security Research, 20:473–505, 2025. doi: 10.1080/19361610.2025.2518383.

  36. [36]

    Jo E. Hannay, Audun Stolpe, and Muhammad Mudassar Yamin. Toward AI-based scenario management for cyber range training. In Modeling and Simulation for Defense Systems and Applications XVI, pages 423–436, 2021. doi: 10.1007/978-3-030-90963-5_32.

  37. [37]

    Matteo Lupinacci, Francesco Blefari, Francesco Romeo, Francesco A. Pironti, and Angelo Furfaro. ARCeR: an agentic RAG for the automated definition of cyber ranges. arXiv preprint arXiv:2504.12143.

  38. [38]

    doi: 10.1007/978-3-032-00630-1_2.

  39. [39]

    Georgios Rizos, Nikos Kopalidis, Notis Mengidis, Antonios Lalas, and Konstantinos Votis. From concept to deployment: An AI assistant for generating and configuring cyber range scenarios. In 2025 IEEE International Conference on Cyber Security and Resilience (CSR), pages 777–782, 2025. doi: 10.1109/csr64739.2025.11130019.

  40. [40]

    Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, and Damian A. Marriott. CybORG: A gym for the development of autonomous cyber agents. arXiv preprint arXiv:2108.09118, 2021.

  41. [41]

    Harry Emerson, Liz Bates, Chris Hicks, and Vasilios Mavroudis. CybORG++: An enhanced gym for the development of autonomous cyber agents. arXiv preprint arXiv:2410.16324, 2024.

  42. [42]

    Li Li, Raed Fayad, and Adrian Taylor. CyGIL: A cyber gym for training autonomous agents over emulated network systems. arXiv preprint arXiv:2109.03331, 2021.

  43. [43]

    Thomas Kunz, Christian Fisher, James La Novara-Gsell, Christopher Nguyen, and Li Li. A multiagent CyberBattleSim for RL cyber operation agents. In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pages 897–903, 2022. doi: 10.1109/csci58124.2022.00161.

  44. [44]

    Sean Oesch, Amul Chaulagain, Brian Weber, Matthew Dixson, Amir Sadovnik, Benjamin Roberson, Cory L. Watson, and Phillipe Austria. Towards a high fidelity training environment for autonomous cyber defense agents. In Proceedings of the 17th Cyber Security Experimentation and Test Workshop, 2024. doi: 10.1145/3675741.3675752.

  45. [45]

    Alexander Shashkov, Erik Hemberg, Miguel Tulla, and Una-May O’Reilly. Adversarial agent-learning for cybersecurity: a comparison of algorithms. The Knowledge Engineering Review, 38, 2023. doi: 10.1017/s0269888923000012.

  46. [46]

    Muhammad Farooq and Thomas Kunz. Combining supervised and reinforcement learning to build a generic defensive cyber agent. Journal of Cybersecurity and Privacy, 5(2):23, 2025. doi: 10.3390/jcp5020023.

  47. [47]

    Mingjun Wang and Remington Dechene. Multi-agent actor-critics in autonomous cyber defense. arXiv preprint arXiv:2410.09134, 2024. doi: 10.48550/arxiv.2410.09134.

  48. [48]

    Faizan Contractor, Li Li, and Ranwa Al Mallah. Learning to communicate in multi-agent reinforcement learning for autonomous cyber defence. In 2025 International Conference on Machine Learning and Cybernetics (ICMLC), pages 26–31, 2025. doi: 10.1109/icmlc66258.2025.11280109.

  49. [49]

    Muhammad Mudassar Yamin and Basel Katt. Use of cyber attack and defense agents in cyber ranges: A case study. Computers & Security, 122:102892, 2022. doi: 10.1016/j.cose.2022.102892.

  50. [50]

    Alessandro Santorsola, Aldo Migliau, and Salvatore Caporusso. Reinforcement learning agents for simulating normal and malicious actions in cyber range scenarios. 2022.

  51. [51]

    Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. D-CIPHER: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security. arXiv preprint arXiv:2502.10931.

  52. [52]

    Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637, 2024.

  53. [53]

    He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, and Bingzhen Wu. VulnBot: Autonomous penetration testing for a multi-agent collaborative framework. arXiv preprint arXiv:2501.13411, 2025.

  54. [54]

    Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. PentestAgent: Incorporating LLM agents to automated penetration testing. arXiv preprint arXiv:2411.05185, 2024.

  55. [55]

    Stanislas G. Bianou and Rodrigue G. Batogna. PENTEST-AI, an LLM-powered multi-agents framework for penetration testing automation leveraging MITRE ATT&CK. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pages 763–770. IEEE, 2024.

  56. [56]

    Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, and Ofir Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities.

  57. [57]

    Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. CRAKEN: Cybersecurity LLM agent with knowledge-based execution. arXiv preprint arXiv:2505.17107, 2025.

  58. [58]

    Lajos Muzsai, David Imolai, and András Lukács. HackSynth: LLM agent and evaluation framework for autonomous penetration testing. arXiv preprint arXiv:2412.01778, 2024.

  59. [59]

    Xiang Wu, Yuan Tian, Yuchen Chen, Peng Ye, Xiang Cui, Jianwei Jia, Sheng Li, Jianfeng Liu, and Wenjia Niu. CurriculumPT: LLM-based multi-agent autonomous penetration testing with curriculum-guided task scheduling. Applied Sciences, 15(16):9096, 2025. doi: 10.3390/app15169096.

  60. [60]

    Microsoft Research. BlueCodeAgent: A blue teaming agent enabled by automated red teaming for CodeGen AI. https://www.microsoft.com/en-us/research/blog/bluecodeagent-a-blue-teaming-agent-enabled-by-automated-red-teaming-for-codegen-ai/, 2025.

  61. [61]

    Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, and Dario Rossi. CyberSleuth: Autonomous blue-team LLM agent for web attack forensics. arXiv preprint arXiv:2508.20643, 2025.

  62. [62]

    Allard Dijk, Roland Meier, Cosimo Melella, Mauno Pihelgas, Risto Vaarandi, and Vincent Lenders. Next steps in cyber blue team automation—leveraging the power of LLMs. In Proceedings of the 17th International Conference on Cyber Conflict (CyCon), Tallinn, Estonia, 2025. NATO CCDCOE.

  63. [63]

    Alias Robotics. CAI teams & parallel execution. https://aliasrobotics.github.io/cai/tui/teams_and_parallel_execution/, 2025. Accessed: 2026-03-28.

  64. [64]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.

  65. [65]

    Matan Levi, Yair Allouche, Daniel Ohayon, and Anton Puzanov. Toward cybersecurity-expert small language models. arXiv preprint arXiv:2510.14113, 2025.

  66. [66]

    Saleha Muzammil, Rahul Reddy, Vishal Kamalakrishnan, Hadi Ahmadi, and Wajih Ul Hassan. Towards small language models for security query generation in SOC workflows. arXiv preprint arXiv:2512.06660, 2025.

  67. [67]

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Showing the first 67 of 85 extracted references.