Building Better Environments for Autonomous Cyber Defence
Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3
The pith
A framework decomposes the interface between RL cyber environments and real systems to guide better agent training and evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present two main contributions: a framework that decomposes the interface between RL cyber environments and real systems, and a set of guidelines for environment development and agent evaluation, drawn directly from the shared experience of workshop participants with hands-on experience in this area.
What carries the argument
The framework for decomposing the interface between RL cyber environments and real systems, which isolates connection points so that design choices in simulation can be made more deliberately and transferred more reliably.
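A minimal sketch of what isolating connection points could look like in code. This is an illustration under stated assumptions, not the paper's framework: the names `ObservationAdapter`, `ActionAdapter`, and `DefenceInterface`, the host states, and the command strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


class ObservationAdapter(Protocol):
    """Hypothetical connection point: raw system state -> agent observation."""

    def to_observation(self, raw_state: dict) -> list[float]: ...


class ActionAdapter(Protocol):
    """Hypothetical connection point: agent action index -> system command."""

    def to_command(self, action: int) -> str: ...


@dataclass
class SimObservationAdapter:
    hosts: list[str]

    def to_observation(self, raw_state: dict) -> list[float]:
        # One feature per host: 1.0 if the simulator marks it compromised.
        return [1.0 if raw_state.get(h) == "compromised" else 0.0 for h in self.hosts]


@dataclass
class SimActionAdapter:
    hosts: list[str]

    def to_command(self, action: int) -> str:
        # Action 0 is a no-op; action i (i >= 1) restores the i-th host.
        if action == 0:
            return "noop"
        return f"restore {self.hosts[action - 1]}"


class DefenceInterface:
    """Agent-facing interface; the adapters behind it can be swapped."""

    def __init__(self, obs_adapter: ObservationAdapter, act_adapter: ActionAdapter):
        self.obs_adapter = obs_adapter
        self.act_adapter = act_adapter

    def observe(self, raw_state: dict) -> list[float]:
        return self.obs_adapter.to_observation(raw_state)

    def act(self, action: int) -> str:
        return self.act_adapter.to_command(action)
```

In this sketch, replacing the simulator adapters with ones that read from and write to a real system would leave the agent-facing `observe` and `act` calls unchanged, which is the kind of deliberate separation the framework is described as supporting.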
If this is right
- Environment designers gain a clearer separation of concerns when linking simulation to real systems, reducing repeated trial-and-error.
- Agent evaluation becomes more consistent across different projects because the guidelines supply shared criteria.
- Training runs are more likely to produce agents whose behaviors transfer to operational networks in sensitive sectors.
- Common hazards in environment construction are identified and can be avoided from the start of new projects.
Where Pith is reading between the lines
- Adoption of the interface framework could make it easier to compare results across separate RL ACD research groups by standardizing how environments are described.
- The guidelines implicitly highlight the need for future benchmarks that test generalization from simulation to live network conditions.
- Extending the decomposition approach to other RL security tasks, such as intrusion response, could follow the same structure.
- A practical next step would be to release example environment templates built according to the workshop practices for others to adapt.
Load-bearing premise
The collective tradecraft and domain knowledge from the workshop participants forms a comprehensive and transferable set of best practices that will produce RL environments able to generalize to government and critical infrastructure networks.
What would settle it
A side-by-side test in which multiple teams build RL cyber environments with and without the guidelines, then measure whether agents trained in the guideline-compliant environments achieve higher defence success rates against realistic attack sequences on a held-out network simulation.
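One hedged way to analyse such a comparison is a percentile bootstrap on the difference in success rates. The sketch below assumes each evaluation episode yields a binary outcome (1 for a successful defence, 0 otherwise). The function names and the choice of bootstrap are assumptions made for illustration, not a procedure taken from the paper.

```python
import random


def success_rate(outcomes: list[int]) -> float:
    """Fraction of episodes in which the defence succeeded."""
    return sum(outcomes) / len(outcomes)


def bootstrap_diff_ci(
    treated: list[int],
    control: list[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile-bootstrap interval for success_rate(treated) - success_rate(control).

    `treated` holds outcomes for agents trained in guideline-compliant
    environments, `control` for agents trained without the guidelines.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(success_rate(t) - success_rate(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that lies entirely above zero would be evidence for the guidelines in this setup. Resampling at the level of teams or environments rather than episodes may be more appropriate when episodes within one environment are correlated.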
Original abstract
In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands-on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL-based ACD environment development and agent evaluation, based on the key findings from our workshop.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript details outcomes from a November 2025 workshop on reinforcement learning (RL) environments for autonomous cyber defence (ACD). Drawing on input from participants across academia, industry, and government, it presents a framework for decomposing the interface between RL cyber environments and real systems, together with guidelines for environment development and agent evaluation. The work positions these as a synthesis of tradecraft and domain knowledge to address gaps in the literature, with particular attention to applications in government and critical infrastructure networks.
Significance. If the framework and guidelines accurately reflect transferable practitioner insights, the paper could help standardize RL environment design for cyber defence and reduce common implementation hazards. The interdisciplinary synthesis is a clear strength and fills a documented gap in comprehensive resources. However, the absence of quantitative validation, formal proofs, or comparative experiments limits the result to a consolidation of expert opinion rather than a demonstrably superior methodology.
minor comments (2)
- The abstract announces the twofold contributions but provides no high-level outline of the framework's decomposition steps; adding one sentence would improve immediate accessibility for readers.
- Guidelines are presented as synthesized best practice; including at least one concrete workshop-derived example per major guideline would make the advice more actionable without altering the synthesis nature of the work.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation of minor revision. The manuscript is a synthesis of practitioner insights from the November 2025 workshop rather than an empirical study, and we have clarified this scope in the revision to better align with the referee's observations on the nature of the contribution.
Point-by-point responses
- Referee: However, the absence of quantitative validation, formal proofs, or comparative experiments limits the result to a consolidation of expert opinion rather than a demonstrably superior methodology.
  Authors: We agree that the work does not include new quantitative validation, formal proofs, or comparative experiments, as its purpose is to consolidate tradecraft and domain knowledge shared by workshop participants from academia, industry, and government. This is explicitly stated in the abstract and introduction as a synthesis addressing gaps in the literature. The framework and guidelines are presented as best practices distilled from expert input rather than a validated superior approach. In the revised manuscript, we have added a short clarifying paragraph in Section 1 (Introduction) to emphasize the scope as expert consensus synthesis and note the absence of empirical benchmarking, which directly addresses this point without changing the core contributions.
  Revision: yes
Circularity Check
No significant circularity
The paper is a workshop synthesis report whose contributions consist of a framework for decomposing RL-to-real-system interfaces and collated practitioner guidelines drawn from participant tradecraft. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claims rest on accurate reporting of external workshop outputs rather than any self-referential reduction, self-citation chain, or renaming of known results. The argument is therefore self-contained against external benchmarks with no load-bearing steps that reduce to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert consensus from a multi-sector workshop accurately captures transferable best practices for RL environment design in cyber defence.