COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies

Abed Showgan; Andres Murillo; Asaf Shabtai; Aviram Zilberman; Chen Frydman; Rami Puzis; Rubin Krief; Sekiya Motoyoshi; Yuval Elovici

arxiv: 2606.30479 · v1 · pith:BB7QLCCGnew · submitted 2026-06-29 · 💻 cs.NI · cs.AI · cs.CR · cs.MA

Pith reviewed 2026-06-30 03:26 UTC · model grok-4.3

classification: 💻 cs.NI, cs.AI, cs.CR, cs.MA

keywords: network mitigation, attack replay, multi-agent LLM, network emulation, automated hardening, connectivity preservation

The pith

A multi-agent LLM system automates network mitigation by generating device commands, testing them via attack replay on an emulator, and checking that connectivity remains intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COHORT as a complete automated pipeline that turns an observed attack into deployable mitigations without relying on human experts for every step. Multiple LLM agents work together to propose changes, turn those changes into actual commands for firewalls and routers, apply them inside a realistic emulator, and then replay the original attack to see if the mitigation stops it. A separate check discards any change that also breaks normal LAN or internet traffic, and approved changes can be stacked to check for side effects. The evaluation on three different network layouts and four attack types shows that nearly half the outputs succeed on both attack disruption and connectivity preservation, more than four times the rate of a simpler single-agent version using the same model.

Core claim

COHORT is the first end-to-end framework to automate this procedure for deployable mitigations. A role-decomposed multi-agent LLM workflow proposes candidates, implements them as real device commands, and refines them through a critique loop, all on a high-fidelity GNS3 emulator running real vendor firmware. Each candidate is evaluated by offensive replay: re-executing the original adversary on the mitigated network for a paired comparison against the unmitigated baseline, rather than the reward-signal or expert-judgment proxies used in prior work. Two further checks complement replay: a connectivity-regression check rejects mitigations that disrupt legitimate LAN or internet connectivity, a

What carries the argument

Role-decomposed multi-agent LLM workflow that proposes mitigations, translates them to real device commands, and validates them by offensive replay plus connectivity checks inside a GNS3 emulator running actual vendor firmware.
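The paired comparison at the heart of this mechanism can be sketched as follows. This is a minimal illustration, not the authors' implementation: the figure text names the attack step success rate (ASSR) as the replay metric, but the `ReplayResult` type, the `judge` function, the way ASSR is computed from step counts, and the rule that any strict ASSR reduction counts as disruption are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayResult:
    """Outcome of one attack replay (hypothetical container)."""
    steps_attempted: int
    steps_succeeded: int

    @property
    def assr(self) -> float:
        # Assumed definition: fraction of attempted attack steps that succeeded.
        if self.steps_attempted == 0:
            raise ValueError("replay executed no attack steps")
        return self.steps_succeeded / self.steps_attempted


def judge(baseline: ReplayResult, mitigated: ReplayResult,
          connectivity_ok: bool) -> str:
    """Compare the mitigated replay against the unmitigated baseline."""
    # A mitigation that breaks legitimate traffic is rejected outright.
    if not connectivity_ok:
        return "rejected: connectivity regression"
    # Assumed threshold: any strict drop in ASSR counts as disruption.
    if mitigated.assr < baseline.assr:
        return "success: attack disrupted, connectivity preserved"
    return "ineffective: attack not disrupted"
```

A call such as `judge(ReplayResult(10, 8), ReplayResult(10, 3), True)` yields the success verdict, while the same replay with `connectivity_ok=False` is rejected regardless of how much the attack was disrupted.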

If this is right

Mitigations emerge as concrete device commands that can be copied directly into production equipment.
Stacking multiple validated mitigations on the same emulated state reveals whether later changes undo the protection of earlier ones.
The same workflow applies across ransomware, lateral movement, DNS exfiltration, and data-theft scenarios without changing the agent roles.
Validation occurs entirely inside the emulator, so the production network is never exposed during testing.
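The stacking behavior in the second point can be sketched as a loop over a persistent state. The figure text says a rejected mitigation causes a rollback to the previously accepted state, and that acceptance requires non-regression rather than strict improvement. Everything else here is assumed for illustration: the function name, the callable interfaces, and the choice to fold the connectivity check into cumulative acceptance.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # mitigations currently applied to the emulated project


def cumulative_evaluation(
    mitigations: Sequence[str],
    replay_assr: Callable[[State], float],
    connectivity_ok: Callable[[State], bool],
) -> List[str]:
    """Stack mitigations one at a time, keeping only non-regressing ones."""
    accepted: List[str] = []
    current_assr = replay_assr(tuple(accepted))  # unmitigated baseline replay
    for mitigation in mitigations:
        candidate = tuple(accepted) + (mitigation,)
        new_assr = replay_assr(candidate)
        # Non-regression: equal ASSR is acceptable, higher ASSR is not.
        if connectivity_ok(candidate) and new_assr <= current_assr:
            accepted.append(mitigation)
            current_assr = new_assr
        # Otherwise the candidate is discarded, which models the rollback
        # to the previously accepted state.
    return accepted
```

Using `<=` rather than `<` is what retains mitigations that add no measurable ASSR reduction, which is the defense-in-depth behavior the source describes.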

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the emulator-to-production gap is small, response time to a new adversary could shrink from weeks of manual work to hours of automated generation and replay.
The method could be extended to test mitigations against variants of the original attack rather than only the exact replay used for validation.
The persistent state used for cumulative evaluation might also serve as a sandbox for exploring how defenders could roll back or adjust mitigations after initial deployment.

Load-bearing premise

The GNS3 emulator running real vendor firmware accurately captures production network behavior for both attack replay and mitigation effectiveness, and LLM-generated device commands execute without introducing undetected side effects or new attack surfaces.

What would settle it

Take one of the mitigations the system produced, apply the exact commands to a live production network matching one of the tested topologies, replay the same attack, and observe whether the attack is stopped while normal traffic continues without regression.

Figures

Figures reproduced from arXiv: 2606.30479 by Abed Showgan, Andres Murillo, Asaf Shabtai, Aviram Zilberman, Chen Frydman, Rami Puzis, Rubin Krief, Sekiya Motoyoshi, Yuval Elovici.

**Figure 1.** Multi-agent automatic mitigation framework showing the overall architecture and agent interactions. Judge: Whereas the critic reviews the implemented configuration prior to attack execution, the judge evaluates the resulting attack outcome. It replays the attack scenario in the post-mitigation environment using Caldera and compares the resulting attack step success rate (ASSR) to the pre-mitigation baseli…

**Figure 2.** Single-agent baseline workflow for mitigation suggestion, implementation, and self-validation. … the project is rolled back to the previously accepted state. Because acceptance requires non-regression rather than strict improvement, non-interfering mitigations are retained even when they add no measurable ASSR reduction, yielding a defense-in-depth posture. Failure modes: Each role has a characteristic fail…

**Figure 3.** Evaluation workflow showing the parallel per-mitigation evaluations (independent, rolled-back) and the cumulative evaluation (persistent cumulative project with sequential mitigation replay). A demo video walking through this diagram can be found in Appendix 8.3. Multi-Agent without Critic: The multi-agent graph is executed with role-specialized suggester and implementer agents, and the implementation–cr…

**Figure 4.** MSR by runtime condition, pooled across attacks and topologies. The hatched bar shows the rate without the connectivity-regression check. … or has no incremental effect on top of the defenses already deployed on the cumulative project. Aggregate cumulative ME behavior across all runs is reported in Figures 5 (by condition) and 6 (by topology).

**Figure 6.** Cumulative ME (Mitigation Effectiveness) by topology, all conditions pooled.

**Figure 7.** MSR by enterprise topology (small/medium/large) for the Multi-Agent condition. Error bars indicate the standard error of the mean. [Spilled plot text: axis "Mitigation Success Rate (%)", ticks 0 to 100; categories Network-Based Traffic Filtering, Network-Based Traffic Inspection, Network Isolation, Host-Based Traffic Filtering, Host-Based Intrusion Prevention, Host Hardening, Application Control; bar labels 5.8% (n=139), 9.1% (n=11), 15.8% (n=19), 25.0% (n=20), 53…]

**Figure 8.** MSR by mitigation category for the Multi-Agent condition, pooled across topologies. Category labels are assigned for each mitigation and normalized to a shared reporting vocabulary (…

**Figure 9.** MSR by runtime condition and attack scenario. Hatched bar: rate without the connectivity-regression check. 8.3. Data and Code Availability: In the spirit of open science, we release the following artifacts at https://github.com/user32133/cohort and https://cohort-experiments-app.streamlit.app/: Agent prompts for all roles (Suggester, Implementer, Critic, Judge, Summarizer, and the single-agent baseline). …

**Figure 10.** Small enterprise network topology. The large enterprise network topology was inspired by Enterprise Network Lab: Bank Project by Kiki Oyewole (GNS3 Marketplace).

**Figure 11.** Medium enterprise network topology (inspired by Simple Network Layout by Gilbert Nims, GNS3 Marketplace). [Topology diagram; node labels include Accountant-PC-1 through Accountant-PC-4 (list truncated); interface labels omitted.]

**Figure 12.** Large enterprise network topology (inspired by Enterprise Network Lab: Bank Project by Kiki Oyewole, GNS3 Marketplace).

Original abstract

Mitigating an observed adversary in an enterprise network typically takes weeks of expert work: an analyst derives a mitigation tailored to that adversary, validates it without breaking production, and verifies it disrupts the specific attack. The procedure relies on expert judgment and cannot safely be exercised against the production network. COHORT is the first end-to-end framework to automate this procedure for deployable mitigations. A role-decomposed multi-agent LLM workflow proposes candidates, implements them as real device commands, and refines them through a critique loop, all on a high-fidelity GNS3 emulator running real vendor firmware (firewall, switch, router). Each candidate is evaluated by offensive replay: re-executing the original adversary on the mitigated network for a paired comparison against the unmitigated baseline, rather than the reward-signal or expert-judgment proxies used in prior simulation, hybrid, and configuration-generation work. Two further checks complement replay: a connectivity-regression check (LAN ping and internet HTTP probe) rejects mitigations that disrupt legitimate LAN or internet connectivity, and a cumulative evaluation stacks approved mitigations onto a persistent state to surface compound effects. Across three topologies and four attack scenarios (ransomware, lateral movement, DNS exfiltration, data theft), 46.7% of generated mitigations both disrupt the attack and preserve connectivity under replay, 4.4 times the rate of a single-agent baseline using the same model and tool access. A demo video walking through the framework is available with our released artifacts.
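The connectivity-regression check named in the abstract (a LAN ping and an internet HTTP probe) could look roughly like the sketch below. It is only an illustration under several assumptions: the function names are hypothetical, the `ping` flags are the Linux iputils ones, and the probes here run from the machine executing the script, whereas in the paper they would run from hosts inside the emulated network.

```python
import http.client
import subprocess
import urllib.request
from typing import Callable, Mapping


def lan_ping(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request using the system ping (Linux-style flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers unreachable hosts, HTTP errors, and malformed URLs.
        return False


def connectivity_preserved(
    probes: Mapping[str, Callable[[], bool]],
    baseline: Mapping[str, bool],
) -> bool:
    """A regression is a probe that passed before mitigation and fails after."""
    return all(probe() for name, probe in probes.items() if baseline[name])
```

Comparing against a pre-mitigation baseline, rather than requiring every probe to pass, avoids blaming a mitigation for connectivity that was already absent in the unmitigated topology. Whether the paper's check works this way is not stated in the abstract.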

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

COHORT gets a multi-agent LLM to generate mitigations that pass offensive replay in GNS3 at 4.4x a single-agent baseline, but the deployable claim rests on unvalidated emulator fidelity.

The main takeaway is that this paper builds an end-to-end workflow where LLM agents in different roles propose mitigations, turn them into real device commands, apply them on a GNS3 emulator with vendor firmware, and test them by replaying the original attack. They report 46.7% of the outputs both stop the attack and keep LAN/internet connectivity, which is 4.4 times the single-agent rate on the same model.
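The single-agent success rate is not quoted in this section, but it can be backed out from the two reported figures. The sketch below does that arithmetic; the function name is hypothetical, and the bounds assume both reported numbers were rounded to one decimal place.

```python
def implied_baseline(multi_agent_rate: float, improvement_factor: float) -> float:
    """Single-agent rate implied by a multi-agent rate and its ratio to baseline."""
    return multi_agent_rate / improvement_factor


# Reported values: 46.7% success, 4.4 times the single-agent baseline.
point = implied_baseline(46.7, 4.4)

# Rounding bounds: 46.7 could be anywhere in [46.65, 46.75),
# and 4.4 anywhere in [4.35, 4.45).
low = implied_baseline(46.65, 4.45)
high = implied_baseline(46.75, 4.35)

print(f"implied single-agent rate: {point:.1f}%")   # prints 10.6%
print(f"plausible range: {low:.1f}% to {high:.1f}%")  # prints 10.5% to 10.7%
```

So the baseline succeeds on roughly one candidate in ten, which is the number a reader needs in order to judge how much of the 4.4× gain is absolute improvement rather than a small denominator.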

The evaluation approach is the strongest part. Direct offensive replay on the mitigated topology is more relevant than reward signals or expert scoring used in earlier work. Adding the connectivity regression checks and the cumulative stacking of approved mitigations is a practical way to surface problems that single checks would miss. Running on real firmware rather than abstract models is also an improvement over pure simulation papers.

The soft spot is exactly the one in the stress-test note. All results live inside GNS3, with no comparison to physical hardware and only two simple probes for side effects. If the emulator does not reproduce production traffic or if the generated commands create issues outside those probes, the success numbers will not carry over. The abstract gives no details on replay traffic generation, run-to-run variance, or how many candidates were filtered before the final count, so the 46.7% figure is hard to interpret without more data.

This is for groups working on LLM agents for network operations or automated security response. A reader who wants concrete examples of role decomposition and replay-based testing will find usable ideas even if they do not adopt the full system.

I would send it to peer review. The experimental loop is a real attempt at the full mitigation cycle and the baseline comparison makes the improvement claim checkable, even though referees will need to press on the emulator validation.

Referee Report

1 major / 2 minor

Summary. The paper introduces COHORT, which it presents as the first end-to-end multi-agent LLM framework that automates mitigation generation, implementation as real device commands, and refinement via a critique loop on a GNS3 emulator running real vendor firmware. Candidates are validated by offensive replay of the original adversary (paired comparison to baseline), plus connectivity checks (LAN ping, internet HTTP) and cumulative stacking. Across three topologies and four attacks, it reports a 46.7% success rate (attack disrupted while connectivity is preserved), 4.4× the single-agent baseline using the same model and tools. Artifacts and a demo video are released.

Significance. If the GNS3-based replay results generalize, the framework would represent a meaningful advance over prior simulation, hybrid, or configuration-generation approaches by supplying an automated, replay-validated path to deployable mitigations and by releasing reproducible artifacts. The use of real firmware and explicit offensive replay rather than proxy rewards is a concrete methodological improvement.

major comments (1)

[Evaluation] Evaluation section (and abstract): the headline claim that the generated mitigations are 'deployable' and that the 4.4× improvement is meaningful rests on the assumption that GNS3 with real firmware faithfully reproduces production attack surfaces and legitimate flows. The manuscript supplies no hardware-to-emulator validation experiments, no instrumentation details for how replay traffic is generated or captured, and no measurement of LLM-generated command side-effects beyond the two connectivity probes. Without these, the quantitative results cannot be separated from possible emulation artifacts.

minor comments (2)

[Abstract] The abstract states concrete success rates and improvement factors but does not mention statistical tests, number of runs, or error bars; the full manuscript should make these explicit in the results tables or text.
Notation for the multi-agent roles and the cumulative evaluation stacking procedure could be clarified with a diagram or pseudocode to aid reproducibility.
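As an indication of the kind of pseudocode the second comment asks for, the implement-and-critique loop might be written as below. This is a sketch under stated assumptions, not the paper's algorithm: the agents are modeled as plain callables, the critic is assumed to return `None` when it has no objection, and the `max_rounds` cap of 3 is a hypothetical value.

```python
from typing import Callable, Optional

# Implementer: (suggestion, critic feedback or None) -> device commands.
Implementer = Callable[[str, Optional[str]], str]
# Critic: device commands -> objection text, or None if the commands pass.
Critic = Callable[[str], Optional[str]]


def implement_with_critique(
    suggestion: str,
    implement: Implementer,
    critique: Critic,
    max_rounds: int = 3,  # hypothetical cap on refinement rounds
) -> Optional[str]:
    """Refine device commands until the critic accepts or rounds run out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        commands = implement(suggestion, feedback)
        feedback = critique(commands)
        if feedback is None:
            return commands  # accepted: ready for replay by the judge
    return None  # never accepted: the candidate is dropped
```

The critic runs before any attack replay, matching the division of labor the figure text describes, in which the judge only sees configurations the critic has already passed.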

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive critique of the evaluation methodology. We address the concerns point by point below and will incorporate clarifications and additional details where feasible.

Referee: [Evaluation] Evaluation section (and abstract): the headline claim that the generated mitigations are 'deployable' and that the 4.4× improvement is meaningful rests on the assumption that GNS3 with real firmware faithfully reproduces production attack surfaces and legitimate flows. The manuscript supplies no hardware-to-emulator validation experiments, no instrumentation details for how replay traffic is generated or captured, and no measurement of LLM-generated command side-effects beyond the two connectivity probes. Without these, the quantitative results cannot be separated from possible emulation artifacts.

Authors: We agree that the evaluation would benefit from greater transparency on these points.

- Hardware-to-emulator validation: No such experiments were performed, as they would require access to identical production hardware configurations, which was outside the scope and resources of this study. GNS3 running unmodified vendor firmware is a widely accepted high-fidelity platform in networking research; we will add an explicit limitations paragraph acknowledging the emulation assumption and its implications for generalizability.
- Instrumentation details: The current manuscript describes replay at a high level. We will expand the Evaluation section with concrete details on traffic generation (re-use of the original adversary tooling and packet captures) and capture methods (GNS3 network monitoring interfaces) to allow replication.
- Side-effects beyond the two probes: The LAN ping and HTTP checks are the primary regression tests reported. We acknowledge they do not exhaustively measure all possible command side-effects (e.g., performance or subtle policy interactions). We will add discussion of this scope limitation and note that the offensive-replay success metric remains the core validation signal.

These textual expansions and the new limitations paragraph will be included in the revised manuscript.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical systems framework with no derivations or fitted predictions

The paper describes an end-to-end LLM-based framework evaluated via replay experiments on GNS3 emulators. No equations, parameters, uniqueness theorems, or derivation steps appear in the provided text. The headline quantitative result (46.7% success rate, 4.4× baseline) is presented as an experimental outcome, not a prediction derived from fitted inputs or self-citations. The work is self-contained against external benchmarks in the sense that its claims are falsifiable via replication on physical hardware, with no load-bearing self-referential definitions or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no free parameters, mathematical axioms, or newly postulated entities; the framework depends on existing LLM capabilities and the fidelity of the GNS3 emulator, neither of which is quantified or derived here.

Reference graph

Works this paper leans on

89 extracted references · 20 canonical work pages · 4 internal anchors

[1]

IEEE Wireless Communications , year=

A Trustworthy Agentic Multi-LLM Network: Challenges, Solutions, and a Use Case , author=. IEEE Wireless Communications , year=
[2]

Advances in Neural Information Processing Systems , volume=

Camel: Communicative agents for" mind" exploration of large language model society , author=. Advances in Neural Information Processing Systems , volume=
[3]

Proceedings of the ACM on Software Engineering , volume=

Can llms replace human evaluators? an empirical study of llm-as-a-judge in software engineering , author=. Proceedings of the ACM on Software Engineering , volume=. 2025 , publisher=

2025
[4]

arXiv preprint arXiv:2409.11239 , year=

Llm-as-a-judge & reward model: What they can and cannot do , author=. arXiv preprint arXiv:2409.11239 , year=

work page arXiv
[5]

A Survey on LLM-as-a-Judge

A survey on llm-as-a-judge , author=. arXiv preprint arXiv:2411.15594 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Advances in neural information processing systems , volume=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=
[7]

Computers & Security , volume=

Moving towards agile cybersecurity incident response: A case study exploring the enabling role of big data analytics-embedded dynamic capabilities , author=. Computers & Security , volume=. 2023 , publisher=

2023
[8]

Computer Science Review , volume=

A quest for research and knowledge gaps in cybersecurity awareness for small and medium-sized enterprises , author=. Computer Science Review , volume=. 2023 , publisher=

2023
[9]

The twelfth international conference on learning representations , year=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. The twelfth international conference on learning representations , year=
[10]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Large language model based multi-agents: A survey of progress and challenges , author=. arXiv preprint arXiv:2402.01680 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Vicinagearth , volume=

A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges , author=. Vicinagearth , volume=. 2024 , publisher=

2024
[12]

Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

ESORICS 2025: 30th European Symposium on Research in computer security , year=

Network intrusion response systems: towards standardized evaluation of intrusion response , author=. ESORICS 2025: 30th European Symposium on Research in computer security , year=

2025
[14]

34th USENIX Security Symposium (USENIX Security 25) , pages=

Cloak, Honey, Trap: Proactive Defenses Against \ LLM \ Agents , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
[15]

arXiv preprint arXiv:2504.00428 , year=

LLM-Assisted Proactive Threat Intelligence for Automated Reasoning , author=. arXiv preprint arXiv:2504.00428 , year=

work page arXiv
[16]

2003 , publisher=

Policy-based network management: solutions for the next generation , author=. 2003 , publisher=

2003
[17]

International Conference on Ubiquitous Security , pages=

A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models , author=. International Conference on Ubiquitous Security , pages=. 2023 , organization=

2023
[18]

Proceedings of the 34th ACM/SIGAPP symposium on applied computing , pages=

Towards automated network mitigation analysis , author=. Proceedings of the 34th ACM/SIGAPP symposium on applied computing , pages=
[19]

2025 International Wireless Communications and Mobile Computing (IWCMC) , pages=

Autonomous Cyber Incident Response Using Reasoning and Action , author=. 2025 International Wireless Communications and Mobile Computing (IWCMC) , pages=. 2025 , organization=

2025
[20]

Proceedings of the 22nd ACM Workshop on Hot Topics in Networks , pages=

A Holistic View of AI-driven Network Incident Management , author=. Proceedings of the 22nd ACM Workshop on Hot Topics in Networks , pages=
[21]

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , pages=

Recommending root-cause and mitigation steps for cloud incidents using large language models , author=. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , pages=. 2023 , organization=

2023
[22]

International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=

Inferring Recovery Steps from Cyber Threat Intelligence Reports , author=. International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment , pages=. 2024 , organization=

2024
[23]

arXiv preprint arXiv:2402.17531 , year=

Nissist: An incident mitigation copilot based on troubleshooting guides , author=. arXiv preprint arXiv:2402.17531 , year=

work page arXiv
[24]

arXiv preprint arXiv:2309.16422 , year=

Cyber sentinel: Exploring conversational agents in streamlining security tasks with gpt-4 , author=. arXiv preprint arXiv:2309.16422 , year=

work page arXiv
[25]

Proceedings of the 8th Asia-Pacific Workshop on Networking , pages=

ShieldGPT: An LLM-based framework for DDoS mitigation , author=. Proceedings of the 8th Asia-Pacific Workshop on Networking , pages=
[26]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[27]

First conference on language modeling , year=

Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First conference on language modeling , year=
[28]

Forty-first international conference on machine learning , year=

Improving factuality and reasoning in language models through multiagent debate , author=. Forty-first international conference on machine learning , year=
[29]

Computers and Electrical Engineering , volume=

The Paradigm of Hallucinations in AI-driven cybersecurity systems: Understanding taxonomy, classification outcomes, and mitigations , author=. Computers and Electrical Engineering , volume=. 2025 , publisher=

2025
[30]

ACSAC 2025-Annual Computer Security Applications Conference Workshops , year=

AgentNIRS: An LLM-driven agent for network intrusion response , author=. ACSAC 2025-Annual Computer Security Applications Conference Workshops , year=

2025
[31]

arXiv preprint arXiv:2501.13411 , year=

VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework , author=. arXiv preprint arXiv:2501.13411 , year=

work page arXiv
[32]

IEEE access , year=

Generative ai for cyber security: Analyzing the potential of chatgpt, dall-e and other models for enhancing the security space , author=. IEEE access , year=
[33]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=
[34]

IEEE/CAA Journal of Automatica Sinica , volume=

When software security meets large language models: A survey , author=. IEEE/CAA Journal of Automatica Sinica , volume=. 2025 , publisher=

2025
[35]

IEEE Transactions on Network and Service Management , year=

LLM-powered Intent-driven Configuration Generation for Multi-vendor Networks , author=. IEEE Transactions on Network and Service Management , year=
[36]

International Journal of Network Management , volume=

A Comprehensive Survey on LLM-Based Network Management and Operations , author=. International Journal of Network Management , volume=. 2025 , publisher=

2025
[37]

Future Internet , volume=

Large Language Models Meet Next-Generation Networking Technologies: A Review , author=. Future Internet , volume=. 2024 , publisher=

2024
[38]

Computer Networks , volume=

A survey on network simulators, emulators, and testbeds used for research and education , author=. Computer Networks , volume=. 2023 , publisher=

2023
[39]

Cybersecurity , volume=

When llms meet cybersecurity: A systematic literature review , author=. Cybersecurity , volume=. 2025 , publisher=

2025
[40]

IEEE Network , year=

Large language models for networking: Workflow, advances and challenges , author=. IEEE Network , year=
[41]

IEEE Access , year=

A survey on enterprise network security: Asset behavioral monitoring and distributed attack detection , author=. IEEE Access , year=
[42]

Computers & Security , volume=

Cyber ranges and security testbeds: Scenarios, functions, tools and architecture , author=. Computers & Security , volume=. 2020 , publisher=

2020
[43]

Proceedings of the 4th ACM Workshop on Cyber-Physical System Security , pages=

Towards security-aware virtual environments for digital twins , author=. Proceedings of the 4th ACM Workshop on Cyber-Physical System Security , pages=
[44]

USENIX Security Symposium , volume=

MulVAL: A Logic-based Network Security Analyzer , author=. USENIX Security Symposium , volume=
[45]

19th Annual Computer Security Applications Conference , pages=

Efficient minimum-cost network hardening via exploit dependency graphs , author=. 19th Annual Computer Security Applications Conference , pages=. 2003 , organization=

2003
[46]

Journal of Information Security and Applications , volume=

A real-time automated attack-defense graph generation approach , author=. Journal of Information Security and Applications , volume=. 2025 , publisher=

2025
[47]

IEEE Transactions on Neural Networks and Learning Systems , volume=

Deep reinforcement learning for cyber security , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. 2021 , publisher=

2021
[48]

ACM Computing Surveys , volume=

A multi-vocal review of security orchestration , author=. ACM Computing Surveys , volume=. 2019 , publisher=

2019
[49]

International Journal of Information and Computer Security , volume=

A taxonomy of intrusion response systems , author=. International Journal of Information and Computer Security , volume=. 2007 , publisher=

2007
[50]

Proceedings of the 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT) , pages=

NetConfEval: Can LLMs Facilitate Network Configuration? , author=. Proceedings of the 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT) , pages=. 2024 , organization=

2024
[51]

2025 , url =

Cost of a Data Breach Report , author =. 2025 , url =

2025
[52]

2025 , url =

MITRE ATT&CK. 2025 , url =

2025
[53]

Caldera — Print Software Driven by Innovation , year =
[54]

2025 , volume =

Ifland, Beni and Krief, Rubin and Zilberman, Aviram and Duani, Elad and Ohana, Miro and Murillo, Andres and Manor, Ofir and Lavi, Ortal and Hikichi, Kenji and Shabtai, Asaf and Elovici, Yuval and Puzis, Rami , booktitle =. 2025 , volume =. doi:10.1109/ICDCSW63273.2025.00026 , url =

work page doi:10.1109/icdcsw63273.2025.00026 2025
[55]

IEEE Communications Magazine , volume=

Large language models for zero touch network configuration management , author=. IEEE Communications Magazine , volume=. 2024 , publisher=

2024
[56]

arXiv preprint arXiv:2304.07411 , year=

SoK: The MITRE ATT&CK Framework in Research and Practice , author=. arXiv preprint arXiv:2304.07411 , year=

work page arXiv
[57]

Proceedings of the 32nd Annual Conference on Computer Security Applications , pages=

Intelligent, automated red team emulation , author=. Proceedings of the 32nd Annual Conference on Computer Security Applications , pages=
[58]

Military Cyber Affairs , volume=

Characterizing Caldera’s Cyber Attack Emulation Capabilities , author=. Military Cyber Affairs , volume=
[59]

ACM Computing Surveys , volume=

Automation for network security configuration: State of the art and research trends , author=. ACM Computing Surveys , volume=. 2023 , publisher=

2023
[60]

Journal of Cybersecurity , volume=

Simulation for cybersecurity: state of the art and future directions , author=. Journal of Cybersecurity , volume=. 2021 , publisher=

2021
[61]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Exploring the efficacy of multi-agent reinforcement learning for autonomous cyber defence: A cage challenge 4 perspective , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[62]

arXiv preprint arXiv:2407.11070 , year=

Optimal defender strategies for CAGE-2 using causal modeling and tree search , author=. arXiv preprint arXiv:2407.11070 , year=

work page arXiv
[63]

arXiv preprint arXiv:2211.15557 , year=

Beyond cage: Investigating generalization of learned autonomous network defense policies , author=. arXiv preprint arXiv:2211.15557 , year=

work page arXiv
[64]

arXiv preprint arXiv:2410.16324 , year=

Cyborg++: An enhanced gym for the development of autonomous cyber agents , author=. arXiv preprint arXiv:2410.16324 , year=

work page arXiv
[65]

CybORG: A gym for the development of autonomous cyber agents. arXiv preprint arXiv:2108.09118.
[66]

Microsoft Defender Research Team. CyberBattleSim.
[67]

Scalable and Generalizable RL Agents for Attack Path Discovery via Continuous Invariant Spaces. 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2025.
[68]

NASimEmu: Network attack simulator & emulator for training agents generalizing to novel scenarios. European Symposium on Research in Computer Security, 2023.
[69]

2024
[70]

Towards a high fidelity training environment for autonomous cyber defense agents. Proceedings of the 17th Cyber Security Experimentation and Test Workshop.
[71]

Network environment design for autonomous cyberdefense. arXiv preprint arXiv:2103.07583.
[72]

Intrusion prevention through optimal stopping. IEEE Transactions on Network and Service Management, 2022.
[73]

CyGIL: A cyber gym for training autonomous agents over emulated network systems. arXiv preprint arXiv:2109.03331.
[74]

Caldera.
[75]

Brent Harrell, Melanie Chan, Hojin Han, Kristin Voss, Ganesh Danke, Lauren Ji, Olivia Brobin, Kate Esprit.
[76]

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems. 2025.
[77]

xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems. arXiv preprint arXiv:2509.13021.
[78]

Multi-agent actor-critics in autonomous cyber defense. arXiv preprint arXiv:2410.09134.
[79]

Large language models are autonomous cyber defenders. IEEE Conference on Artificial Intelligence (CAI), 2025.
[80]

Design and evaluation of an Autonomous Cyber Defence agent using DRL and an augmented LLM. Computer Networks, 2025.

