COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies
Pith reviewed 2026-06-30 03:26 UTC · model grok-4.3
The pith
A multi-agent LLM system automates network mitigation by generating device commands, testing them via attack replay on an emulator, and checking that connectivity remains intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COHORT is the first end-to-end framework to automate this procedure for deployable mitigations. A role-decomposed multi-agent LLM workflow proposes candidates, implements them as real device commands, and refines them through a critique loop, all on a high-fidelity GNS3 emulator running real vendor firmware. Each candidate is evaluated by offensive replay: re-executing the original adversary on the mitigated network for a paired comparison against the unmitigated baseline, rather than the reward-signal or expert-judgment proxies used in prior work. Two further checks complement replay: a connectivity-regression check rejects mitigations that disrupt legitimate LAN or internet connectivity, and a cumulative evaluation stacks approved mitigations onto a persistent state to surface compound effects.
What carries the argument
Role-decomposed multi-agent LLM workflow that proposes mitigations, translates them to real device commands, and validates them by offensive replay plus connectivity checks inside a GNS3 emulator running actual vendor firmware.
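The propose, implement, and critique loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `run_workflow`, `propose`, `implement`, `evaluate`, and the `max_rounds` budget are introduced here, the toy agents stand in for LLM calls and for the emulator, and the command strings are placeholders rather than real vendor syntax.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    description: str      # natural-language mitigation idea
    commands: List[str]   # device commands derived from the idea


def run_workflow(
    propose: Callable[[str], str],
    implement: Callable[[str], List[str]],
    evaluate: Callable[[List[str]], Optional[str]],
    attack_report: str,
    max_rounds: int = 3,  # hypothetical refinement budget
) -> Optional[Candidate]:
    """Propose, implement, and refine until evaluation passes or rounds run out.

    `evaluate` returns None on success, or a critique string on failure.
    """
    context = attack_report
    for _ in range(max_rounds):
        idea = propose(context)
        commands = implement(idea)
        critique = evaluate(commands)
        if critique is None:
            return Candidate(idea, commands)
        # Feed the critique back to the proposer for the next round.
        context = f"{attack_report}\nPrevious attempt failed: {critique}"
    return None


# Toy stand-ins for the agent roles (assumptions, not real agents).
def toy_propose(context: str) -> str:
    return "block SMB between subnets" if "failed" in context else "enable extra logging"


def toy_implement(idea: str) -> List[str]:
    # Placeholder strings, not real device commands.
    return ["PLACEHOLDER deny tcp/445"] if "block" in idea else ["PLACEHOLDER log tcp/445"]


def toy_evaluate(commands: List[str]) -> Optional[str]:
    blocked = any("deny" in c for c in commands)
    return None if blocked else "attack still succeeds under replay"


result = run_workflow(toy_propose, toy_implement, toy_evaluate, "ransomware via SMB")
print(result)
```

In this toy run the first proposal only adds logging, the evaluator returns a critique, and the second round produces a blocking rule that passes. Keeping the evaluator as a plain callable makes it easy to swap the stub for a real replay harness.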
If this is right
- Mitigations emerge as concrete device commands that can be copied directly into production equipment.
- Stacking multiple validated mitigations on the same emulated state reveals whether later changes undo the protection of earlier ones.
- The same workflow applies across ransomware, lateral movement, DNS exfiltration, and data-theft scenarios without changing the agent roles.
- Validation occurs entirely inside the emulator, so the production network is never exposed during testing.
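The stacking point above can be made concrete with a small model. The sketch below is an assumption-laden toy, not the paper's evaluation code: the network state is reduced to a set of blocked flow labels, each hypothetical `Mitigation` may block or unblock flows, and an attack counts as disrupted when any flow it needs is blocked. Replaying every attack after each stacked change is what exposes a later mitigation undoing an earlier one.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Mitigation:
    name: str
    blocks: FrozenSet[str] = frozenset()    # flows this change blocks
    unblocks: FrozenSet[str] = frozenset()  # flows this change re-opens


def attack_disrupted(blocked: Set[str], attack_path: Set[str]) -> bool:
    # The attack chain breaks if any flow it needs is blocked.
    return bool(blocked & attack_path)


def stack(
    mitigations: List[Mitigation], attacks: Dict[str, Set[str]]
) -> List[Tuple[str, Dict[str, bool]]]:
    """Apply mitigations in order to one persistent state, replaying after each."""
    blocked: Set[str] = set()
    log = []
    for m in mitigations:
        blocked = (blocked - m.unblocks) | m.blocks
        status = {name: attack_disrupted(blocked, path) for name, path in attacks.items()}
        log.append((m.name, status))
    return log


def regressions(log: List[Tuple[str, Dict[str, bool]]]) -> List[str]:
    """Attacks disrupted at some earlier step but not after the final step."""
    if not log:
        return []
    final = log[-1][1]
    return sorted(
        name
        for name, ok in final.items()
        if not ok and any(step[name] for _, step in log[:-1])
    )


attacks = {"lateral": {"smb"}, "exfil": {"dns-tunnel"}}
plan = [
    Mitigation("m1", blocks=frozenset({"smb"})),
    # m2 replaces the earlier rule set and silently re-opens SMB.
    Mitigation("m2", blocks=frozenset({"dns-tunnel"}), unblocks=frozenset({"smb"})),
]
log = stack(plan, attacks)
print(log)
print(regressions(log))
```

Here `m2` stops the exfiltration path but re-opens the flow that `m1` had closed, so `regressions` reports the lateral-movement attack as protected earlier and exposed again at the end.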
Where Pith is reading between the lines
- If the emulator-to-production gap is small, response time to a new adversary could shrink from weeks of manual work to hours of automated generation and replay.
- The method could be extended to test mitigations against variants of the original attack rather than only the exact replay used in evaluation.
- The persistent state used for cumulative evaluation might also serve as a sandbox for exploring how defenders could roll back or adjust mitigations after initial deployment.
Load-bearing premise
The GNS3 emulator running real vendor firmware accurately captures production network behavior for both attack replay and mitigation effectiveness, and LLM-generated device commands execute without introducing undetected side effects or new attack surfaces.
What would settle it
Take one of the mitigations the system produced, apply the exact commands to a live production network matching one of the tested topologies, replay the same attack, and observe whether the attack is stopped while normal traffic continues without regression.
Original abstract
Mitigating an observed adversary in an enterprise network typically takes weeks of expert work: an analyst derives a mitigation tailored to that adversary, validates it without breaking production, and verifies it disrupts the specific attack. The procedure relies on expert judgment and cannot safely be exercised against the production network. COHORT is the first end-to-end framework to automate this procedure for deployable mitigations. A role-decomposed multi-agent LLM workflow proposes candidates, implements them as real device commands, and refines them through a critique loop, all on a high-fidelity GNS3 emulator running real vendor firmware (firewall, switch, router). Each candidate is evaluated by offensive replay: re-executing the original adversary on the mitigated network for a paired comparison against the unmitigated baseline, rather than the reward-signal or expert-judgment proxies used in prior simulation, hybrid, and configuration-generation work. Two further checks complement replay: a connectivity-regression check (LAN ping and internet HTTP probe) rejects mitigations that disrupt legitimate LAN or internet connectivity, and a cumulative evaluation stacks approved mitigations onto a persistent state to surface compound effects. Across three topologies and four attack scenarios (ransomware, lateral movement, DNS exfiltration, data theft), 46.7% of generated mitigations both disrupt the attack and preserve connectivity under replay, 4.4 times the rate of a single-agent baseline using the same model and tool access. A demo video walking through the framework is available with our released artifacts.
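The acceptance rule the abstract describes (paired replay plus two connectivity probes) can be written as a small decision function. This is a hedged sketch: the `ReplayOutcome` record, the verdict labels, and the order in which the checks are applied are assumptions made here for illustration, and the booleans would in practice come from the emulator rather than being set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayOutcome:
    attack_succeeded: bool   # did the replayed adversary reach its goal?
    lan_ping_ok: bool        # LAN ping probe result
    internet_http_ok: bool   # internet HTTP probe result


def verdict(baseline: ReplayOutcome, mitigated: ReplayOutcome) -> str:
    """Compare the mitigated run against the unmitigated baseline."""
    if not baseline.attack_succeeded:
        # Without a working attack at baseline there is nothing to compare.
        return "invalid-baseline"
    if not (mitigated.lan_ping_ok and mitigated.internet_http_ok):
        # Connectivity regression: legitimate traffic was disrupted.
        return "rejected-connectivity"
    if mitigated.attack_succeeded:
        return "ineffective"
    return "accepted"


baseline = ReplayOutcome(attack_succeeded=True, lan_ping_ok=True, internet_http_ok=True)
print(verdict(baseline, ReplayOutcome(False, True, True)))   # accepted
print(verdict(baseline, ReplayOutcome(False, True, False)))  # rejected-connectivity
print(verdict(baseline, ReplayOutcome(True, True, True)))    # ineffective
```

Under this reading, only the "accepted" outcome would count toward the reported 46.7%: the attack is disrupted and both probes still pass.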
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COHORT, the first end-to-end multi-agent LLM framework that automates mitigation generation, implementation as real device commands, and refinement via a critique loop on a GNS3 emulator running real vendor firmware. Candidates are validated by offensive replay of the original adversary (paired comparison to baseline), plus connectivity checks (LAN ping, internet HTTP) and cumulative stacking; across three topologies and four attacks it reports a 46.7% success rate (disrupt attack while preserving connectivity), 4.4× the single-agent baseline using the same model and tools. Artifacts and a demo video are released.
Significance. If the GNS3-based replay results generalize, the framework would represent a meaningful advance over prior simulation, hybrid, or configuration-generation approaches by supplying an automated, replay-validated path to deployable mitigations and by releasing reproducible artifacts. The use of real firmware and explicit offensive replay rather than proxy rewards is a concrete methodological improvement.
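The text quoted here gives the multi-agent rate (46.7%) and the ratio (4.4×) but not the single-agent rate itself. Assuming the 4.4× figure is the plain ratio of the two success rates, the implied baseline can be recovered with simple arithmetic. The helper name `implied_baseline` is introduced here for illustration, and the bounds only account for both published figures being rounded to one decimal place.

```python
def implied_baseline(rate_pct: float, ratio: float) -> float:
    """Baseline success rate implied by rate = ratio * baseline."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return rate_pct / ratio


point = implied_baseline(46.7, 4.4)
# Both inputs are rounded, so bound the estimate with the rounding extremes.
low = implied_baseline(46.65, 4.45)
high = implied_baseline(46.75, 4.35)
print(f"implied baseline: {point:.1f}% (range {low:.1f}% to {high:.1f}%)")
```

On these assumptions the single-agent baseline would sit at roughly 10.6%, which is a derived estimate and not a number reported in the text above.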
major comments (1)
- [Evaluation] Evaluation section (and abstract): the headline claim that the generated mitigations are 'deployable' and that the 4.4× improvement is meaningful rests on the assumption that GNS3 with real firmware faithfully reproduces production attack surfaces and legitimate flows. The manuscript supplies no hardware-to-emulator validation experiments, no instrumentation details for how replay traffic is generated or captured, and no measurement of LLM-generated command side-effects beyond the two connectivity probes. Without these, the quantitative results cannot be separated from possible emulation artifacts.
minor comments (2)
- [Abstract] The abstract states concrete success rates and improvement factors but does not mention statistical tests, number of runs, or error bars; the full manuscript should make these explicit in the results tables or text.
- Notation for the multi-agent roles and the cumulative evaluation stacking procedure could be clarified with a diagram or pseudocode to aid reproducibility.
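To illustrate why the first minor comment matters, the sketch below computes a 95% Wilson score interval for a proportion near 46.7% at several sample sizes. The sample sizes and success counts are hypothetical, since the number of evaluated mitigations is not given in the text reviewed here; the counts were chosen only because each ratio rounds to 46.7%.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical (successes, n) pairs that all round to 46.7%.
for successes, n in [(7, 15), (21, 45), (70, 150)]:
    lo, hi = wilson_interval(successes, n)
    print(f"{successes}/{n}: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

The same headline percentage carries very different uncertainty depending on the number of runs, which is why reporting counts or intervals alongside the rate would strengthen the result.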
Simulated Author's Rebuttal
We thank the referee for the constructive critique of the evaluation methodology. We address the concerns point by point below and will incorporate clarifications and additional details where feasible.
- Referee: [Evaluation] Evaluation section (and abstract): the headline claim that the generated mitigations are 'deployable' and that the 4.4× improvement is meaningful rests on the assumption that GNS3 with real firmware faithfully reproduces production attack surfaces and legitimate flows. The manuscript supplies no hardware-to-emulator validation experiments, no instrumentation details for how replay traffic is generated or captured, and no measurement of LLM-generated command side-effects beyond the two connectivity probes. Without these, the quantitative results cannot be separated from possible emulation artifacts.
  Authors: We agree that the evaluation would benefit from greater transparency on these points.
  - Hardware-to-emulator validation: No such experiments were performed, as they would require access to identical production hardware configurations, which was outside the scope and resources of this study. GNS3 running unmodified vendor firmware is a widely accepted high-fidelity platform in networking research; we will add an explicit limitations paragraph acknowledging the emulation assumption and its implications for generalizability.
  - Instrumentation details: The current manuscript describes replay at a high level. We will expand the Evaluation section with concrete details on traffic generation (re-use of the original adversary tooling and packet captures) and capture methods (GNS3 network monitoring interfaces) to allow replication.
  - Side-effects beyond the two probes: The LAN ping and HTTP checks are the primary regression tests reported. We acknowledge they do not exhaustively measure all possible command side-effects (e.g., performance or subtle policy interactions). We will add discussion of this scope limitation and note that the offensive-replay success metric remains the core validation signal.
  These textual expansions and the new limitations paragraph will be included in the revised manuscript.
  Revision: partial
Circularity Check
No circularity: empirical systems framework with no derivations or fitted predictions
The paper describes an end-to-end LLM-based framework evaluated via replay experiments on GNS3 emulators. No equations, parameters, uniqueness theorems, or derivation steps appear in the provided text. The headline quantitative result (46.7% success rate, 4.4× baseline) is presented as an experimental outcome, not a prediction derived from fitted inputs or self-citations. The work is self-contained against external benchmarks in the sense that its claims are falsifiable via replication on physical hardware, with no load-bearing self-referential definitions or renamings of known results.