
arxiv: 2601.07122 · v2 · pith:TYTUNU4Mnew · submitted 2026-01-12 · 💻 cs.CR · cs.AI · cs.LG

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

Pith reviewed 2026-05-21 16:24 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords cloud network resilience, multi-agent reinforcement learning, large language models, cyber defense, human-in-the-loop, network availability, MITRE ATT&CK

The pith

A two-layer LLM-RL framework defends cloud networks by adapting to new structures and attacks without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard reinforcement learning defenses for cloud networks lose effectiveness when network layouts, scales, or attack patterns shift: they must be retrained to adapt, and they offer little room for human guidance. CyberOps-Bots counters this with an upper LLM layer that plans tactics and recognizes operator intent, alongside lower-level RL agents that carry out precise local actions. If successful, the result is a defense that keeps networks online under changing conditions while remaining understandable to operators.

Core claim

The authors present CyberOps-Bots as a hierarchical multi-agent system. An LLM agent equipped with ReAct planning, IPDRR perception, memory, and tool integration manages global awareness and high-level defense tactics drawn from the MITRE ATT&CK model, while separate pre-trained RL agents handle atomic resource deployment and isolation tasks in local regions. Experiments on real cloud data are reported to yield 68.5 percent higher maintained availability and a 34.7 percent jumpstart gain across scenario shifts, compared with prior methods.

What carries the argument

The two-layer hierarchy that pairs an LLM agent for strategic planning and human-in-the-loop oversight with lower-level RL agents for reliable local execution.
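
To make the division of labor concrete, the following is a minimal sketch of a two-layer control loop. It is an illustration under stated assumptions, not the paper's implementation: the names `RegionState`, `plan_tactics`, `POLICIES`, and `defend_step` are hypothetical, the planner is a fixed rule standing in for an LLM call, and each "pre-trained RL agent" is stubbed as a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# All names here are hypothetical; the summary does not give the paper's interfaces.


@dataclass
class RegionState:
    region: str
    compromised: int  # compromised nodes observed in this region
    total: int        # total nodes in this region


# Lower layer: one stubbed "pre-trained" policy per tactic.
# A real system would call a trained RL policy here.
def isolate_policy(state: RegionState) -> str:
    return f"isolate:{state.region}"


def deploy_policy(state: RegionState) -> str:
    return f"deploy_resource:{state.region}"


def monitor_policy(state: RegionState) -> str:
    return f"monitor:{state.region}"


POLICIES: Dict[str, Callable[[RegionState], str]] = {
    "isolate": isolate_policy,
    "deploy": deploy_policy,
    "monitor": monitor_policy,
}


def plan_tactics(
    states: List[RegionState],
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Upper layer: choose one tactic per region.

    A fixed threshold rule stands in for the LLM planner. `overrides`
    models human-in-the-loop input: an operator can pin a tactic for a region.
    """
    overrides = overrides or {}
    plan: Dict[str, str] = {}
    for s in states:
        ratio = s.compromised / s.total if s.total else 0.0
        if s.region in overrides:
            plan[s.region] = overrides[s.region]
        elif ratio >= 0.5:
            plan[s.region] = "isolate"
        elif ratio > 0:
            plan[s.region] = "deploy"
        else:
            plan[s.region] = "monitor"
    return plan


def defend_step(
    states: List[RegionState],
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """One control step: plan globally, then execute locally per region."""
    plan = plan_tactics(states, overrides)
    return [POLICIES[plan[s.region]](s) for s in states]
```

The point of the sketch is the interface: the planner only ever emits a tactic name from a fixed vocabulary, so swapping the rule for a language model would not change what the lower layer can be asked to do.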

If this is right

  • Defense policies remain effective when network size or topology changes without any model retraining.
  • Operators can inject intent through the LLM layer to steer responses while RL agents still execute reliably.
  • Pre-trained RL components can be reused across different regions, reducing the cost of scaling defenses.
  • Overall system resilience rises because global planning adapts faster than pure RL approaches while local actions stay deterministic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of language-based planning from action execution could be tested in other dynamic control settings such as power-grid load balancing under shifting demand.
  • Over time the framework might allow security teams to update high-level goals in natural language rather than rewriting reward functions for each new threat class.
  • Deployment logs from the LLM planning layer could supply traceable records that help auditors verify compliance during incidents.

Load-bearing premise

The real cloud datasets used for testing accurately capture the range of network structures, scales, attack strategies, and intensities that occur in live production environments, and LLM outputs can be converted into RL actions without introducing new errors or vulnerabilities.
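
The second half of this premise, converting LLM output into RL actions safely, can be illustrated with a small validation gate. This is a sketch only: it assumes the LLM emits a JSON object with `tactic` and `region` fields, and the vocabularies `ALLOWED_TACTICS` and `KNOWN_REGIONS` are hypothetical. The paper's actual mapping is not described in this summary.

```python
import json
from typing import Optional, Tuple

# Hypothetical vocabularies; a real deployment would derive these from
# the RL agents' action spaces and the current network inventory.
ALLOWED_TACTICS = {"isolate", "deploy", "monitor"}
KNOWN_REGIONS = {"web", "db", "cache"}


def parse_llm_action(raw: str) -> Optional[Tuple[str, str]]:
    """Return (tactic, region) if the LLM output is well-formed, else None.

    Anything that is not valid JSON, not an object, or that names an
    unknown tactic or region is rejected rather than guessed at.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    tactic = data.get("tactic")
    region = data.get("region")
    if not isinstance(tactic, str) or not isinstance(region, str):
        return None
    if tactic not in ALLOWED_TACTICS or region not in KNOWN_REGIONS:
        return None
    return tactic, region
```

Rejecting malformed output outright, instead of repairing it, keeps the failure mode visible: a caller can count rejections and fall back to a default tactic, which is one way the "execution failures" mentioned in this summary could be measured.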

What would settle it

Running the framework on a production-scale cloud that encounters an attack intensity or network configuration absent from the study datasets and observing whether availability falls below the reported levels or LLM-generated actions produce execution failures.

Figures

Figures reproduced from arXiv: 2601.07122 by Feiyang Li, Guoshun Nan, Hao Hu, Jipeng Tang, Xinye Cao, Yingchang Jiang, Yixiao Peng, Yuling Liu.

Figure 1. While technologies like virtualization and elastic scaling provide …
Figure 2. A typical cloud-native e-commerce architecture, exemplifying the four dynamic aspects (A1-A4). i) The network frequently performs elastic scaling …
Figure 3. The CyberOps-Bots framework architecture, comprising three co…
Figure 4. Experimental setup for evaluating the adaptability of CyberOps-Bots …
Figure 5. Figure (a-c) present the experimental results when the test scenario …
Figure 7. Figure (a-c) present the experimental results when the test scenario …
Figure 8. Comparison of the trade-off between defense persistence (Maximum Episode Length) and resilience (Mean Healthy Ratio) across different algorithms
Figure 9. This figure shows the average network vulnerability value at the end of …
Figure 10. Illustration of the LLM agent’s tactical evolution under human intervention. The diagram displays the reasoning chain and defense actions across …

While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules--ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration--performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework that integrates Large Language Models (LLMs) to improve cloud network resilience against adversarial attacks. The architecture consists of an upper-level LLM agent (with ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration) for global awareness and tactical planning, paired with lower-level RL agents trained via heterogeneous separated pre-training for localized defense actions. Experiments on real cloud datasets are reported to show that CyberOps-Bots maintains 68.5% higher network availability and achieves a 34.7% jumpstart performance gain under scenario shifts without retraining, relative to state-of-the-art algorithms; the work positions itself as the first LLM-RL framework with Human-in-the-Loop support for this domain.

Significance. If the empirical claims are substantiated with full experimental details, the work could meaningfully advance hybrid LLM-RL approaches for cyber defense by mitigating the retraining requirement of conventional RL methods while adding interpretability via HITL. This would be relevant for dynamic cloud environments where network structure, scale, and attack patterns evolve.

major comments (2)
  1. [Abstract] The headline claims of 68.5% higher network availability and 34.7% jumpstart gain without retraining are presented without any description of the baselines, statistical tests performed, dataset characteristics (size, topology, attack distributions), or the precise formulas/metrics used to compute these percentages, rendering the central empirical result unverifiable from the provided text.
  2. [Experiments] The scenario-shift protocol is not quantitatively specified (e.g., no reported deltas in node count, attack-vector entropy, intensity distributions, or structural changes between train and test regimes), which directly undermines the load-bearing claim that the observed gains demonstrate genuine cross-scenario generalization rather than proximity to the training distribution.
minor comments (1)
  1. The abstract and architecture description would benefit from explicit citation of the specific real cloud datasets employed and a short reproducibility note on how LLM outputs are mapped to RL action spaces.
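
The missing metric definitions matter because the headline percentages can be read in more than one way. The sketch below shows one plausible reading, and every definition in it is an assumption rather than the paper's formula: availability is taken as a mean healthy ratio over an episode (Figure 8 mentions a "Mean Healthy Ratio"), "X% higher" is taken as a relative gain over a baseline, and jumpstart follows the usual transfer-learning sense of improved initial performance. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def mean_availability(healthy_ratio_per_step: Sequence[float]) -> float:
    """Average fraction of healthy nodes over an episode (assumed metric)."""
    return mean(healthy_ratio_per_step)


def relative_gain_percent(method: float, baseline: float) -> float:
    """Relative improvement of `method` over `baseline`, in percent.

    Note: '68.5% higher' could also mean percentage points
    (method - baseline); the summary does not say which.
    """
    return (method - baseline) / baseline * 100.0


def jumpstart(
    transferred_curve: Sequence[float],
    scratch_curve: Sequence[float],
    k: int = 1,
) -> float:
    """Difference in mean performance over the first k evaluation points.

    `transferred_curve` is the agent moved to a new scenario without
    retraining; `scratch_curve` is a reference agent starting fresh.
    """
    return mean(transferred_curve[:k]) - mean(scratch_curve[:k])
```

For example, a method availability of 0.75 against a baseline of 0.5 is a 50% relative gain but only a 25-point absolute gain, which is the kind of ambiguity the first major comment asks the authors to remove.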

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract and experiments section require additional quantitative details to make the empirical claims fully verifiable. We will incorporate these clarifications in the revised manuscript. Our responses to the major comments are provided below.

  1. Referee: [Abstract] The headline claims of 68.5% higher network availability and 34.7% jumpstart gain without retraining are presented without any description of the baselines, statistical tests performed, dataset characteristics (size, topology, attack distributions), or the precise formulas/metrics used to compute these percentages, rendering the central empirical result unverifiable from the provided text.

    Authors: We agree that the abstract does not currently supply enough context to allow direct verification of the headline performance numbers. In the revised version we will expand the abstract to name the primary state-of-the-art baselines, note the statistical tests used to establish significance, summarize the real-cloud dataset properties (scale, topology family, and attack-type distribution), and state the exact definitions of network availability and jumpstart gain. These additions will be kept concise while rendering the central claims traceable from the abstract alone. revision: yes

  2. Referee: [Experiments] The scenario-shift protocol is not quantitatively specified (e.g., no reported deltas in node count, attack-vector entropy, intensity distributions, or structural changes between train and test regimes), which directly undermines the load-bearing claim that the observed gains demonstrate genuine cross-scenario generalization rather than proximity to the training distribution.

    Authors: The referee correctly notes that the scenario-shift protocol must be described quantitatively to substantiate the generalization claim. We will revise the Experiments section to report the concrete differences between training and test regimes, including measured deltas in node count, attack-vector entropy, attack-intensity distributions, and any topological alterations. These additions will allow readers to assess whether the reported gains reflect genuine cross-scenario robustness. revision: yes
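
As a rough illustration of what such a quantitative shift report could contain, the sketch below computes a node-count delta and the change in Shannon entropy of the attack-type distribution between a training and a test regime. The function names, the report fields, and the choice of Shannon entropy in bits are assumptions made for illustration; the paper's own protocol is not specified in this summary.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def shannon_entropy_bits(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of the empirical label distribution."""
    counts = Counter(labels)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def shift_report(
    train_nodes: int,
    test_nodes: int,
    train_attacks: Sequence[str],
    test_attacks: Sequence[str],
) -> Dict[str, float]:
    """Summarize how far the test regime sits from the training regime."""
    return {
        "node_count_delta": test_nodes - train_nodes,
        "node_count_ratio": test_nodes / train_nodes,
        "attack_entropy_delta_bits": (
            shannon_entropy_bits(test_attacks) - shannon_entropy_bits(train_attacks)
        ),
    }
```

A report like this would let a reader judge whether a test scenario is a genuine shift (for instance, more nodes and a more varied attack mix) or sits close to the training distribution.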

Circularity Check

0 steps flagged

No circularity: empirical results presented as direct experimental outcomes


The paper describes a hierarchical LLM-RL framework for cloud defense and reports performance gains from experiments on real cloud datasets. No equations, derivations, or first-principles predictions appear that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The 68.5% availability and 34.7% jumpstart claims are framed as measured outcomes compared to baselines, not quantities defined in terms of the model's own parameters or prior self-referential results. The architecture draws on external MITRE ATT&CK inspiration and heterogeneous pre-training, but these are design elements validated experimentally rather than circularly justified. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from RL and LLM agent literature plus the unverified claim that the chosen datasets capture realistic attack dynamics; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (2)
  • domain assumption: LLM agents can perform reliable ReAct planning, perception, and tool integration for cyber defense tasks.
    Invoked in the description of the upper-level agent modules.
  • domain assumption: Heterogeneous separated pre-training produces RL agents that execute atomic defense actions reliably in localized regions.
    Stated as the basis for the lower-level agents.



