Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

Feiyang Li; Guoshun Nan; Hao Hu; Jipeng Tang; Xinye Cao; Yingchang Jiang; Yixiao Peng; Yuling Liu

arxiv: 2601.07122 · v2 · pith:TYTUNU4Mnew · submitted 2026-01-12 · cs.CR · cs.AI · cs.LG

Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu

Pith reviewed 2026-05-21 16:24 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.LG

keywords cloud network resilience · multi-agent reinforcement learning · large language models · cyber defense · human-in-the-loop · network availability · MITRE ATT&CK

The pith

A two-layer LLM-RL framework defends cloud networks by adapting to new structures and attacks without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard reinforcement learning defenses for cloud networks lose effectiveness when network layouts, scales, or attack patterns shift: they must be retrained to adapt, and they offer little room for human guidance. CyberOps-Bots counters this with an upper LLM layer that plans tactics and recognizes operator intent, alongside lower RL agents that carry out precise local actions. If successful, the result is a defense that keeps networks online under changing conditions while remaining understandable to operators.

Core claim

The authors present CyberOps-Bots as a hierarchical multi-agent system with two layers. An LLM agent equipped with ReAct planning, IPDRR-based perception, memory, and tool integration manages global awareness and high-level defense tactics, following the Tactics-Techniques model of MITRE ATT&CK. Separate pre-trained RL agents handle atomic resource deployment and isolation tasks in local regions. Experiments on real cloud datasets are reported to show 68.5 percent higher network availability than prior methods and a 34.7 percent jumpstart gain when scenarios shift without retraining.
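Both headline percentages are relative comparisons, and the abstract does not define either metric. As a rough illustration only, the sketch below assumes that availability is the mean fraction of healthy nodes over an episode and that jumpstart follows the common transfer-learning reading (mean return over the first few episodes after a scenario shift). The function names, the window size, and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import List


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (ours - baseline) / baseline


def mean_availability(healthy_ratio_per_step: List[float]) -> float:
    """Assumed metric: average fraction of healthy nodes over one episode."""
    return mean(healthy_ratio_per_step)


def jumpstart(returns_after_shift: List[float], window: int = 5) -> float:
    """Assumed metric: mean return over the first `window` episodes after a shift."""
    return mean(returns_after_shift[:window])


# Made-up numbers, for illustration only.
ours_avail = mean_availability([1.0, 0.75, 0.5])   # 0.75
base_avail = mean_availability([0.5, 0.5, 0.5])    # 0.5
availability_gain = relative_gain(ours_avail, base_avail)

ours_js = jumpstart([10, 12, 14, 16, 18, 50])      # first 5 episodes only
base_js = jumpstart([8, 9, 10, 11, 12, 50])
jumpstart_gain = relative_gain(ours_js, base_js)

print(f"availability gain: {availability_gain:.1f}%")
print(f"jumpstart gain: {jumpstart_gain:.1f}%")
```

Under a different definition (for example, an absolute difference in percentage points rather than a ratio) the same raw measurements would yield different headline numbers, which is why the exact formulas matter.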

What carries the argument

The two-layer hierarchy that pairs an LLM agent for strategic planning and human-in-the-loop oversight with lower-level RL agents for reliable local execution.
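The division of labor can be pictured with a minimal sketch. It is not the authors' implementation: the tactic names, the `Observation` fields, and the rule table that stands in for the LLM planner are all assumptions made for illustration, and the lower-level "agents" are stubs rather than trained RL policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical tactic names; the paper's actual tactic set is not listed here.
TACTICS = ("deploy", "isolate", "restore")


@dataclass
class Observation:
    region: str
    compromised_ratio: float  # fraction of nodes flagged as compromised


# A lower-level policy maps a local observation to one atomic action string.
Policy = Callable[[Observation], str]


def stub_policy(tactic: str) -> Policy:
    """Stand-in for a pre-trained RL agent specialised in one tactic."""
    def act(obs: Observation) -> str:
        return f"{tactic}@{obs.region}"
    return act


def plan_tactic(obs: Observation, operator_hint: Optional[str] = None) -> str:
    """Stand-in for the LLM planner: a rule table instead of a model call."""
    if operator_hint in TACTICS:  # human-in-the-loop override
        return operator_hint
    if obs.compromised_ratio > 0.5:
        return "isolate"
    if obs.compromised_ratio > 0.0:
        return "restore"
    return "deploy"


def defend(observations: List[Observation],
           agents: Dict[str, Policy],
           operator_hint: Optional[str] = None) -> List[str]:
    """Upper layer picks a tactic per region; lower layer executes it."""
    return [agents[plan_tactic(obs, operator_hint)](obs) for obs in observations]


agents = {t: stub_policy(t) for t in TACTICS}
obs = [Observation("r1", 0.8), Observation("r2", 0.0)]
print(defend(obs, agents))
print(defend(obs, agents, operator_hint="restore"))
```

The point of the split is that the planner only ever emits a tactic from a fixed menu, so swapping the network size or adding regions changes the list of observations but not the trained lower-level policies.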

If this is right

Defense policies remain effective when network size or topology changes without any model retraining.
Operators can inject intent through the LLM layer to steer responses while RL agents still execute reliably.
Pre-trained RL components can be reused across different regions, reducing the cost of scaling defenses.
Overall system resilience rises because global planning adapts faster than pure RL approaches while local actions stay deterministic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of language-based planning from action execution could be tested in other dynamic control settings such as power-grid load balancing under shifting demand.
Over time the framework might allow security teams to update high-level goals in natural language rather than rewriting reward functions for each new threat class.
Deployment logs from the LLM planning layer could supply traceable records that help auditors verify compliance during incidents.

Load-bearing premise

The real cloud datasets used for testing accurately capture the range of network structures, scales, attack strategies, and intensities that occur in live production environments, and LLM outputs can be converted into RL actions without introducing new errors or vulnerabilities.

What would settle it

Running the framework on a production-scale cloud that encounters an attack intensity or network configuration absent from the study datasets and observing whether availability falls below the reported levels or LLM-generated actions produce execution failures.

Figures

Figures reproduced from arXiv: 2601.07122 by Feiyang Li, Guoshun Nan, Hao Hu, Jipeng Tang, Xinye Cao, Yingchang Jiang, Yixiao Peng, Yuling Liu.

**Figure 2.** A typical cloud-native e-commerce architecture, exemplifying the four dynamic aspects (A1-A4). i) The network frequently performs elastic scaling …

**Figure 3.** The CyberOps-Bots framework architecture, comprising three co…

**Figure 4.** Experimental setup for evaluating the adaptability of CyberOps-Bots

**Figure 5.** Figure (a-c) present the experimental results when the test scenario …

**Figure 7.** Figure (a-c) present the experimental results when the test scenario …

**Figure 8.** Comparison of the trade-off between defense persistence (Maximum Episode Length) and resilience (Mean Healthy Ratio) across different algorithms

**Figure 9.** This figure shows the average network vulnerability value at the end of …

**Figure 10.** Illustration of the LLM agent’s tactical evolution under human intervention. The diagram displays the reasoning chain and defense actions across …

Original abstract

While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules--ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration--performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper proposes a hierarchical LLM-RL setup for cloud defense that claims strong no-retraining robustness, but the scenario shifts lack the quantitative detail needed to support that claim.

The main point is that CyberOps-Bots layers an LLM-based planner with ReAct, memory, and MITRE-inspired tactics over pre-trained RL agents to handle cloud network defense. It reports 68.5 percent higher availability and 34.7 percent better jumpstart performance under scenario shifts without retraining. If those numbers survive scrutiny, the separation of high-level planning from low-level execution could reduce retraining costs in changing environments.

The architecture itself is a reasonable attempt to combine LLM adaptability with RL reliability while adding human-in-the-loop support for interpretability. The authors correctly identify that standard RL defenses struggle when network scale, attack vectors, or intensity change. To give credit where due, the two-layer design and the use of heterogeneous pre-training for the lower agents show some care in addressing the practical limits of pure RL or pure LLM approaches. The citation to MITRE ATT&CK as inspiration is also a sensible anchor for the planning module.

The soft spot is the experimental section. The abstract gives no numbers on how the test scenarios actually differ from training in node count, attack entropy, or intensity distribution. Without that, it is difficult to tell whether the gains reflect genuine generalization or simply test cases that stayed close to the training distribution. Baselines, statistical tests, and dataset characteristics are also missing from the summary, so the central empirical result stays hard to evaluate. The assumption that LLM-generated plans translate safely into RL actions without introducing new failure modes also needs direct evidence.

This paper is mainly for researchers working on AI for cybersecurity and cloud operations who already follow multi-agent RL or LLM-agent work. A reader interested in practical robustness ideas might extract useful architecture points even if the evaluation needs tightening. I would send it to peer review so that referees can check the full experiments and the precise definition of the scenario shifts.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework that integrates Large Language Models (LLMs) to improve cloud network resilience against adversarial attacks. The architecture consists of an upper-level LLM agent (with ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration) for global awareness and tactical planning, paired with lower-level RL agents trained via heterogeneous separated pre-training for localized defense actions. Experiments on real cloud datasets are reported to show that CyberOps-Bots maintains 68.5% higher network availability and achieves a 34.7% jumpstart performance gain under scenario shifts without retraining, relative to state-of-the-art algorithms; the work positions itself as the first LLM-RL framework with Human-in-the-Loop support for this domain.

Significance. If the empirical claims are substantiated with full experimental details, the work could meaningfully advance hybrid LLM-RL approaches for cyber defense by mitigating the retraining requirement of conventional RL methods while adding interpretability via HITL. This would be relevant for dynamic cloud environments where network structure, scale, and attack patterns evolve.

major comments (2)

[Abstract] Abstract: The headline claims of 68.5% higher network availability and 34.7% jumpstart gain without retraining are presented without any description of the baselines, statistical tests performed, dataset characteristics (size, topology, attack distributions), or the precise formulas/metrics used to compute these percentages, rendering the central empirical result unverifiable from the provided text.
[Experiments] Experiments section: The scenario-shift protocol is not quantitatively specified (e.g., no reported deltas in node count, attack-vector entropy, intensity distributions, or structural changes between train and test regimes), which directly undermines the load-bearing claim that the observed gains demonstrate genuine cross-scenario generalization rather than proximity to the training distribution.
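To make the second comment concrete, the sketch below shows one way a train/test shift could be reported numerically: the change in node count, the change in Shannon entropy of the attack-type distribution, and the set of attack types that appear only at test time. The scenario dictionaries, their keys, and the attack labels are hypothetical; the manuscript may measure shift differently.

```python
import math
from collections import Counter
from typing import Dict, List


def shannon_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def shift_report(train: Dict, test: Dict) -> Dict:
    """Summarise how a test scenario differs from the training scenario.

    Each scenario is a dict with 'nodes' (int) and 'attacks' (list of labels).
    """
    return {
        "node_delta": test["nodes"] - train["nodes"],
        "node_ratio": test["nodes"] / train["nodes"],
        "entropy_delta_bits": (shannon_entropy(test["attacks"])
                               - shannon_entropy(train["attacks"])),
        "unseen_attack_types": sorted(set(test["attacks"]) - set(train["attacks"])),
    }


# Made-up scenarios, for illustration only.
train = {"nodes": 50, "attacks": ["ddos", "ddos"]}
test = {"nodes": 200, "attacks": ["ddos", "apt", "ddos", "apt"]}
report = shift_report(train, test)
print(report)
```

A table of such values per experiment would let a reader judge how far each test regime sits from the training distribution.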

minor comments (1)

The abstract and architecture description would benefit from explicit citation of the specific real cloud datasets employed and a short reproducibility note on how LLM outputs are mapped to RL action spaces.
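One simple form such a reproducibility note could take is a validation step between the planner's text output and the RL action space. The sketch below assumes, purely for illustration, that the LLM replies with a JSON object holding a `tactic` and a `region`; the field names, the allowed tactic set, and the reject-on-error policy are hypothetical rather than taken from the paper.

```python
import json
from typing import Optional, Set, Tuple

# Hypothetical tactic menu; the real action space is defined by the paper.
ALLOWED_TACTICS = {"deploy", "isolate", "restore"}


def parse_plan(llm_text: str, known_regions: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (tactic, region) if the reply is a valid plan, otherwise None."""
    try:
        plan = json.loads(llm_text)
    except json.JSONDecodeError:
        return None  # free-form or truncated text is rejected, not guessed at
    if not isinstance(plan, dict):
        return None
    tactic, region = plan.get("tactic"), plan.get("region")
    if not isinstance(tactic, str) or not isinstance(region, str):
        return None
    if tactic not in ALLOWED_TACTICS or region not in known_regions:
        return None  # unknown tactic or region never reaches an RL agent
    return tactic, region


regions = {"r1", "r2"}
print(parse_plan('{"tactic": "isolate", "region": "r1"}', regions))
print(parse_plan('{"tactic": "format_disk", "region": "r1"}', regions))
print(parse_plan("isolate region r1 please", regions))
```

Rejecting anything outside the fixed menu keeps malformed or adversarially induced LLM output from being executed, at the cost of occasionally dropping a usable but badly formatted plan.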

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract and experiments section require additional quantitative details to make the empirical claims fully verifiable. We will incorporate these clarifications in the revised manuscript. Our responses to the major comments are provided below.

Referee: [Abstract] Abstract: The headline claims of 68.5% higher network availability and 34.7% jumpstart gain without retraining are presented without any description of the baselines, statistical tests performed, dataset characteristics (size, topology, attack distributions), or the precise formulas/metrics used to compute these percentages, rendering the central empirical result unverifiable from the provided text.

Authors: We agree that the abstract does not currently supply enough context to allow direct verification of the headline performance numbers. In the revised version we will expand the abstract to name the primary state-of-the-art baselines, note the statistical tests used to establish significance, summarize the real-cloud dataset properties (scale, topology family, and attack-type distribution), and state the exact definitions of network availability and jumpstart gain. These additions will be kept concise while rendering the central claims traceable from the abstract alone. revision: yes
Referee: [Experiments] Experiments section: The scenario-shift protocol is not quantitatively specified (e.g., no reported deltas in node count, attack-vector entropy, intensity distributions, or structural changes between train and test regimes), which directly undermines the load-bearing claim that the observed gains demonstrate genuine cross-scenario generalization rather than proximity to the training distribution.

Authors: The referee correctly notes that the scenario-shift protocol must be described quantitatively to substantiate the generalization claim. We will revise the Experiments section to report the concrete differences between training and test regimes, including measured deltas in node count, attack-vector entropy, attack-intensity distributions, and any topological alterations. These additions will allow readers to assess whether the reported gains reflect genuine cross-scenario robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results presented as direct experimental outcomes

The paper describes a hierarchical LLM-RL framework for cloud defense and reports performance gains from experiments on real cloud datasets. No equations, derivations, or first-principles predictions appear that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The 68.5% availability and 34.7% jumpstart claims are framed as measured outcomes compared to baselines, not quantities defined in terms of the model's own parameters or prior self-referential results. The architecture draws on external MITRE ATT&CK inspiration and heterogeneous pre-training, but these are design elements validated experimentally rather than circularly justified. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from RL and LLM agent literature plus the unverified claim that the chosen datasets capture realistic attack dynamics; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (2)

domain assumption: LLM agents can perform reliable ReAct planning, perception, and tool integration for cyber defense tasks.
Invoked in the description of the upper-level agent modules.
domain assumption: Heterogeneous separated pre-training produces RL agents that execute atomic defense actions reliably in localized regions.
Stated as the basis for the lower-level agents.

pith-pipeline@v0.9.0 · 5853 in / 1252 out tokens · 55174 ms · 2026-05-21T16:24:22.254125+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "hierarchical multi-agent reinforcement learning framework empowered by Large Language Models... ReAct planning, IPDRR-based perception, long-short term memory"

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "heterogeneous separated pre-training... specialized reward functions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 6 internal anchors

[1]

Advancements and challenges in cloud computing: Multi- cloud management, security, and ai-driven threat mitigation,

R. Dilworth, “Advancements and challenges in cloud computing: Multi- cloud management, security, and ai-driven threat mitigation,” inPro- ceedings of the 2024 7th Artificial Intelligence and Cloud Computing Conference, 2024, pp. 639–645

work page 2024
[2]

ebpf: A new approach to cloud-native observability, networking and security for current (5g) and future mobile networks (6g and beyond),

D. Soldani, P. Nahi, H. Bour, S. Jafarizadeh, M. F. Soliman, L. Di Gio- vanna, F. Monaco, G. Ognibene, and F. Risso, “ebpf: A new approach to cloud-native observability, networking and security for current (5g) and future mobile networks (6g and beyond),”IEEE Access, vol. 11, pp. 57 174–57 202, 2023

work page 2023
[3]

Empowering cloud computing with network acceleration: A survey,

L. Rosa, L. Foschini, and A. Corradi, “Empowering cloud computing with network acceleration: A survey,”IEEE Communications Surveys & Tutorials, vol. 26, no. 4, pp. 2729–2768, 2024

work page 2024
[4]

Autonomous cloud networking in 2024: Leveraging ai and intent-based architectures for self-healing and optimization,

K. Venkata, “Autonomous cloud networking in 2024: Leveraging ai and intent-based architectures for self-healing and optimization,” 2025

work page 2024
[5]

Cloud network anomaly detection using machine and deep learning tech- niques— recent research advancements,

A. M. Abdallah, A. Saif Rashed Obaid Alkaabi, G. Bark Nasser Douman Alameri, S. H. Rafique, N. S. Musa, and T. Murugan, “Cloud network anomaly detection using machine and deep learning tech- niques— recent research advancements,”IEEE Access, vol. 12, pp. 56 749–56 773, 2024

work page 2024
[6]

Deep reinforcement learning for cyber security,

T. T. Nguyen and V . J. Reddi, “Deep reinforcement learning for cyber security,”IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 3779–3795, 2023

work page 2023
[7]

Autonomous network defence using reinforcement learning,

M. Foley, C. Hicks, K. Highnam, and V . Mavroudis, “Autonomous network defence using reinforcement learning,” inProceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, ser. ASIA CCS ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 1252–1254. [Online]. Available: https://doi.org/10.1145/3488932.3527286

work page doi:10.1145/3488932.3527286 2022
[8]

Optimal decision making approach for cyber security defense using evolutionary game,

H. Hu, Y . Liu, C. Chen, H. Zhang, and Y . Liu, “Optimal decision making approach for cyber security defense using evolutionary game,”IEEE Transactions on Network and Service Management, vol. 17, no. 3, pp. 1683–1700, 2020

work page 2020
[9]

Automated cyber defence: A review,

S. Vyas, J. Hannay, A. Bolton, and P. P. Burnap, “Automated cyber defence: A review,”arXiv preprint arXiv:2303.04926, 2023

work page arXiv 2023
[10]

Dynpen: Automated penetration testing in dynamic network scenarios using deep reinforcement learning,

Q. Li, R. Wang, D. Li, F. Shi, M. Zhang, A. Chattopadhyay, Y . Shen, and Y . Li, “Dynpen: Automated penetration testing in dynamic network scenarios using deep reinforcement learning,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 8966–8981, 2024

work page 2024
[11]

Imbalance in the cloud: An analysis on alibaba cluster trace,

C. Lu, K. Ye, G. Xu, C.-Z. Xu, and T. Bai, “Imbalance in the cloud: An analysis on alibaba cluster trace,” in2017 IEEE International Conference on Big Data (Big Data), 2017, pp. 2884–2892

work page 2017
[12]

Heterogeneity and dynamicity of clouds at scale: Google trace analy- sis,

C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, “Heterogeneity and dynamicity of clouds at scale: Google trace analy- sis,” inProceedings of the third ACM symposium on cloud computing, 2012, pp. 1–13

work page 2012
[13]

A case study of the capital one data breach,

N. Novaes Neto, S. Madnick, A. Moraes G de Paula, and N. Malara Borges, “A case study of the capital one data breach,”Stuart E. and Moraes G. de Paula, Anchises and Malara Borges, Natasha, A Case Study of the Capital One Data Breach (January 1, 2020), 2020

work page 2020
[14]

How Google Cloud Blocked Largest Layer 7 DDoS Attack at 46 Million RPS,

Google Cloud Armor, “How Google Cloud Blocked Largest Layer 7 DDoS Attack at 46 Million RPS,” 2022. [Online]. Available: https://cloud.google.com/blog/products/identity-security/ how-google-cloud-blocked-largest-layer-7-ddos-attack-at-46-million-rps

work page 2022
[15]

Causally aware reinforcement learning agents for autonomous cyber defence,

T. Purves, K. G. Kyriakopoulos, S. Jenkins, I. Phillips, and T. Dudman, “Causally aware reinforcement learning agents for autonomous cyber defence,”Knowledge-Based Systems, vol. 304, p. 112521, 2024

work page 2024
[16]

Learning games for defending advanced persistent threats in cyber systems,

T. Zhu, D. Ye, Z. Cheng, W. Zhou, and P. S. Yu, “Learning games for defending advanced persistent threats in cyber systems,”IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 4, pp. 2410–2422, 2023

work page 2023
[17]

Reinforcement- learning-based apt defense for large-scale smart grids,

L. Xiao, H. Liu, Z. Lv, Y . Chen, Z. Lin, and Y . Du, “Reinforcement- learning-based apt defense for large-scale smart grids,”IEEE Internet of Things Journal, vol. 12, no. 9, pp. 11 917–11 925, 2025

work page 2025
[18]

Mitre att&ck: State of the art and way forward,

B. Al-Sada, A. Sadighian, and G. Oligeri, “Mitre att&ck: State of the art and way forward,”ACM Comput. Surv., vol. 57, no. 1, Oct. 2024. [Online]. Available: https://doi.org/10.1145/3687300

work page doi:10.1145/3687300 2024
[19]

Llm4game: Multi- agent reinforcement learning with knowledge injection for dynamic defense resource allocation in cloud storage,

Y . Peng, H. Hu, F. Li, Y . Jiang, J. Tang, and Y . Liu, “Llm4game: Multi- agent reinforcement learning with knowledge injection for dynamic defense resource allocation in cloud storage,”Computer Networks, p. 111748, 2025

work page 2025
[20]

IDS-agent: An LLM agent for explainable intrusion detection in iot networks,

Y . Li, Z. Xiang, N. D. Bastian, D. Song, and B. Li, “IDS-agent: An LLM agent for explainable intrusion detection in iot networks,” 2025. [Online]. Available: https://openreview.net/forum?id=uuCcK4cmlH

work page 2025
[21]

Dual-reinforcement-learning-based attack path prediction for 5g industrial cyber–physical systems,

X. Li, X. Hu, and T. Jiang, “Dual-reinforcement-learning-based attack path prediction for 5g industrial cyber–physical systems,”IEEE Internet of Things Journal, vol. 11, no. 1, pp. 50–58, 2024

work page 2024
[22]

Hierarchical multi-agent reinforcement learning for cyber network defense,

A. V . Singh, E. Rathbun, E. Graham, L. Oakley, S. Boboila, A. Oprea, and P. Chin, “Hierarchical multi-agent reinforcement learning for cyber network defense,”arXiv preprint arXiv:2410.17351, 2024

work page arXiv 2024
[23]

Defending against apt attacks in cloud computing environments using grouped mul- tiagent deep reinforcement learning,

J. Chen, X. Lan, Q. Zhang, W. Ma, W. Fang, and J. He, “Defending against apt attacks in cloud computing environments using grouped mul- tiagent deep reinforcement learning,”IEEE Internet of Things Journal, vol. 12, no. 12, pp. 19 459–19 470, 2025

work page 2025
[24]

Deep-reinforcement- learning-based self-evolving moving target defense approach against unknown attacks,

Y . Cao, K. Liu, Y . Lin, L. Wang, and Y . Xia, “Deep-reinforcement- learning-based self-evolving moving target defense approach against unknown attacks,”IEEE Internet of Things Journal, vol. 11, no. 20, pp. 33 027–33 039, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18

work page 2024
[25]

A multiagent deep reinforcement learning autonomous security manage- ment approach for internet of things,

B. Ren, Y . Tang, H. Wang, Y . Wang, J. Liu, G. Gao, and W. Wei, “A multiagent deep reinforcement learning autonomous security manage- ment approach for internet of things,”IEEE Internet of Things Journal, vol. 11, no. 15, pp. 25 600–25 612, 2024

work page 2024
[26]

Recent developments of game theory and reinforcement learning approaches: A systematic review,

G. Jain, A. Kumar, and S. A. Bhat, “Recent developments of game theory and reinforcement learning approaches: A systematic review,” IEEE Access, vol. 12, pp. 9999–10 011, 2024

work page 2024
[27]

Game-theoretic apt defense: An experimental study on robotics,

S. Rass, S. K ¨onig, J. Wachter, V . Mayoral-Vilches, and E. Panaousis, “Game-theoretic apt defense: An experimental study on robotics,”Com- puters & Security, vol. 132, p. 103328, 2023

work page 2023
[28]

Optimal deception asset deployment in cybersecurity: A nash q-learning approach in multi-agent stochastic games,

G. Kong, F. Chen, X. Yang, G. Cheng, S. Zhang, and W. He, “Optimal deception asset deployment in cybersecurity: A nash q-learning approach in multi-agent stochastic games,”Applied Sciences, vol. 14, no. 1,

work page
[29]

Available: https://www.mdpi.com/2076-3417/14/1/357

[Online]. Available: https://www.mdpi.com/2076-3417/14/1/357

work page 2076
[30]

Resilient cyber-physical system hon- eypots for cyberattacker engagement,

A. S. Mohamed and D. Kundur, “Resilient cyber-physical system hon- eypots for cyberattacker engagement,”IEEE Transactions on Industrial Informatics, vol. 21, no. 11, pp. 8585–8595, 2025

work page 2025
[31]

Ambient intelligence approach: Internet of things based decision performance analysis for intrusion detection,

T. Ramana, M. Thirunavukkarasan, A. S. Mohammed, G. G. Devarajan, and S. M. Nagarajan, “Ambient intelligence approach: Internet of things based decision performance analysis for intrusion detection,”Computer Communications, vol. 195, pp. 315–322, 2022

work page 2022
[32]

Moving target defense (mtd) for 6g edge-to-cloud continuum: A cognitive perspective,

W. Soussi, G. G ¨ur, and B. Stiller, “Moving target defense (mtd) for 6g edge-to-cloud continuum: A cognitive perspective,”IEEE Network, vol. 39, no. 1, pp. 149–156, 2025

work page 2025
[33]

Markov game based on reinforcement learning solution against cyber–physical attacks in smart grid,

K. Bitirgen and ¨U. B. Filik, “Markov game based on reinforcement learning solution against cyber–physical attacks in smart grid,”Expert Systems with Applications, vol. 255, p. 124607, 2024

work page 2024
[34]

Enhancing underwater iot se- curity: A collaborative pursuit strategy using multi-agent reinforcement learning,

Y . Hou, G. Han, F. Zhang, and C. Lin, “Enhancing underwater iot se- curity: A collaborative pursuit strategy using multi-agent reinforcement learning,”IEEE Internet of Things Magazine, vol. 7, no. 5, pp. 112–118, 2024

work page 2024
[35]

A method of network attack-defense game and collaborative defense decision-making based on hierarchical multi-agent reinforcement learning,

Y . Tang, J. Sun, H. Wang, J. Deng, L. Tong, and W. Xu, “A method of network attack-defense game and collaborative defense decision-making based on hierarchical multi-agent reinforcement learning,”Computers & Security, vol. 142, p. 103871, 2024

work page 2024
[36]

Finding the optimal security policies for autonomous cyber operations with competitive reinforce- ment learning,

G. Mcdonald, L. Li, and R. A. Mallah, “Finding the optimal security policies for autonomous cyber operations with competitive reinforce- ment learning,”IEEE Access, vol. 12, pp. 120 292–120 305, 2024

work page 2024
[37]

A game-theoretic method for defending against advanced persistent threats in cyber systems,

L. Zhang, T. Zhu, F. K. Hussain, D. Ye, and W. Zhou, “A game-theoretic method for defending against advanced persistent threats in cyber systems,”IEEE Transactions on Information Forensics and Security, vol. 18, pp. 1349–1364, 2023

work page 2023
[38]

A Survey of Large Language Models

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Donget al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, vol. 1, no. 2, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

The rise and potential of large language model based agents: A survey,

Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhouet al., “The rise and potential of large language model based agents: A survey,”Science China Information Sciences, vol. 68, no. 2, p. 121101, 2025

work page 2025
[41]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models,”arXiv preprint arXiv:2410.05229, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Evaluating large language models on controlled generation tasks,

J. Sun, Y . Tian, W. Zhou, N. Xu, Q. Hu, R. Gupta, J. Wieting, N. Peng, and X. Ma, “Evaluating large language models on controlled generation tasks,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 3155–3168...

work page 2023
[43]

NIST Cybersecurity Framework,

N. I. of Standards and Technology, “NIST Cybersecurity Framework,” Nov. 2014. [Online]. Available: https://www.nist.gov/cyberframework

work page 2014
[44] A. Andrew, S. Spillard, J. Collyer, and N. Dhir, “Developing optimal causal cyber-defence agents via cyber security simulation,” 2022. [Online]. Available: https://arxiv.org/abs/2207.12355

[45] I. Symes Thompson, A. Caron, C. Hicks, and V. Mavroudis, “Entity-based reinforcement learning for autonomous cyber defence,” in Proceedings of the Workshop on Autonomous Cybersecurity, ser. AutonomousCyber ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 56–67. [Online]. Available: https://doi.org/10.1145/3689933.3690835

[46] ISO/IEC Joint Technical Committee 1, Subcommittee 27 – Information security and privacy protection, “ISO/IEC 27001:2022,” Geneva, Switzerland, 2022. [Online]. Available: https://www.iso.org/standard/27001

[47] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in The Eleventh International Conference on Learning Representations, 2022.

[48] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 8634–8652. [Online]. Available: https://proceedings.neurip...

[49] Communications Security Establishment (CSE) and The Canadian Institute for Cybersecurity (CIC), “A Realistic Cyber Defense Dataset,” [Online]. Available: https://registry.opendata.aws/cse-cic-ids2018

[51] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ... “Qwen3 technical report,” [Online]. Available: https://arxiv.org/abs/2505.09388

[53] W. Liu, W. Cai, K. Jiang, G. Cheng, Y. Wang, J. Wang, J. Cao, L. Xu, C. Mu, and C. Sun, “Xuance: A comprehensive and unified deep reinforcement learning library,” 2023. [Online]. Available: https://arxiv.org/abs/2312.16248

[54] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 611–24 624. [Online]. Available: https://pr...

[55] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.

[56] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson, “Monotonic value function factorisation for deep multi-agent reinforcement learning,” Journal of Machine Learning Research, vol. 21, no. 178, pp. 1–51, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-081.html

[57] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., “Value-decomposition networks for cooperative multi-agent learning,” arXiv preprint arXiv:1706.05296, 2017.

[58] S. Zhou, J. Liu, Y. Lu, J. Yang, Y. Zhang, and J. Chen, “Mind the gap: Towards generalizable autonomous penetration testing via domain randomization and meta-reinforcement learning,” Frontiers of Information Technology & Electronic Engineering, vol. 26, no. 12, pp. 2511–2528, 2025. [Online]. Available: https://doi.org/10.1631/FITEE.2500100

[59] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey,” Journal of Machine Learning Research, vol. 10, no. 7, 2009.

[60] Y. Li, F. Wei, C. Zhang, and H. Zhang, “EAGLE-3: Scaling up inference acceleration of large language models via training-time test,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01840
