Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Diksha Goel; Jeff Wang; Kristen Moore; Minjune Kim; Thanh Thi Nguyen

arxiv: 2505.11708 · v3 · pith:D7UVK7LKnew · submitted 2025-05-16 · 💻 cs.CR · cs.LG


Pith reviewed 2026-05-22 13:51 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords: reinforcement learning · explainable AI · cybersecurity · POMDP · prioritized experience replay · cyber agents · multi-layer framework · Q-values


The pith

A multi-layer framework explains RL cyber agents by modeling their exploration dynamics and tracking Q-value shifts over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that reinforcement learning agents simulating cyberattacks can be made understandable through a combined analysis of high-level strategic patterns and low-level action preferences. It models the attack process as a partially observable Markov decision process to reveal how agents balance exploration and exploitation across different phases. At the same time it tracks how Q-values evolve and uses prioritized experience replay to flag the moments when the agent's preferences change most sharply. If this works, defenders could use the resulting maps to anticipate evolving attack strategies and debug their own policies more effectively than with existing post-hoc or domain-specific tools.

Core claim

By treating cyberattacks as a POMDP, the framework exposes exploration-exploitation dynamics and phase-aware behavioural shifts; by tracking the temporal evolution of Q-values and applying prioritised experience replay, it surfaces critical learning transitions and evolving action preferences. When tested on CyberBattleSim environments of increasing complexity, these two layers together produce interpretable views of both strategic and tactical reasoning that previous explainable RL methods have not supplied at scale.
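One way to make the "exploration-exploitation dynamics" of the MDP-level layer concrete is to flag each decision as exploratory or greedy and summarise the flags per attack phase. The sketch below assumes an epsilon-greedy agent and a simple episode boundary between an "early" and a "late" phase; the function names, the boundary, and the toy log are illustrative assumptions, not the paper's implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Return (action_index, explored) for one epsilon-greedy decision."""
    if rng.random() < epsilon:
        # Exploratory step: pick any action uniformly at random.
        return rng.randrange(len(q_values)), True
    # Greedy step: pick the action with the highest current Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__), False


def exploration_rate_by_phase(log, boundary):
    """Summarise a decision log of (episode, explored) pairs.

    Episodes below `boundary` count as the 'early' phase, the rest as 'late'
    (a hypothetical split chosen only for this example).
    """
    counts = {"early": [0, 0], "late": [0, 0]}  # [explored, total]
    for episode, explored in log:
        phase = "early" if episode < boundary else "late"
        counts[phase][0] += int(explored)
        counts[phase][1] += 1
    return {p: (e / t if t else 0.0) for p, (e, t) in counts.items()}


# Toy log: mostly exploratory early on, mostly greedy later.
log = [(0, True), (1, True), (2, False), (10, False), (11, False), (12, True)]
rates = exploration_rate_by_phase(log, boundary=5)
```

In this toy log the early phase explores in two of three steps and the late phase in one of three, which is the kind of phase-level contrast the framework is described as surfacing.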

What carries the argument

The multi-layer explainability framework that operates at the MDP level via POMDP modeling of exploration-exploitation dynamics and at the policy level via temporal Q-value evolution combined with prioritised experience replay.

If this is right

Red-team simulations gain visibility into how agents form and shift their attack strategies over time.
RL policy debugging becomes possible by locating the exact learning transitions where action preferences change.
Phase-aware threat modelling can incorporate the identified behavioural shifts rather than treating the agent as a static black box.
Anticipatory defence planning can use the surfaced Q-value trajectories to predict likely next moves.
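As a rough illustration of "locating the learning transitions where action preferences change", the sketch below aggregates Q-values over states for each action and reports the episodes at which the dominant action switches. Averaging over states is an assumption made here for simplicity (the paper's captions say only "state-aggregated"), and the snapshot format, action names, and values are hypothetical.

```python
def aggregate_q(q_table):
    """Average Q(s, a) over states for each action: {action: mean Q}."""
    sums, counts = {}, {}
    for (_state, action), value in q_table.items():
        sums[action] = sums.get(action, 0.0) + value
        counts[action] = counts.get(action, 0) + 1
    return {a: sums[a] / counts[a] for a in sums}


def dominant_action_shifts(snapshots):
    """Return [(episode, action)] each time the top aggregated action changes."""
    shifts, previous = [], None
    for episode, q_table in snapshots:
        agg = aggregate_q(q_table)
        top = max(agg, key=agg.get)
        if top != previous:
            shifts.append((episode, top))
            previous = top
    return shifts


# Hypothetical Q-table snapshots taken at three training episodes.
snapshots = [
    (10, {("s0", "scan"): 1.0, ("s1", "scan"): 0.5,
          ("s0", "exploit"): 0.2, ("s1", "exploit"): 0.1}),
    (25, {("s0", "scan"): 1.0, ("s1", "scan"): 0.6,
          ("s0", "exploit"): 2.0, ("s1", "exploit"): 1.0}),
    (40, {("s0", "scan"): 0.9, ("s1", "scan"): 0.7,
          ("s0", "exploit"): 3.0, ("s1", "exploit"): 2.0}),
]
shifts = dominant_action_shifts(snapshots)
```

Here the dominant action moves from `scan` to `exploit` between the first and second snapshot and then stays put, so only two entries are reported.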

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same layered approach could be tested on RL agents in non-cyber domains such as robotic navigation to see whether POMDP and PER analysis transfers.
If the insights prove actionable, trainers might deliberately insert monitoring points at the identified transition moments to improve sample efficiency.
Defenders could feed the extracted phase information back into their own RL models to create more responsive countermeasures.

Load-bearing premise

That breaking down POMDP exploration-exploitation patterns and Q-value changes through prioritised experience replay will deliver genuinely new and usable insights into the agent's choices rather than descriptions already available from ordinary RL logs.

What would settle it

Run the framework on an RL agent in a known CyberBattleSim scenario, then compare the insights it generates against a simple log of visited states, rewards, and actions to check whether any previously hidden reasoning steps are actually revealed.

Figures

Figures reproduced from arXiv: 2505.11708 by Diksha Goel, Jeff Wang, Kristen Moore, Minjune Kim, Thanh Thi Nguyen.

**Figure 1.** Progressive discovery and control escalation by the RL-based attacker agent in the ToyCTF environment. (a–b) In the early stage, the agent initially controls only the Client node and discovers the adjacent Website node. (c–d) In the mid-stage, the agent uncovers additional nodes and expands its control via remote exploits and lateral movements. (e–f) In the late stage, the agent atta…

**Figure 2.** Cumulative rewards comparison of attacker policies across environments. The shaded region represents the standard deviation of cumulative rewards across training steps for each agent. (Results indicate that while Exploiting-DQL attains the highest rewards by leveraging a fixed, trained DQL policy, standard DQL is the most effective core agent, providing high cumulative rewards with stable and scalable lear…

**Figure 3.** Impact of Exploration Strategies on cumulative rewards across environments (Results show that Standard Exploration achieves broader state-space coverage, yielding higher rewards and facilitating more robust policy convergence). (a) CTF (b) CC22 (c) CC100 (d) CC500

**Figure 4.** Impact of Exploration Strategies on node discovery rate across environments (Results show that Standard Exploration consistently enables faster and broader discovery, facilitating more informed policy learning). Our benchmarking shows DQL as the most effective and scalable attacker policy across all environments. While the Exploiting DQL variant achieves marginally higher rewards, it leverages a trained DQ…

**Figure 5.** Cumulative rewards comparison in early vs. late attack phases across environments (Results show that agents achieve higher rewards in the late phase, indicating a progression from exploratory behaviour to exploitation-focused strategies as training advances). … at lower cumulative rewards, while Standard Exploration achieved sustained improvement and a late-stage reward surge in CC500 after broad, systematic…

**Figure 6.** Emergence of action preferences via state-aggregated Q-values across episodes. The highlighted dominant action marks the shift from exploration to high-value, environment-specific tactics. DQL’s sharp Q-value gradients reflect discrete exploitation bursts as the agent locks onto effective attack sequences. Action indices for CyberBattleChain (1–15) and CTF (1–18) correspond to the x-axis and are listed in …

**Figure 7.** Temporal evolution of average TD-error (PER priority) across environments. Smaller environments (CTF, CC22) stabilise quickly, whereas larger networks (CC100, CC500) exhibit prolonged surges in TD-error, signalling persistent replay of unstable transitions. These spikes correspond to the performance degradation discussed in Section VI-E (“Unexpected PER Behaviour”). (a) CTF (b) CC22 (c) CC100 (d) CC500

**Figure 8.** Number of Key High-Priority States Across Environments. In smaller networks (CTF, CC22), the agent quickly converges to a small set of key states, indicating early policy convergence. In contrast, larger networks (CC100, CC500) exhibit broader, persistent distributions, reflecting prolonged exploration and slower convergence due to higher state-space complexity. … replay of high-TD transitions. The irregular…

**Figure 9.** Reward progression of DQL and DQL+PER in CC100 and CC500 environments. DQL achieves higher cumulative rewards and steadier learning, while DQL+PER underperforms due to its bias toward high-TD-error transitions, which oversamples transient experiences and slows convergence.

**Figure 10.** Average TD-error of DQL and DQL+PER in CC100 and CC500. DQL learns more stably, whereas PER introduces early instability and biased convergence, leading to inferior rewards. …lived behaviours rather than enduring strategic improvements. Continued oversampling of such transitions after their relevance has decayed injects noise into the replay buffer, disrupts value estimation, and impedes policy stabilisat…

**Figure 11.** Cumulative rewards comparison for the PPO algorithm in early vs. late attack phases across environments. Results show that the PPO agent achieves higher rewards in the late phase, indicating a progression from exploratory behaviour to exploitation-focused strategies as training advances. (a) CTF: Episode 10 (b) CTF: Episode 25 (c) CC22: Episode 20 (d) CC22: Episode 35 (e) CC100: Episode 10 (f) CC100: Episode …

**Figure 12.** Emergence of action preferences through state-aggregated policy logits across episodes. The highlighted dominant action marks the shift from exploratory to high-value, environment-specific tactics. Compared with DQL, PPO’s clipped updates yield smoother yet consistent preference consolidation. Action indices for CyberBattleChain (1–15) and CTF (1–18) correspond to the x-axis and are listed in Table II and…

**Figure 13.** Explainable signatures of policy collapse. (a) Evolution of policy confidence (logit margin) and uncertainty (normalised entropy) over training. A sharp rise in confidence and collapse in entropy mark the transition to an over-confident, low-exploration regime. (b) Reward–entropy dynamics during training. Following the entropy collapse, reward stagnates, indicating behavioural failure despite high confide…

**Figure 14.** Joint evolution of task performance (normalised return), exploration (policy entropy), and decision confidence (policy sharpness) over training. … the agent operates in a stable learning regime: high policy entropy sustains exploration, policy sharpness remains low, and task performance improves steadily. These signals evolve coherently, indicating effective optimisation. Between Episodes 19–26, a critica…
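The captions for Figures 13 and 14 refer to policy confidence (logit margin) and uncertainty (normalised entropy). The paper's exact definitions are not reproduced on this page, so the sketch below assumes the common ones: the margin is the gap between the two largest logits, and the entropy of the softmax distribution is divided by its maximum, log(number of actions). The example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalised_entropy(logits):
    """Shannon entropy of softmax(logits), scaled to [0, 1] by log(n)."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def logit_margin(logits):
    """Gap between the largest and second-largest logit (needs >= 2 actions)."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


# Hypothetical policy logits early and late in training.
early = [0.2, 0.1, 0.0, 0.1]   # near-uniform: high entropy, small margin
late = [6.0, 0.5, 0.2, 0.1]    # one dominant action: low entropy, large margin
signals = {
    "early": (logit_margin(early), normalised_entropy(early)),
    "late": (logit_margin(late), normalised_entropy(late)),
}
```

A rising margin together with a falling normalised entropy is the pattern the Figure 13 caption associates with an over-confident, low-exploration regime.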

Original abstract

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are {predominantly} post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, {supporting use cases such as red-team simulation, RL policy debugging, phase-aware threat modelling and anticipatory defence planning.} By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.
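For readers unfamiliar with PER, the following is a minimal sketch of proportional prioritisation as described by Schaul et al. (2016): each transition gets priority |δ| + ε, and is sampled with probability proportional to that priority raised to a power α. The parameter values, function names, and toy TD-errors are illustrative assumptions and are not taken from the paper under review.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities P(i) = p_i**alpha / sum_k p_k**alpha,
    with priority p_i = |td_error_i| + eps."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, k, alpha=0.6, seed=0):
    """Draw k replay indices (with replacement) according to PER probabilities."""
    rng = random.Random(seed)
    probs = per_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=k)


# Toy buffer of four transitions; index 1 has the largest TD-error.
td_errors = [0.1, 2.0, 0.05, 0.5]
probs = per_probabilities(td_errors)
most_likely = max(range(len(probs)), key=probs.__getitem__)
batch = sample_indices(td_errors, k=8)
```

Setting `alpha=0` recovers uniform sampling, while larger values concentrate replay on high-TD-error transitions, which is why those transitions can serve as markers of critical learning moments.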

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper packages POMDP modeling and prioritized experience replay into a multi-layer framework for RL cyber agents, but the abstract leaves the claimed novel actionable insights unproven against standard logging.


The main point is that the authors combine POMDP analysis at the MDP level to capture exploration-exploitation and phase shifts with policy-level tracking of Q-value evolution via prioritized experience replay. This creates a unified structure for interpreting RL agents in cyber attack simulations like those in CyberBattleSim. The approach aims to be agent- and environment-agnostic, which is a reasonable target for practical use in red-teaming and defense planning.

It addresses a genuine need: RL agents for simulating threats are useful but hard to trust or debug without some window into their decisions. Framing the problem this way and layering existing RL tools for both strategic and tactical views is a clear step forward from purely post-hoc methods. The evaluation across environments of increasing complexity at least shows awareness of scaling issues.

The soft spots are more noticeable. The abstract states that the framework offers interpretable insights but supplies no quantitative results, no baseline comparisons to raw Q-tables or episode traces, and no metrics for whether the outputs are actually more actionable. This makes the central claim rest on an assumption that POMDP plus PER will surface something previously unavailable rather than descriptive summaries already available from standard RL instrumentation. If the full paper has concrete examples and direct comparisons, that would strengthen it; without them the contribution looks incremental.

This work is mainly for researchers applying RL to cybersecurity simulations who want better debugging aids. It is not likely to change core RL theory but could help in applied settings. I would send it for peer review so referees can check the implementation details and results against the claims.

Referee Report

2 major / 1 minor

Summary. The paper proposes a multi-layer explainability framework for RL-based cyber attacker agents. At the MDP level it models attacks as POMDPs to expose exploration-exploitation dynamics and phase-aware behavioural shifts; at the policy level it tracks temporal Q-value evolution and uses Prioritised Experience Replay (PER) to identify critical learning transitions. The framework is evaluated on CyberBattleSim environments of increasing complexity and is presented as agent- and environment-agnostic, yielding actionable behavioural intelligence for red-team simulation, policy debugging, phase-aware threat modelling and anticipatory defence.

Significance. If the claimed distinction from standard RL logging were demonstrated with quantitative metrics and baseline comparisons, the framework could meaningfully advance explainability in RL-driven cyber simulations, supporting practical uses in defensive preparedness. At present the absence of such evidence leaves the significance speculative.

major comments (2)

[Abstract] Abstract: The central claim that the POMDP-level and PER-based analysis produces 'interpretable insights into agent behaviour at scale' and 'actionable behavioural intelligence' is unsupported by any reported quantitative metrics, interpretability scores, or comparisons against raw Q-tables, episode traces, or standard saliency methods.
[Abstract] Abstract and Evaluation description: No baseline comparisons or actionability metrics are supplied to substantiate that the outputs differ from conventional RL instrumentation or that they enable previously unavailable use cases such as anticipatory defence planning.

minor comments (1)

[Abstract] Abstract contains apparent LaTeX artifacts (curly braces around 'predominantly' and the long use-case sentence); these should be removed for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the quantitative support for our claims while preserving the framework's core contributions.


Referee: [Abstract] Abstract: The central claim that the POMDP-level and PER-based analysis produces 'interpretable insights into agent behaviour at scale' and 'actionable behavioural intelligence' is unsupported by any reported quantitative metrics, interpretability scores, or comparisons against raw Q-tables, episode traces, or standard saliency methods.

Authors: We acknowledge that the current evaluation relies primarily on qualitative case studies across CyberBattleSim environments to illustrate phase-aware shifts and critical learning transitions. While these examples demonstrate distinctions from raw Q-tables and episode traces, we did not report formal interpretability scores or direct baseline comparisons. In the revised manuscript we will add a dedicated evaluation subsection that includes quantitative comparisons, such as the proportion of unique behavioral patterns surfaced by the multi-layer framework versus standard logging, and proxy metrics for insight density per episode.

revision: yes

Referee: [Abstract] Abstract and Evaluation description: No baseline comparisons or actionability metrics are supplied to substantiate that the outputs differ from conventional RL instrumentation or that they enable previously unavailable use cases such as anticipatory defence planning.

Authors: The manuscript positions the framework as providing strategic (POMDP) and tactical (PER/Q-value) layers that conventional instrumentation does not combine. However, we agree that explicit substantiation through baselines and actionability metrics is needed to move beyond illustrative examples. We will revise the evaluation section to incorporate baseline comparisons against raw traces and saliency methods, along with discussion of how the identified phase shifts and transition points support anticipatory defence scenarios within the simulated environments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard RL constructs applied to new domain


The paper proposes a multi-layer framework that models cyberattacks as POMDPs to expose exploration-exploitation and phase shifts, then tracks temporal Q-value evolution via PER to surface learning transitions. These are established, externally defined RL techniques (POMDP formulation, Q-learning, prioritized replay) applied to CyberBattleSim environments. No derivation, equation, or central claim reduces by construction to a fitted parameter, self-referential definition, or self-citation chain. The claim of providing agent-agnostic, actionable insights is presented as an empirical outcome of the framework rather than a tautology. This is a self-contained proposal whose validity rests on external evaluation rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard reinforcement learning modeling choices and the domain assumption that cyber attack sequences can be usefully represented as POMDPs; no new free parameters or invented entities are introduced.

axioms (1)

domain assumption: Cyberattacks can be modeled as Partially Observable Markov Decision Processes to expose exploration-exploitation dynamics and phase-aware behavioural shifts.
Stated directly as the basis for the MDP-level layer in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear


At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Q(s, a)←Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)] ... δ_i = r_i + γ max_a' Q(s'_i, a') − Q(s_i, a_i)
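The two expressions quoted above are the standard Q-learning update and the TD-error that PER uses as a priority signal. A minimal tabular sketch follows, with made-up state and action names and arbitrary values for α and γ; it illustrates the formulas only and does not reproduce the paper's deep Q-learning agent.

```python
from collections import defaultdict


def td_error(q, s, a, r, s_next, actions, gamma=0.9):
    """delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    best_next = max(q[(s_next, a2)] for a2 in actions)
    return r + gamma * best_next - q[(s, a)]


def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply Q(s, a) <- Q(s, a) + alpha * delta and return delta."""
    delta = td_error(q, s, a, r, s_next, actions, gamma)
    q[(s, a)] += alpha * delta
    return delta


# Hypothetical two-action attacker; unseen (state, action) pairs default to 0.
q = defaultdict(float)
actions = ["scan", "exploit"]

# Replaying the same transition twice: the TD-error shrinks as Q(s, a) catches up.
delta_first = q_update(q, "client", "scan", 1.0, "website", actions)
delta_second = q_update(q, "client", "scan", 1.0, "website", actions)
```

The shrinking TD-error on repeated replay is what lets a persistently large |δ| flag transitions the agent has not yet absorbed.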

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning
cs.CR 2026-04 unverdicted novelty 6.0

C-MADF learns a structural causal model to restrict response actions in an MDP and uses dual blue-red RL policies to achieve 1.8% false-positive rate and 0.979 F1 on the CICIoT2023 dataset.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] D. Goel, M. H. Ward-Graham, A. Neumann, F. Neumann, H. Nguyen, and M. Guo, “Defending active directory by combining neural network based dynamic program and evolutionary diversity optimisation,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’22, 2022, pp. 1191–1199.

[2] Microsoft Defender Research Team, “CyberBattleSim,” https://github.com/microsoft/cyberbattlesim, 2021. Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei.

[3] “Cyber operations research gym,” https://github.com/cage-challenge/CybORG, 2022. Created by Maxwell Standen, David Bowman, Son Hoang, Toby Richer, Martin Lucas, Richard Van Tassel, Phillip Vu, Mitchell Kiely, KC C., Natalie Konschnik, Joshua Collyer.

[4] F. Terranova, A. Lahmadi, and I. Chrisment, “Scalable and Generalizable RL Agents for Attack Path Discovery via Continuous Invariant Spaces,” in 2025 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), Gold Coast, Australia, Oct. 2025, p. 18. [Online]. Available: https://hal.science/hal-05182437

[5] D. Goel, A. Neumann, F. Neumann, H. Nguyen, and M. Guo, “Evolving reinforcement learning environment to minimize learner’s achievable reward: An application on hardening active directory systems,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’23, 2023, pp. 1348–1356.

[6] D. Goel, “Enhancing network resilience through machine learning-powered graph combinatorial optimization: Applications in cyber defense and information diffusion,” arXiv preprint arXiv:2310.10667, 2023.

[7] C.-Y. Wei and H. Luo, “Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach,” in Conference on learning theory. PMLR, 2021, pp. 4300–4354.

[8] R. Dwivedi, D. Dave, H. Naik, S. Singhal, R. Omer, P. Patel, B. Qian, Z. Wen, T. Shah, G. Morgan et al., “Explainable ai (xai): Core ideas, techniques, and solutions,” ACM computing surveys, vol. 55, no. 9, pp. 1–33, 2023.

[9] B. Gyevnar, C. Wang, C. G. Lucas, S. B. Cohen, and S. V. Albrecht, “Causal explanations for sequential decision-making in multi-agent systems,” arXiv preprint arXiv:2302.10809, 2023.

[10] T. K. Mathes, J. Inman, A. Colón, and S. Khan, “Codex: A cluster-based method for explainable reinforcement learning,” arXiv preprint arXiv:2312.04216, 2023.

[11] P. Madumal, T. Miller, L. Sonenberg, and F. Vetere, “Explainable reinforcement learning through a causal lens,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 03, 2020, pp. 2493–2500.

[12] S. B. Nashed, S. Mahmud, C. V. Goldman, and S. Zilberstein, “Causal explanations for sequential decision making,” Journal of Artificial Intelligence Research, vol. 83, 2025.

[13] J. Yu, W. Guo, Q. Qin, G. Wang, T. Wang, and X. Xing, “AIRS: Explanation for deep reinforcement learning-based security applications,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 7375–7392.

[14] M. Foley, M. Wang, C. Hicks, V. Mavroudis et al., “Inroads into autonomous network defence using explained reinforcement learning,” arXiv preprint arXiv:2306.09318, 2023.

[15] A. Alabdulkarim, M. Singh, G. Mansi, K. Hall, and M. O. Riedl, “Experiential explanations for reinforcement learning,” arXiv preprint arXiv:2210.04723, 2022.

[16] D. K. Sharma, J. Mishra, A. Singh, R. Govil, G. Srivastava, and J. C.-W. Lin, “Explainable artificial intelligence for cybersecurity,” Computers and Electrical Engineering, vol. 103, p. 108356, 2022.

[17] H. T. T. Nguyen, H. Q. Cao, K. V. T. Nguyen, and N. D. K. Pham, “Evaluation of explainable artificial intelligence: Shap, lime, and cam,” in Proceedings of the FPT AI Conference, 2021, pp. 1–6.

[18] R. Alenezi and S. A. Ludwig, “Explainability of cybersecurity threats data using shap,” in 2021 IEEE symposium series on computational intelligence (SSCI). IEEE, 2021, pp. 01–10.

[19] J. Claypoole, S. Cheung, A. Gehani, V. Yegneswaran, and A. Ridley, “Interpreting agent behaviors in reinforcement-learning-based cyber-battle simulation platforms,” arXiv preprint arXiv:2506.08192, 2025.

[20] J. Schwartz and H. Kurniawatti, “Nasim: Network attack simulator,” https://networkattacksimulator.readthedocs.io/, 2019.

[21] A. Molina-Markham, R. K. Winder, and A. Ridley, “Network defense is not a game,” arXiv preprint arXiv:2104.10262, 2021.

[22] I. S. Thompson, A. Caron, C. Hicks, and V. Mavroudis, “Entity-based reinforcement learning for autonomous cyber defence,” in Proceedings of the Workshop on Autonomous Cybersecurity, 2024, pp. 56–67.

[23] D. Goel, K. Moore, M. Guo, D. Wang, M. Kim, and S. Camtepe, “Optimizing cyber defense in dynamic active directories through reinforcement learning,” in European Symposium on Research in Computer Security. Springer, 2024, pp. 332–352.

[24] J. Wiebe, R. A. Mallah, and L. Li, “Learning cyber defence tactics from scratch with multi-agent reinforcement learning,” arXiv preprint arXiv:2310.05939, 2023.

[25] M. Sultana, A. Taylor, and L. Li, “Autonomous network cyber offence strategy through deep reinforcement learning,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, vol. 11746. SPIE, 2021, pp. 490–502.

[26] A. Andrew, S. Spillard, J. Collyer, and N. Dhir, “Developing optimal causal cyber-defence agents via cyber security simulation,” arXiv preprint arXiv:2207.12355, 2022.

[27] D. A. Bierbrauer, R. M. Schabinger, C. Carlin, J. Mullin, J. A. Pavlik, and N. D. Bastian, “Autonomous cyber warfare agents: dynamic reinforcement learning for defensive cyber operations,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications V, vol. 12538. SPIE, 2023, pp. 42–56.

[28] M. Tokic, “Adaptive ϵ-greedy exploration in reinforcement learning,” in Proceedings of the 22nd International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2010, pp. 243–250.

[29] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[30] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.

[31] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016, arXiv:1511.05952. [Online]. Available: https://arxiv.org/abs/1511.05952

Appendix: Intermediate Exploration Trajectory of the RL Attacker Under Partial Observability. To illustrate the ...
