
arxiv: 2505.11708 · v3 · pith:D7UVK7LKnew · submitted 2025-05-16 · 💻 cs.CR · cs.LG

Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Pith reviewed 2026-05-22 13:51 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords reinforcement learning, explainable AI, cybersecurity, POMDP, prioritized experience replay, cyber agents, multi-layer framework, Q-values

The pith

A multi-layer framework explains RL cyber agents by modeling their exploration dynamics and tracking Q-value shifts over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that reinforcement learning agents simulating cyberattacks can be made understandable through a combined analysis of high-level strategic patterns and low-level action preferences. It models the attack process as a partially observable Markov decision process to reveal how agents balance exploration and exploitation across different phases. At the same time it tracks how Q-values evolve and uses prioritized experience replay to flag the moments when the agent's preferences change most sharply. If this works, defenders could use the resulting maps to anticipate evolving attack strategies and debug their own policies more effectively than with existing post-hoc or domain-specific tools.

Core claim

By treating cyberattacks as a POMDP, the framework exposes exploration-exploitation dynamics and phase-aware behavioural shifts; by tracking the temporal evolution of Q-values and applying prioritised experience replay, it surfaces critical learning transitions and evolving action preferences. When tested on CyberBattleSim environments of increasing complexity, these two layers together produce interpretable views of both strategic and tactical reasoning that previous explainable RL methods have not supplied at scale.
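As a rough illustration of how prioritised experience replay can surface "critical learning transitions", the sketch below uses the standard proportional scheme from Schaul et al. (priority p_i = |δ_i| + ε, sampling probability P(i) = p_i^α / Σ_k p_k^α). The function names, the α and ε defaults, and the sample TD-errors are assumptions made for illustration; they are not taken from the paper.

```python
def per_sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    """Proportional PER: p_i = |delta_i| + eps, P(i) = p_i**alpha / sum_k p_k**alpha."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def top_priority_transitions(td_errors, k=3):
    """Indices of the k transitions with the largest |TD-error|, highest first."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]),
                   reverse=True)
    return order[:k]


# Hypothetical TD-errors for five stored transitions.
td = [0.05, -2.0, 0.3, 1.1, -0.01]
probs = per_sampling_probs(td)
flagged = top_priority_transitions(td, k=2)  # transitions an analyst might inspect first
```

Under this reading, the transitions with the largest absolute TD-error are both the ones replayed most often and the ones flagged for inspection.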

What carries the argument

A multi-layer explainability framework that operates at the MDP level, via POMDP modelling of exploration-exploitation dynamics, and at the policy level, via temporal Q-value evolution combined with prioritised experience replay.
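The policy-level layer can be pictured with a small sketch: given state-aggregated Q-values per action recorded at each episode, find the dominant action and report the episodes where it changes. The data layout, the function names, and the numbers are hypothetical; the paper's actual aggregation procedure is not described on this page.

```python
def dominant_actions(q_history):
    """q_history[e][a] is the state-aggregated Q-value of action a at episode e.

    Returns the index of the highest-valued action per episode
    (ties go to the lowest index).
    """
    return [max(range(len(q)), key=lambda a: q[a]) for q in q_history]


def preference_shifts(q_history):
    """(episode, old_action, new_action) for every change in the dominant action."""
    dom = dominant_actions(q_history)
    return [(ep, dom[ep - 1], dom[ep])
            for ep in range(1, len(dom))
            if dom[ep] != dom[ep - 1]]


# Hypothetical aggregated Q-values for three actions over four episodes.
history = [
    [0.1, 0.1, 0.0],
    [0.2, 0.5, 0.1],
    [0.2, 0.9, 0.3],
    [0.1, 0.8, 1.4],
]
shifts = preference_shifts(history)
```

Each reported shift is a candidate "learning transition" in the sense used above; whether it is meaningful still has to be judged against the environment.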

If this is right

  • Red-team simulations gain visibility into how agents form and shift their attack strategies over time.
  • RL policy debugging becomes possible by locating the exact learning transitions where action preferences change.
  • Phase-aware threat modelling can incorporate the identified behavioural shifts rather than treating the agent as a static black box.
  • Anticipatory defence planning can use the surfaced Q-value trajectories to predict likely next moves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered approach could be tested on RL agents in non-cyber domains such as robotic navigation to see whether POMDP and PER analysis transfers.
  • If the insights prove actionable, trainers might deliberately insert monitoring points at the identified transition moments to improve sample efficiency.
  • Defenders could feed the extracted phase information back into their own RL models to create more responsive countermeasures.

Load-bearing premise

That breaking down POMDP exploration-exploitation patterns and Q-value changes through prioritised experience replay will deliver genuinely new and usable insights into the agent's choices rather than descriptions already available from ordinary RL logs.

What would settle it

Run the framework on an RL agent in a known CyberBattleSim scenario, then compare the insights it generates against a simple log of visited states, rewards, and actions to check whether any previously hidden reasoning steps are actually revealed.

Figures

Figures reproduced from arXiv: 2505.11708 by Diksha Goel, Jeff Wang, Kristen Moore, Minjune Kim, Thanh Thi Nguyen.

Figure 1
Figure 1: Progressive discovery and control escalation by the RL-based attacker agent in the ToyCTF environment. (a–b) In the early stage, the agent initially controls only the Client node and discovers the adjacent Website node. (c–d) In the mid-stage, the agent uncovers additional nodes and expands its control via remote exploits and lateral movements. (e–f) In the late stage, the agent atta…
Figure 2
Figure 2: Cumulative rewards comparison of attacker policies across environments. The shaded region represents the standard deviation of cumulative rewards across training steps for each agent. (Results indicate that while Exploiting-DQL attains the highest rewards by leveraging a fixed, trained DQL policy, standard DQL is the most effective core agent, providing high cumulative rewards with stable and scalable lear…
Figure 3
Figure 3: Impact of Exploration Strategies on cumulative rewards across environments (Results show that Standard Exploration achieves broader state-space coverage, yielding higher rewards and facilitating more robust policy convergence). Panels: (a) CTF, (b) CC22, (c) CC100, (d) CC500.
Figure 4
Figure 4: Impact of Exploration Strategies on node discovery rate across environments (Results show that Standard Exploration consistently enables faster and broader discovery, facilitating more informed policy learning). Our benchmarking shows DQL as the most effective and scalable attacker policy across all environments. While the Exploiting DQL variant achieves marginally higher rewards, it leverages a trained DQ…
Figure 5
Figure 5: Cumulative rewards comparison in early vs. late attack phases across environments (Results show that agents achieve higher rewards in the late phase, indicating a progression from exploratory behaviour to exploitation-focused strategies as training advances). at lower cumulative rewards, while Standard Exploration achieved sustained improvement and a late-stage reward surge in CC500 after broad, systematic…
Figure 6
Figure 6: Emergence of action preferences via state-aggregated Q-values across episodes. The highlighted dominant action marks the shift from exploration to high-value, environment-specific tactics. DQL’s sharp Q-value gradients reflect discrete exploitation bursts as the agent locks onto effective attack sequences. Action indices for CyberBattleChain (1–15) and CTF (1–18) correspond to the x-axis and are listed in …
Figure 7
Figure 7: Temporal evolution of average TD-error (PER priority) across environments. Smaller environments (CTF, CC22) stabilise quickly, whereas larger networks (CC100, CC500) exhibit prolonged surges in TD-error, signalling persistent replay of unstable transitions. These spikes correspond to the performance degradation discussed in Section VI-E (“Unexpected PER Behaviour”). Panels: (a) CTF, (b) CC22, (c) CC100, (d) CC500.
Figure 8
Figure 8: Number of Key High-Priority States Across Environments. In smaller networks (CTF, CC22), the agent quickly converges to a small set of key states, indicating early policy convergence. In contrast, larger networks (CC100, CC500) exhibit broader, persistent distributions, reflecting prolonged exploration and slower convergence due to higher state-space complexity. replay of high-TD transitions. The irregular…
Figure 9
Figure 9: Reward progression of DQL and DQL+PER in CC100 and CC500 environments. DQL achieves higher cumulative rewards and steadier learning, while DQL+PER underperforms due to its bias toward high-TD-error transitions, which oversamples transient experiences and slows convergence.
Figure 10
Figure 10: Average TD-error of DQL and DQL+PER in CC100 and CC500. DQL learns more stably, whereas PER introduces early instability and biased convergence, leading to inferior rewards. lived behaviours rather than enduring strategic improvements. Continued oversampling of such transitions after their relevance has decayed injects noise into the replay buffer, disrupts value estimation, and impedes policy stabilisat…
Figure 11
Figure 11: Cumulative rewards comparison for the PPO algorithm in early vs. late attack phases across environments. Results show that PPO agent achieves higher rewards in the late phase, indicating a progression from exploratory behaviour to exploitation-focused strategies as training advances. Panels: (a) CTF: Episode 10, (b) CTF: Episode 25, (c) CC22: Episode 20, (d) CC22: Episode 35, (e) CC100: Episode 10, (f) CC100: Episode …
Figure 12
Figure 12: Emergence of action preferences through state-aggregated policy logits across episodes. The highlighted dominant action marks the shift from exploratory to high-value, environment-specific tactics. Compared with DQL, PPO’s clipped updates yield smoother yet consistent preference consolidation. Action indices for CyberBattleChain (1–15) and CTF (1–18) correspond to the x-axis and are listed in Table II and…
Figure 13
Figure 13: Explainable signatures of policy collapse. (a) Evolution of policy confidence (logit margin) and uncertainty (normalised entropy) over training. A sharp rise in confidence and collapse in entropy mark the transition to an over-confident, low-exploration regime. (b) Reward–entropy dynamics during training. Following the entropy collapse, reward stagnates, indicating behavioural failure despite high confide…
Figure 14
Figure 14: Joint evolution of task performance (normalised return), exploration (policy entropy), and decision confidence (policy sharpness) over training. the agent operates in a stable learning regime: high policy entropy sustains exploration, policy sharpness remains low, and task performance improves steadily. These signals evolve coherently, indicating effective optimisation. Between Episodes 19–26, a critica…
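The collapse signals named in the captions of Figures 13 and 14 (normalised entropy and logit margin) can be computed from a vector of policy logits roughly as follows. This is a sketch under two assumptions that the captions do not confirm: entropy is normalised by log(number of actions), and the logit margin is the gap between the largest and second-largest logit.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalised_entropy(logits):
    """Shannon entropy of softmax(logits) divided by log(n); lies in [0, 1]."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(logits))


def logit_margin(logits):
    """Assumed definition: largest logit minus second-largest logit."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


# Hypothetical logits: an exploratory policy and an over-confident one.
exploratory = [0.0, 0.0, 0.0, 0.0]
collapsed = [10.0, 0.0, 0.0, 0.0]
```

With these definitions, a uniform policy has normalised entropy 1 and margin 0, while a sharply peaked policy has entropy near 0 and a large margin, matching the qualitative pattern the captions describe.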
Original abstract

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are {predominantly} post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, {supporting use cases such as red-team simulation, RL policy debugging, phase-aware threat modelling and anticipatory defence planning.} By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a multi-layer explainability framework for RL-based cyber attacker agents. At the MDP level it models attacks as POMDPs to expose exploration-exploitation dynamics and phase-aware behavioural shifts; at the policy level it tracks temporal Q-value evolution and uses Prioritised Experience Replay (PER) to identify critical learning transitions. The framework is evaluated on CyberBattleSim environments of increasing complexity and is presented as agent- and environment-agnostic, yielding actionable behavioural intelligence for red-team simulation, policy debugging, phase-aware threat modelling and anticipatory defence.

Significance. If the claimed distinction from standard RL logging were demonstrated with quantitative metrics and baseline comparisons, the framework could meaningfully advance explainability in RL-driven cyber simulations, supporting practical uses in defensive preparedness. At present the absence of such evidence leaves the significance speculative.

major comments (2)
  1. [Abstract] Abstract: The central claim that the POMDP-level and PER-based analysis produces 'interpretable insights into agent behaviour at scale' and 'actionable behavioural intelligence' is unsupported by any reported quantitative metrics, interpretability scores, or comparisons against raw Q-tables, episode traces, or standard saliency methods.
  2. [Abstract] Abstract and Evaluation description: No baseline comparisons or actionability metrics are supplied to substantiate that the outputs differ from conventional RL instrumentation or that they enable previously unavailable use cases such as anticipatory defence planning.
minor comments (1)
  1. [Abstract] Abstract contains apparent LaTeX artifacts (curly braces around 'predominantly' and the long use-case sentence); these should be removed for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the quantitative support for our claims while preserving the framework's core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the POMDP-level and PER-based analysis produces 'interpretable insights into agent behaviour at scale' and 'actionable behavioural intelligence' is unsupported by any reported quantitative metrics, interpretability scores, or comparisons against raw Q-tables, episode traces, or standard saliency methods.

    Authors: We acknowledge that the current evaluation relies primarily on qualitative case studies across CyberBattleSim environments to illustrate phase-aware shifts and critical learning transitions. While these examples demonstrate distinctions from raw Q-tables and episode traces, we did not report formal interpretability scores or direct baseline comparisons. In the revised manuscript we will add a dedicated evaluation subsection that includes quantitative comparisons, such as the proportion of unique behavioral patterns surfaced by the multi-layer framework versus standard logging, and proxy metrics for insight density per episode. revision: yes

  2. Referee: [Abstract] Abstract and Evaluation description: No baseline comparisons or actionability metrics are supplied to substantiate that the outputs differ from conventional RL instrumentation or that they enable previously unavailable use cases such as anticipatory defence planning.

    Authors: The manuscript positions the framework as providing strategic (POMDP) and tactical (PER/Q-value) layers that conventional instrumentation does not combine. However, we agree that explicit substantiation through baselines and actionability metrics is needed to move beyond illustrative examples. We will revise the evaluation section to incorporate baseline comparisons against raw traces and saliency methods, along with discussion of how the identified phase shifts and transition points support anticipatory defence scenarios within the simulated environments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard RL constructs applied to new domain

Full rationale

The paper proposes a multi-layer framework that models cyberattacks as POMDPs to expose exploration-exploitation and phase shifts, then tracks temporal Q-value evolution via PER to surface learning transitions. These are established, externally defined RL techniques (POMDP formulation, Q-learning, prioritized replay) applied to CyberBattleSim environments. No derivation, equation, or central claim reduces by construction to a fitted parameter, self-referential definition, or self-citation chain. The claim of providing agent-agnostic, actionable insights is presented as an empirical outcome of the framework rather than a tautology. This is a self-contained proposal whose validity rests on external evaluation rather than internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard reinforcement learning modeling choices and the domain assumption that cyber attack sequences can be usefully represented as POMDPs; no new free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Cyberattacks can be modeled as Partially Observable Markov Decision Processes to expose exploration-exploitation dynamics and phase-aware behavioural shifts.
    Stated directly as the basis for the MDP-level layer in the abstract.
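To make the partial-observability assumption concrete, the toy sketch below hides every node the attacker has not yet reached. It is not CyberBattleSim's API: the `NetworkState` class, the `observe` function, the rule that owned nodes reveal their direct neighbours, and the "Database" node are all hypothetical (only the Client and Website node names come from the Figure 1 caption).

```python
from dataclasses import dataclass


@dataclass
class NetworkState:
    """Full (hidden) state of a toy network."""
    owned: dict  # node name -> True if the attacker controls it
    edges: dict  # node name -> list of directly reachable neighbours


def observe(state):
    """Partial observation: owned nodes plus their direct neighbours only."""
    visible = set()
    for node, is_owned in state.owned.items():
        if is_owned:
            visible.add(node)
            visible.update(state.edges.get(node, []))
    return {name: state.owned[name] for name in sorted(visible)}


# Early-stage situation: the attacker owns Client and can see Website,
# but the (hypothetical) Database node remains hidden.
state = NetworkState(
    owned={"Client": True, "Website": False, "Database": False},
    edges={"Client": ["Website"], "Website": ["Database"]},
)
obs = observe(state)
```

The agent acts on `obs`, not on `state`, which is what forces it to trade off exploring for new nodes against exploiting the ones it already sees.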

pith-pipeline@v0.9.0 · 5794 in / 1265 out tokens · 43627 ms · 2026-05-22T13:51:57.552616+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    C-MADF learns a structural causal model to restrict response actions in an MDP and uses dual blue-red RL policies to achieve 1.8% false-positive rate and 0.979 F1 on the CICIoT2023 dataset.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Defending active directory by combining neural network based dynamic program and evolutionary diversity optimisation,

    D. Goel, M. H. Ward-Graham, A. Neumann, F. Neumann, H. Nguyen, and M. Guo, “Defending active directory by combining neural network based dynamic program and evolutionary diversity optimisation,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’22, 2022, p. 1191–1199

  2. [2]

    Cyberbattlesim,

    Microsoft Defender Research Team, “Cyberbattlesim,” https://github.com/microsoft/cyberbattlesim, 2021, Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei

  3. [3]

    Cyber operations research gym,

    “Cyber operations research gym,” https://github.com/cage-challenge/CybORG, 2022, created by Maxwell Standen, David Bowman, Son Hoang, Toby Richer, Martin Lucas, Richard Van Tassel, Phillip Vu, Mitchell Kiely, KC C., Natalie Konschnik, Joshua Collyer

  4. [4]

    Scalable and Generalizable RL Agents for Attack Path Discovery via Continuous Invariant Spaces,

    F. Terranova, A. Lahmadi, and I. Chrisment, “Scalable and Generalizable RL Agents for Attack Path Discovery via Continuous Invariant Spaces,” in 2025 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID), Gold Coast, Australia, Oct. 2025, p. 18. [Online]. Available: https://hal.science/hal-05182437

  5. [5]

    Evolving reinforcement learning environment to minimize learner’s achievable reward: An application on hardening active directory systems,

    D. Goel, A. Neumann, F. Neumann, H. Nguyen, and M. Guo, “Evolving reinforcement learning environment to minimize learner’s achievable reward: An application on hardening active directory systems,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’23, 2023, p. 1348–1356

  6. [6]

    Enhancing network resilience through machine learning-powered graph combinatorial optimization: Applications in cyber defense and information diffusion,

    D. Goel, “Enhancing network resilience through machine learning-powered graph combinatorial optimization: Applications in cyber defense and information diffusion,” arXiv preprint arXiv:2310.10667, 2023

  7. [7]

    Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach,

    C.-Y. Wei and H. Luo, “Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach,” in Conference on learning theory. PMLR, 2021, pp. 4300–4354

  8. [8]

    Explainable ai (xai): Core ideas, techniques, and solutions,

    R. Dwivedi, D. Dave, H. Naik, S. Singhal, R. Omer, P. Patel, B. Qian, Z. Wen, T. Shah, G. Morgan et al., “Explainable ai (xai): Core ideas, techniques, and solutions,” ACM computing surveys, vol. 55, no. 9, pp. 1–33, 2023

  9. [9]

    Causal explanations for sequential decision-making in multi-agent systems,

    B. Gyevnar, C. Wang, C. G. Lucas, S. B. Cohen, and S. V. Albrecht, “Causal explanations for sequential decision-making in multi-agent systems,” arXiv preprint arXiv:2302.10809, 2023

  10. [10]

    Codex: A cluster-based method for explainable reinforcement learning,

    T. K. Mathes, J. Inman, A. Colón, and S. Khan, “Codex: A cluster-based method for explainable reinforcement learning,” arXiv preprint arXiv:2312.04216, 2023

  11. [11]

    Explainable reinforcement learning through a causal lens,

    P. Madumal, T. Miller, L. Sonenberg, and F. Vetere, “Explainable reinforcement learning through a causal lens,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 03, 2020, pp. 2493–2500

  12. [12]

    Causal explanations for sequential decision making,

    S. B. Nashed, S. Mahmud, C. V. Goldman, and S. Zilberstein, “Causal explanations for sequential decision making,” Journal of Artificial Intelligence Research, vol. 83, 2025

  13. [13]

    AIRS: Explanation for deep reinforcement learning-based security applications,

    J. Yu, W. Guo, Q. Qin, G. Wang, T. Wang, and X. Xing, “AIRS: Explanation for deep reinforcement learning-based security applications,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 7375–7392

  14. [14]

    Inroads into autonomous network defence using explained reinforcement learning,

    M. Foley, M. Wang, C. Hicks, V. Mavroudis et al., “Inroads into autonomous network defence using explained reinforcement learning,” arXiv preprint arXiv:2306.09318, 2023

  15. [15]

    Experiential explanations for reinforcement learning,

    A. Alabdulkarim, M. Singh, G. Mansi, K. Hall, and M. O. Riedl, “Experiential explanations for reinforcement learning,” arXiv preprint arXiv:2210.04723, 2022

  16. [16]

    Explainable artificial intelligence for cybersecurity,

    D. K. Sharma, J. Mishra, A. Singh, R. Govil, G. Srivastava, and J. C.-W. Lin, “Explainable artificial intelligence for cybersecurity,” Computers and Electrical Engineering, vol. 103, p. 108356, 2022

  17. [17]

    Evaluation of explainable artificial intelligence: Shap, lime, and cam,

    H. T. T. Nguyen, H. Q. Cao, K. V. T. Nguyen, and N. D. K. Pham, “Evaluation of explainable artificial intelligence: Shap, lime, and cam,” in Proceedings of the FPT AI Conference, 2021, pp. 1–6

  18. [18]

    Explainability of cybersecurity threats data using shap,

    R. Alenezi and S. A. Ludwig, “Explainability of cybersecurity threats data using shap,” in 2021 IEEE symposium series on computational intelligence (SSCI). IEEE, 2021, pp. 01–10

  19. [19]

    Interpreting agent behaviors in reinforcement-learning-based cyber-battle simulation platforms,

    J. Claypoole, S. Cheung, A. Gehani, V. Yegneswaran, and A. Ridley, “Interpreting agent behaviors in reinforcement-learning-based cyber-battle simulation platforms,” arXiv preprint arXiv:2506.08192, 2025

  20. [20]

    Nasim: Network attack simulator,

    J. Schwartz and H. Kurniawatti, “Nasim: Network attack simulator,” https://networkattacksimulator.readthedocs.io/, 2019

  21. [21]

    Network defense is not a game,

    A. Molina-Markham, R. K. Winder, and A. Ridley, “Network defense is not a game,” arXiv preprint arXiv:2104.10262, 2021

  22. [22]

    Entity-based reinforcement learning for autonomous cyber defence,

    I. S. Thompson, A. Caron, C. Hicks, and V. Mavroudis, “Entity-based reinforcement learning for autonomous cyber defence,” in Proceedings of the Workshop on Autonomous Cybersecurity, 2024, pp. 56–67

  23. [23]

    Optimizing cyber defense in dynamic active directories through reinforcement learning,

    D. Goel, K. Moore, M. Guo, D. Wang, M. Kim, and S. Camtepe, “Optimizing cyber defense in dynamic active directories through reinforcement learning,” in European Symposium on Research in Computer Security. Springer, 2024, pp. 332–352

  24. [24]

    Learning cyber defence tactics from scratch with multi-agent reinforcement learning,

    J. Wiebe, R. A. Mallah, and L. Li, “Learning cyber defence tactics from scratch with multi-agent reinforcement learning,” arXiv preprint arXiv:2310.05939, 2023

  25. [25]

    Autonomous network cyber offence strategy through deep reinforcement learning,

    M. Sultana, A. Taylor, and L. Li, “Autonomous network cyber offence strategy through deep reinforcement learning,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, vol. 11746. SPIE, 2021, pp. 490–502

  26. [26]

    Developing optimal causal cyber-defence agents via cyber security simulation,

    A. Andrew, S. Spillard, J. Collyer, and N. Dhir, “Developing optimal causal cyber-defence agents via cyber security simulation,” arXiv preprint arXiv:2207.12355, 2022

  27. [27]

    Autonomous cyber warfare agents: dynamic reinforcement learning for defensive cyber operations,

    D. A. Bierbrauer, R. M. Schabinger, C. Carlin, J. Mullin, J. A. Pavlik, and N. D. Bastian, “Autonomous cyber warfare agents: dynamic reinforcement learning for defensive cyber operations,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications V, vol. 12538. SPIE, 2023, pp. 42–56

  28. [28]

    Adaptive ϵ-greedy exploration in reinforcement learning,

    M. Tokic, “Adaptive ϵ-greedy exploration in reinforcement learning,” in Proceedings of the 22nd International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2010, pp. 243–250

  29. [29]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992

  30. [30]

    R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957

  31. [31]

    Prioritized Experience Replay

    T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016, arXiv:1511.05952. [Online]. Available: https://arxiv.org/abs/1511.05952

    Appendix: Intermediate Exploration Trajectory of the RL Attacker Under Partial Observability. To illustrate the ...