arxiv: 2507.15163 · v4 · submitted 2025-07-21 · 📡 eess.SY · cs.CR · cs.SY

Adaptive Network Security Policies via Belief Aggregation and Rollout

Pith reviewed 2026-05-19 04:50 UTC · model grok-4.3

classification 📡 eess.SY · cs.CR · cs.SY
keywords network security policies · adaptive policies · particle filtering · rollout · feature-based aggregation · model-based methods · cybersecurity · policy adaptation

The pith

Network security policies adapt quickly to changes by updating a system model and using particle filtering, feature-based aggregation, and rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method for computing and adapting network security policies that responds to evolving vulnerabilities and shifting conditions. It relies on an updatable model or simulator combined with particle filtering to estimate beliefs about the system state, feature-based aggregation to optimize policies offline in a scalable manner, and rollout to adjust those policies online without repeating the full offline computation. The approach includes an analysis of the aggregation approximation error and conditions under which rollout achieves efficient adaptation. It is tested in simulations and testbeds, showing better performance than prior reinforcement learning methods on benchmarks including CAGE-2. A reader would care because it promises a way to automate reliable security updates with some theoretical backing in dynamic environments.

Core claim

The central claim is that security policies can be computed scalably offline via feature-based aggregation on a system model and then adapted online through rollout when the model is updated for changes in conditions or vulnerabilities, with belief estimation performed by particle filtering. The aggregation approximation error is analyzed, and rollout is shown to adapt policies efficiently under certain conditions without needing to repeat offline optimization. This combination yields a method that is scalable, provides theoretical guarantees, and adapts faster than standard reinforcement learning approaches lacking such assurances, as demonstrated in simulations and testbed results outperfr

What carries the argument

The three-part framework of particle filtering for belief estimation, feature-based aggregation for scalable offline policy computation, and rollout for online adaptation to model updates.
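The belief-estimation component can be illustrated with a minimal bootstrap particle filter. This is a sketch under stated assumptions, not the paper's implementation: the two-state node model (0 = healthy, 1 = compromised), the probabilities `P_ATTACK` and `P_ALERT`, and every function name are hypothetical stand-ins for the system model or simulator.

```python
import random

P_ATTACK = 0.1               # hypothetical per-step compromise probability
P_ALERT = {0: 0.2, 1: 0.8}   # hypothetical P(alert | state); 0 = healthy, 1 = compromised

def transition_sample(state, action, rng):
    """Sample the next state from the (assumed) system model."""
    if action == "recover":
        return 0
    return 1 if state == 1 or rng.random() < P_ATTACK else 0

def obs_likelihood(alert, state):
    """Probability of the observed alert flag given the state."""
    return P_ALERT[state] if alert else 1.0 - P_ALERT[state]

def particle_filter_update(particles, action, alert, rng):
    """Propagate particles through the model, weight them, and resample."""
    propagated = [transition_sample(s, action, rng) for s in particles]
    weights = [obs_likelihood(alert, s) for s in propagated]
    return rng.choices(propagated, weights=weights, k=len(particles))

def compromise_belief(particles):
    """Estimated probability that the node is compromised."""
    return sum(particles) / len(particles)

rng = random.Random(0)
particles = [0] * 1000                      # start certain that the node is healthy
for _ in range(3):                          # three consecutive alerts while waiting
    particles = particle_filter_update(particles, "wait", True, rng)
belief_after_alerts = compromise_belief(particles)
particles = particle_filter_update(particles, "recover", False, rng)
belief_after_recovery = compromise_belief(particles)
```

In this toy model, repeated alerts push the estimated compromise probability up, and a recovery action resets it, because every propagated particle moves to the healthy state.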

If this is right

  • Feature-based aggregation makes offline policy optimization scalable for large networks.
  • Rollout adapts policies online to system model changes without repeating the full offline computation.
  • The approximation error introduced by aggregation can be analyzed and bounded theoretically.
  • The overall method provides performance guarantees and faster adaptation than reinforcement learning baselines under the stated conditions.
  • Empirical validation on benchmarks including CAGE-2 shows outperformance relative to state-of-the-art methods.
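The rollout idea in the bullets above can be sketched as a one-step lookahead: try each first action in the simulator, then follow a fixed base policy and keep the action with the lowest simulated cost. This is a minimal sketch under stated assumptions: it uses a fully observed two-state toy model instead of beliefs, and `make_step`, `base_policy`, the costs, and the horizon are all hypothetical. The paper's rollout operates on beliefs and uses the base policy computed offline through aggregation.

```python
import random

ACTIONS = ("wait", "recover")

def make_step(p_attack, recover_cost):
    """Build a hypothetical one-node simulator; state 0 = healthy, 1 = compromised."""
    def step(state, action, rng):
        cost = (2.0 if state == 1 else 0.0) + (recover_cost if action == "recover" else 0.0)
        if action == "recover":
            return 0, cost
        next_state = 1 if state == 1 or rng.random() < p_attack else 0
        return next_state, cost
    return step

def base_policy(state):
    """Stand-in for the offline policy; here it never intervenes."""
    return "wait"

def rollout_action(state, step, rng, horizon=10, num_sims=200, gamma=0.95):
    """Return the first action with the lowest simulated cost when followed by the base policy."""
    def simulate(first_action):
        s, a, total, discount = state, first_action, 0.0, 1.0
        for _ in range(horizon):
            s, cost = step(s, a, rng)
            total += discount * cost
            discount *= gamma
            a = base_policy(s)
        return total
    estimates = {a: sum(simulate(a) for _ in range(num_sims)) / num_sims for a in ACTIONS}
    return min(estimates, key=estimates.get), estimates

# The base policy is identical in both calls; only the simulator is updated.
original_model = make_step(p_attack=0.1, recover_cost=1.0)
updated_model = make_step(p_attack=0.1, recover_cost=20.0)
action_before, _ = rollout_action(1, original_model, random.Random(0))
action_after, _ = rollout_action(1, updated_model, random.Random(0))
```

With the cheap recovery cost, the lookahead overrides the passive base policy and recovers the compromised node. After the model update makes recovery expensive, the same procedure chooses to wait, without any offline recomputation.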

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The structure might transfer to other model-based control problems where maintaining an updatable simulator is feasible, such as resource allocation under uncertainty.
  • Frequent model updates could reduce reliance on purely online learning loops in large-scale adaptive systems.
  • Similar aggregation-plus-rollout patterns may help in approximate dynamic programming settings beyond security.

Load-bearing premise

A sufficiently accurate model or simulator of the network exists and can be updated when operational conditions or vulnerabilities change, allowing the particle filter and rollout steps to operate without large model mismatch.

What would settle it

A testbed experiment after a vulnerability change where the rollout-adapted policy shows no meaningful performance improvement or requires re-optimization time comparable to full offline recomputation.

Figures

Figures reproduced from arXiv: 2507.15163 by Dimitri Bertsekas, Emil C. Lupu, Kim Hammar, Tansu Alpcan, Yuchao Li.

Figure 1: Our method for computing adaptive network security policies. A base
Figure 2: Frequency of change in networked systems [
Figure 3: Most common causes of outages in networked systems [
Figure 4: Architecture of the networked system in the example use case.
Figure 5: Feature-based belief aggregation: we map the state space
Figure 6: One-step transition diagram of the aggregate ˜
Figure 7: Illustration of the scalar $\epsilon$ of Prop. 1. The illustration is based on an approximation with three representative feature beliefs: $\tilde{Q} = \{\tilde{q}, \tilde{q}', \tilde{q}''\}$. The corresponding belief space partitions are $S_{\tilde{q}}$, $S_{\tilde{q}'}$, and $S_{\tilde{q}''}$; cf. Eq. (14). A special case of interest, which we refer to as identity aggregation, is when each state is mapped to a unique feature state and vice versa. In this case, the cost fun…
Figure 9: Observation distribution for each replica
Figure 8: The three computational layers of our method.
Figure 13: Performance of rollout when applied to the
Figure 10: Comparison between the theoretical error bound in ˜
Figure 11: Number of representative feature beliefs [cf.
Figure 12: Comparison between the optimal cost function ˜
Figure 15: Policy adaptation time after a system change in our testbed (
Figure 16: Results from the testbed evaluation. The time-to-recovery and
Figure 18: Policy adaptation time after a system change in
Figure 17: The CAGE-2 benchmark problem [27]: compute a security policy µ to protect a system against an attacker while maintaining services for clients. Methods for comparison. Over 35 methods have been evaluated against the CAGE-2 benchmark. We compare our method against the current state-of-the-art methods, namely: CARDIFF [28] and C-POMCP [43, Alg. 1]. We also compare it against four baseline methods: PPO [57, …
Original abstract

Evolving security vulnerabilities and shifting operational conditions require frequent updates to network security policies. These updates include adjustments to incident response procedures and modifications to access controls, among others. Reinforcement learning methods have been proposed for automating such policy adaptations, but most methods in the research literature lack performance guarantees and adapt slowly to changes. In this paper, we address these limitations and present a method for computing security policies that is scalable, offers theoretical guarantees, and adapts quickly to changes. The method uses a model or simulator of the system, which is updated when changes occur, and combines three components: belief estimation through particle filtering, offline policy computation through feature-based aggregation, and online policy adaptation through rollout. In particular, feature-based aggregation enables scalable offline optimization of a policy, while rollout adapts the policy online to changes in the system model without repeating the offline optimization. We analyze the approximation error of the aggregation and show that the rollout efficiently adapts policies to changes under certain conditions. Simulations and testbed results demonstrate that our method outperforms state-of-the-art methods on several benchmarks, including CAGE-2.
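To make the offline step concrete, the following sketch solves a small aggregate problem over a grid of representative beliefs. It is an illustration under strong assumptions, not the paper's algorithm: the model has two states, so a belief is a single number; the features are the identity; each belief is mapped to its nearest representative; and `P_ATTACK`, `P_ALERT`, the costs, and all function names are hypothetical.

```python
GAMMA = 0.95
ACTIONS = ("wait", "recover")
P_ATTACK = 0.1               # hypothetical per-step compromise probability
P_ALERT = {0: 0.2, 1: 0.8}   # hypothetical P(alert | state); 0 = healthy, 1 = compromised

def predict(b, action):
    """Probability of compromise after the transition, before the observation."""
    return 0.0 if action == "recover" else b + (1.0 - b) * P_ATTACK

def correct(b_pred, alert):
    """Bayes update; returns (posterior belief, probability of the observation)."""
    like1 = P_ALERT[1] if alert else 1.0 - P_ALERT[1]
    like0 = P_ALERT[0] if alert else 1.0 - P_ALERT[0]
    p_obs = b_pred * like1 + (1.0 - b_pred) * like0
    return b_pred * like1 / p_obs, p_obs

def cost(b, action):
    """Expected stage cost: 2 per step while compromised, 1 per recovery."""
    return 2.0 * b + (1.0 if action == "recover" else 0.0)

def nearest(b, reps):
    """Index of the representative belief closest to b (the aggregation map)."""
    return min(range(len(reps)), key=lambda i: abs(reps[i] - b))

def q_value(q, action, J, reps):
    """Cost of taking `action` at representative belief q, then following J."""
    b_pred = predict(q, action)
    total = cost(q, action)
    for alert in (False, True):
        posterior, p_obs = correct(b_pred, alert)
        total += GAMMA * p_obs * J[nearest(posterior, reps)]
    return total

def solve_aggregate(reps, iters=500):
    """Value iteration on the aggregate problem defined over `reps`."""
    J = [0.0] * len(reps)
    for _ in range(iters):
        J = [min(q_value(q, a, J, reps) for a in ACTIONS) for q in reps]
    policy = [min(ACTIONS, key=lambda a: q_value(q, a, J, reps)) for q in reps]
    return J, policy

reps = [i / 10 for i in range(11)]   # 11 representative beliefs on [0, 1]
J, policy = solve_aggregate(reps)
```

The computation touches only the representative beliefs, so its cost grows with the size of the grid and not with the number of reachable beliefs. A policy obtained this way could then serve as the base policy for rollout.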

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a method for adaptive network security policies that integrates particle filtering for belief estimation, feature-based aggregation for scalable offline policy optimization, and rollout for online adaptation to model changes. It claims to analyze the approximation error of the aggregation step and to show that rollout enables efficient adaptation under certain conditions, while outperforming state-of-the-art methods on benchmarks including CAGE-2. The approach relies on an updatable system model or simulator.

Significance. If the claimed error bounds and adaptation conditions hold, the work provides a principled model-based alternative to purely data-driven RL for network security, offering scalability through offline aggregation and rapid online updates via rollout. The empirical results on CAGE-2 strengthen the case for practical utility in evolving threat environments.

major comments (1)
  1. [Abstract and theoretical analysis of approximation error and rollout adaptation] The analysis of aggregation approximation error and the conditions for efficient rollout adaptation (as described in the abstract and the method overview) are derived under the assumption that the simulator exactly matches the true dynamics. No separate robustness or sensitivity analysis is provided for model mismatch arising from approximate updates to new vulnerabilities or traffic shifts; this directly affects the validity of the particle-filter beliefs and rollout value estimates and is load-bearing for the central performance guarantees.
minor comments (2)
  1. The abstract states that simulations and testbed results demonstrate outperformance but provides no details on the specific metrics, number of runs, or statistical tests used; adding these would improve clarity.
  2. Clarify how feature-based aggregation is constructed (e.g., choice of features and basis functions) to make the scalability claim more transparent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract and theoretical analysis of approximation error and rollout adaptation] The analysis of aggregation approximation error and the conditions for efficient rollout adaptation (as described in the abstract and the method overview) are derived under the assumption that the simulator exactly matches the true dynamics. No separate robustness or sensitivity analysis is provided for model mismatch arising from approximate updates to new vulnerabilities or traffic shifts; this directly affects the validity of the particle-filter beliefs and rollout value estimates and is load-bearing for the central performance guarantees.

    Authors: We agree that our theoretical analysis of the aggregation approximation error and the rollout adaptation is conducted under the assumption that the simulator exactly matches the true system dynamics. This assumption is explicit in our model-based framework, where the simulator is updated to reflect changes in vulnerabilities or traffic. The particle filtering is used to maintain beliefs under uncertainty, and the rollout is shown to adapt the policy efficiently when the model is updated. While we do not provide a separate sensitivity analysis for residual model mismatch after updates, the empirical results on CAGE-2 and other benchmarks demonstrate practical performance even in realistic settings. We will revise the manuscript to explicitly state this assumption in the abstract and method overview and add a brief discussion on the implications of model mismatch for future work.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external model and independent error analysis

full rationale

The paper presents a composite method (particle filtering for belief estimation, feature-based aggregation for offline policy, rollout for online adaptation) whose central claims rest on an external updatable simulator and a separate theoretical analysis of aggregation approximation error plus rollout adaptation conditions. No equations or steps reduce by construction to fitted inputs from the same data, self-definitions, or unverified self-citation chains. Benchmarks such as CAGE-2 supply external validation. This is the expected honest non-finding for a method grounded in approximate dynamic programming with stated modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of an updatable system model and conditions under which rollout adaptation remains efficient; these are domain assumptions rather than derived results.

axioms (1)
  • domain assumption: A model or simulator of the network exists that can be updated when changes occur and is accurate enough for particle filtering and rollout to function.
    Explicitly invoked in the abstract description of the method components and adaptation process.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On-Line Policy Iteration with Trajectory-Driven Policy Generation

    eess.SY 2026-04 unverdicted novelty 6.0

    An online policy iteration algorithm produces monotonically cost-improving policies for a fixed initial state by training successive policies on trajectory data generated by prior policies.

  2. On-Line Policy Iteration with Trajectory-Driven Policy Generation

    eess.SY 2026-04 unverdicted novelty 6.0

    An online policy iteration algorithm produces a sequence of monotonically cost-improving policies for fixed-initial-state deterministic control by training each new policy on the trajectory generated by the prior one.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    An intrusion-detection model,

    D. E. Denning, “An intrusion-detection model,” IEEE Transactions on Software Engineering, vol. SE-13, no. 2, pp. 222–232, 1987

  2. [2]

    A policy-based security architecture for software-defined networks,

    V. Varadharajan, K. Karmakar, U. Tupakula, and M. Hitchens, “A policy-based security architecture for software-defined networks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 4, pp. 897–912, 2019

  3. [3]

    Deep reinforcement learning for cyber security,

    T. T. Nguyen and V. J. Reddi, “Deep reinforcement learning for cyber security,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 3779–3795, 2023

  4. [4]

    The mayhem cyber reasoning system,

    T. Avgerinos, D. Brumley, J. Davis, R. Goulden, T. Nighswander, A. Rebert, and N. Williamson, “The mayhem cyber reasoning system,” IEEE Security & Privacy, vol. 16, no. 2, pp. 52–60, 2018

  5. [5]

    RRE: A game-theoretic intrusion response and recovery engine,

    S. A. Zonouz, H. Khurana, W. H. Sanders, and T. M. Yardley, “RRE: A game-theoretic intrusion response and recovery engine,” in 2009 IEEE/IFIP International Conference on Dependable Systems & Networks, 2009, pp. 439–448

  6. [6]

    PentestGPT: Evaluating and harnessing large language models for automated penetration testing,

    G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, and S. Rass, “PentestGPT: Evaluating and harnessing large language models for automated penetration testing,” in 33rd USENIX Security Symposium (USENIX Security 24). Philadelphia, PA: USENIX Association, Aug. 2024, pp. 847–864

  7. [7]

    When moving target defense meets attack prediction in digital twins: A convolutional and hierarchical reinforcement learning approach,

    T. Zhang, C. Xu, Y. Lian, H. Tian, J. Kang, X. Kuang, and D. Niyato, “When moving target defense meets attack prediction in digital twins: A convolutional and hierarchical reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 10, pp. 3293–3305, 2023

  8. [8]

    Intrusion tolerance for networked systems through two-level feedback control,

    K. Hammar and R. Stadler, “Intrusion tolerance for networked systems through two-level feedback control,” in 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2024, pp. 338–352

  9. [9]

    SyzVegas: Beating kernel fuzzing odds with reinforcement learning,

    D. Wang, Z. Zhang, H. Zhang, Z. Qian, S. V. Krishnamurthy, and N. Abu-Ghazaleh, “SyzVegas: Beating kernel fuzzing odds with reinforcement learning,” in 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, Aug. 2021, pp. 2741–2758

  10. [10]

    Dynamic Programming and Optimal Control: Vol. II

    D. Bertsekas, Dynamic Programming and Optimal Control: Vol. II, 4th ed. Athena Scientific Belmont, 2012

  11. [11]

    Rollout, Policy Iteration, and Distributed Reinforcement Learning

    D. Bertsekas, Rollout, Policy Iteration, and Distributed Reinforcement Learning. Athena Scientific, 2021

  12. [12]

    On-line policy improvement using Monte-Carlo search,

    G. Tesauro and G. Galperin, “On-line policy improvement using Monte-Carlo search,” in Advances in Neural Information Processing Systems, M. Mozer, M. Jordan, and T. Petsche, Eds., vol. 9. MIT Press, 1996

  13. [13]

    Discretized approximations for POMDP with average cost,

    H. Yu and D. Bertsekas, “Discretized approximations for POMDP with average cost,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, ser. UAI ’04. Arlington, Virginia, USA: AUAI Press, 2004, pp. 619–627

  14. [14]

    On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces,

    N. Saldi, S. Yüksel, and T. Linder, “On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces,” Math. Oper. Res., vol. 42, no. 4, pp. 945–978, Nov. 2017

  15. [15]

    Feature-based aggregation and deep reinforcement learning: A survey and some new implementations,

    D. Bertsekas, “Feature-based aggregation and deep reinforcement learning: A survey and some new implementations,” IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 1, pp. 1–31, 2019

  16. [16]

    A Course in Reinforcement Learning

    D. Bertsekas, A Course in Reinforcement Learning. Athena Scientific, 2025, 2nd edition

  17. [17]

    Biased aggregation, rollout, and enhanced policy improvement for reinforcement learning,

    D. Bertsekas, “Biased aggregation, rollout, and enhanced policy improvement for reinforcement learning,” 2019, https://arxiv.org/abs/1910.02426

  18. [18]

    Online policy adaptation for networked systems using rollout,

    F. S. Samani, K. Hammar, and R. Stadler, “Online policy adaptation for networked systems using rollout,” in NOMS 2024-2024 IEEE Network Operations and Management Symposium, 2024, pp. 1–9

  19. [19]

    Reinforcement learning based approach for flip attack detection,

    H. Liu, Y. Li, J. Mårtensson, L. Xie, and K. H. Johansson, “Reinforcement learning based approach for flip attack detection,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 3212–3217

  20. [20]

    Rollout approach to sensor scheduling for remote state estimation under integrity attack,

    H. Liu, Y. Li, K. H. Johansson, J. Mårtensson, and L. Xie, “Rollout approach to sensor scheduling for remote state estimation under integrity attack,” Automatica, vol. 144, p. 110473, 2022

  21. [21]

    Adaptive security response strategies through conjectural online learning,

    K. Hammar, T. Li, R. Stadler, and Q. Zhu, “Adaptive security response strategies through conjectural online learning,” IEEE Transactions on Information Forensics and Security, vol. 20, pp. 4055–4070, 2025

  22. [22]

    Rollout-based charging strategy for electric trucks with hours-of-service regulations,

    T. Bai, Y. Li, K. H. Johansson, and J. Mårtensson, “Rollout-based charging strategy for electric trucks with hours-of-service regulations,” IEEE Control Systems Letters, vol. 7, pp. 2167–2172, 2023

  23. [23]

    Adaptive aggregation methods for infinite horizon dynamic programming,

    D. Bertsekas and D. Castanon, “Adaptive aggregation methods for infinite horizon dynamic programming,” IEEE Transactions on Automatic Control, vol. 34, no. 6, pp. 589–598, 1989

  24. [24]

    Network Security: A Decision and Game-Theoretic Approach

    T. Alpcan and T. Basar, Network Security: A Decision and Game-Theoretic Approach, 1st ed. USA: Cambridge University Press, 2010

  25. [25]

    Optimal security response to network intrusions in IT systems,

    K. Hammar, “Optimal security response to network intrusions in IT systems,” Ph.D. dissertation, KTH Royal Institute of Technology, 2024

  26. [26]

    Towards the deployment of realistic autonomous cyber network defence: A systematic review,

    S. Vyas, V. Mavroudis, and P. Burnap, “Towards the deployment of realistic autonomous cyber network defence: A systematic review,” ACM Comput. Surv., May 2025

  27. [27]

    TTCP CAGE challenge 2,

    CAGE, “TTCP CAGE challenge 2,” in AAAI-22 Workshop on Artificial Intelligence for Cyber Security (AICS), 2022, https://github.com/cage-challenge/cage-challenge-2

  28. [28]

    Automated cyber defence: A review,

    S. Vyas, J. Hannay, A. Bolton, and P. P. Burnap, “Automated cyber defence: A review,” 2023, https://arxiv.org/abs/2303.04926, code: https://github.com/john-cardiff/-cyborg-cage-2

  29. [29]

    Reward shaping for happier autonomous cyber security agents,

    E. Bates, V. Mavroudis, and C. Hicks, “Reward shaping for happier autonomous cyber security agents,” in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, ser. AISec ’23, New York, NY, USA, 2023, pp. 221–232

  30. [30]

    Beyond CAGE: Investigating generalization of learned autonomous network defense policies,

    M. Wolk, A. Applebaum, C. Dennler, P. Dwyer, M. Moskowitz, H. Nguyen, N. Nichols, N. Park, P. Rachwalski, F. Rau, and A. Webster, “Beyond CAGE: Investigating generalization of learned autonomous network defense policies,” 2022

  31. [31]

    Autonomous network defence using reinforcement learning,

    M. Foley, C. Hicks, K. Highnam, and V. Mavroudis, “Autonomous network defence using reinforcement learning,” in Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, ser. ASIA CCS ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1252–1254

  32. [32]

    Inroads into autonomous network defence using explained reinforcement learning,

    M. Foley, M. Wang, Z. M, C. Hicks, and V. Mavroudis, “Inroads into autonomous network defence using explained reinforcement learning,” 2023, https://arxiv.org/abs/2306.09318

  33. [33]

    Enhancing cybersecurity in industrial control system with autonomous defense using normalized proximal policy optimization model,

    S. Xu, Z. Xie, C. Zhu, X. Wang, and L. Shi, “Enhancing cybersecurity in industrial control system with autonomous defense using normalized proximal policy optimization model,” in 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2023, pp. 928–935

  34. [34]

    RICE: breaking through the training bottlenecks of reinforcement learning with explanation,

    Z. Cheng, X. Wu, J. Yu, S. Yang, G. Wang, and X. Xing, “RICE: breaking through the training bottlenecks of reinforcement learning with explanation,” in Proceedings of the 41st International Conference on Machine Learning, 2024

  35. [35]

    Structural generalization in autonomous cyber incident response with message-passing neural networks and reinforcement learning,

    J. Nyberg and P. Johnson, “Structural generalization in autonomous cyber incident response with message-passing neural networks and reinforcement learning,” in 2024 IEEE International Conference on Cyber Security and Resilience (CSR), 2024, pp. 282–289

  36. [36]

    An empirical game-theoretic analysis of autonomous cyber-defence agents,

    G. Palmer, L. Swaby, D. J. B. Harrold, M. Stewart, A. Hiles, C. Willis, I. Miles, and S. Farmer, “An empirical game-theoretic analysis of autonomous cyber-defence agents,” 2025, https://arxiv.org/abs/2501.19206

  37. [37]

    Neuroevolution for autonomous cyber defense,

    K. Heckel, “Neuroevolution for autonomous cyber defense,” in Proceedings of the Companion Conference on Genetic and Evolutionary Computation, ser. GECCO ’23 Companion. New York, NY, USA: Association for Computing Machinery, 2023, pp. 651–654

  38. [38]

    A method of network attack-defense game and collaborative defense decision-making based on hierarchical multi-agent reinforcement learning,

    Y. Tang, J. Sun, H. Wang, J. Deng, L. Tong, and W. Xu, “A method of network attack-defense game and collaborative defense decision-making based on hierarchical multi-agent reinforcement learning,” Computers & Security, vol. 142, p. 103871, 2024

  39. [39]

    Learning cyber defence tactics from scratch with multi-agent reinforcement learning,

    J. Wiebe, R. A. Mallah, and L. Li, “Learning cyber defence tactics from scratch with multi-agent reinforcement learning,” 2023, https://arxiv.org/abs/2310.05939

  40. [40]

    Hierarchical multi-agent reinforcement learning for cyber network defense,

    A. V. Singh, E. Rathbun, E. Graham, L. Oakley, S. Boboila, A. Oprea, and P. Chin, “Hierarchical multi-agent reinforcement learning for cyber network defense,” 2024, https://arxiv.org/abs/2410.17351

  41. [41]

    Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games,

    Y. Yan, Y. Zhang, and K. Huang, “Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games,” 2024, https://arxiv.org/html/2403.17674v1

  42. [42]

    Design of an autonomous cyber defence agent using hybrid AI models,

    J. F. Loevenich, E. Adler, R. Mercier, A. Velazquez, and R. R. F. Lopes, “Design of an autonomous cyber defence agent using hybrid AI models,” in 2024 International Conference on Military Communication and Information Systems (ICMCIS), 2024, pp. 1–10

  43. [43]

    Optimal defender strategies for CAGE-2 using causal modeling and tree search,

    K. Hammar, N. Dhir, and R. Stadler, “Optimal defender strategies for CAGE-2 using causal modeling and tree search,” 2024, https://arxiv.org/abs/2407.11070

  44. [44]

    General autonomous cybersecurity defense: Learning robust policies for dynamic topologies and diverse attackers,

    A. Ramamurthy and N. Dhir, “General autonomous cybersecurity defense: Learning robust policies for dynamic topologies and diverse attackers,” 2025, https://arxiv.org/abs/2506.22706

  45. [45]

    Leveraging large language models for autonomous cyber defense: Insights from CAGE-2 simulations,

    H. Mohammadi, J. J. Davis, and M. Kiely, “Leveraging large language models for autonomous cyber defense: Insights from CAGE-2 simulations,” IEEE Intelligent Systems, pp. 1–8, 2025

  46. [46]

    2020 DevOps trends survey,

    Atlassian and C. Research, “2020 DevOps trends survey,” 2020, https://www.atlassian.com/whitepapers/devops-survey-2020

  47. [47]

    2024 observability forecast report,

    N. relic and E. T. R. (ETR), “2024 observability forecast report,” 2024

  48. [48]

    Optimal control of Markov processes with incomplete state information,

    K. J. Åström, “Optimal control of Markov processes with incomplete state information,” Journal of Mathematical Analysis and Applications, vol. 10, no. 1, pp. 174–205, 1965

  49. [49]

    Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing

    V. Krishnamurthy, Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing. Cambridge University Press, 2016

  50. [50]

    A survey of convergence results on particle filtering methods for practitioners,

    D. Crisan and A. Doucet, “A survey of convergence results on particle filtering methods for practitioners,” IEEE Transactions on Signal Processing, vol. 50, no. 3, pp. 736–746, 2002

  51. [51]

    Feature-based belief aggregation for partially observable Markov decision problems,

    Y. Li, K. Hammar, and D. Bertsekas, “Feature-based belief aggregation for partially observable Markov decision problems,” 2025, https://arxiv.org/abs/2507.04646

  52. [52]

    Reinforcement Learning and Optimal Control

    D. Bertsekas, Reinforcement Learning and Optimal Control. Athena Scientific, 2019

  53. [53]

    Software for the paper “Adaptive Network Security Policies via Belief Aggregation and Rollout”

    K. Hammar, “Software for the paper ‘Adaptive Network Security Policies via Belief Aggregation and Rollout’,” 2025, the software and data are available at https://github.com/Limmen/rollout aggregation and https://github.com/Limmen/csle

  54. [54]

    Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control

    D. Bertsekas, Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific, 2022

  55. [55]

    CVE database,

    The MITRE Corporation, “CVE database,” 2022, https://cve.mitre.org/

  56. [56]

    CWE list,

    The MITRE Corporation, “CWE list,” 2023, https://cwe.mitre.org/index.html

  57. [57]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, 2017, http://arxiv.org/abs/1707.06347

  58. [58]

    Phasic policy gradient,

    K. W. Cobbe, J. Hilton, O. Klimov, and J. Schulman, “Phasic policy gradient,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol

  59. [59]

    2020–2027

    PMLR, 18–24 Jul 2021, pp. 2020–2027

  60. [60]

    Human-level control through deep reinforcement learning,

    V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015

  61. [61]

    Monte-Carlo planning in large POMDPs,

    D. Silver and J. Veness, “Monte-Carlo planning in large POMDPs,” in Advances in Neural Information Processing Systems, vol. 23, 2010

  62. [62]

    Approximate solution methods for partially observable Markov and semi-Markov decision processes,

    H. Yu, “Approximate solution methods for partially observable Markov and semi-Markov decision processes,” Ph.D. dissertation, Massachusetts Institute of Technology, USA, 2006

  63. [63]

    Stable-baselines3: Reliable reinforcement learning implementations,

    A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021

  64. [64]

    CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms

    S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, and J. G. M. Araújo, “CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms.” Journal of Machine Learning Research, vol. 23, pp. 274:1–274:18, 2022

  65. [65]

    MITRE ATT&CK: Design and philosophy,

    B. E. Strom, A. Applebaum, D. P. Miller, K. C. Nickels, A. G. Pennington, and C. B. Thomas, “MITRE ATT&CK: Design and philosophy,” in Technical report. The MITRE Corporation, 2018

  66. [66]

    End-to-end internet packet dynamics,

    V. Paxson, “End-to-end internet packet dynamics,” in IEEE/ACM Transactions on Networking, 1997, pp. 277–292