CSLE: A Reinforcement Learning Platform for Autonomous Security Management

Kim Hammar

arxiv: 2604.15590 · v1 · submitted 2026-04-16 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-10 10:17 UTC · model grok-4.3

keywords: reinforcement learning, security management, emulation, simulation, networked systems, autonomous control, Markov decision process, system modeling

The pith

CSLE combines emulation of real networks with simulation to train reinforcement learning agents that deliver near-optimal security management under conditions close to live operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CSLE as a platform that lets reinforcement learning develop security policies for networked systems while staying grounded in realistic conditions. It builds an emulation of the target system to capture its dynamics and create a mathematical model, then runs efficient simulations on that model to learn candidate strategies. Those strategies are tested and adjusted back in the emulation layer to reduce the difference between simulated and operational performance. The result is shown in four control problems where the learned policies reach near-optimal levels without requiring direct experiments on production networks. This matters because it gives a concrete route to adaptive, automated security that can be validated before deployment.

Core claim

CSLE consists of two systems: an emulation system that replicates key components of the target networked system in a virtualized environment, which is used to gather measurements and identify a system model such as a Markov decision process; and a simulation system where security strategies are learned efficiently through simulations of that model. Learned strategies are then evaluated and refined in the emulation system. Through four use cases in flow control, replication control, segmentation control, and recovery control, CSLE enables near-optimal security management in an environment that approximates an operational system.
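
The system-identification step can be illustrated with a minimal sketch, assuming the emulation logs have already been reduced to (state, action, next state) triples. The function name `estimate_transition_model`, the state and action encoding, and the sample log are all hypothetical and are not taken from CSLE; the estimator is a plain relative-frequency count, which may differ from the identification method the paper uses.

```python
from collections import Counter, defaultdict


def estimate_transition_model(transitions):
    """Estimate P(s' | s, a) by relative frequency from (s, a, s') samples."""
    counts = defaultdict(Counter)
    for state, action, next_state in transitions:
        counts[(state, action)][next_state] += 1

    model = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize the counts for this (state, action) pair into probabilities.
        model[key] = {s2: n / total for s2, n in next_counts.items()}
    return model


# Hypothetical log (not real CSLE data): state 0 = no intrusion, 1 = intrusion;
# action "C" = continue, "S" = stop.
log = [(0, "C", 0), (0, "C", 0), (0, "C", 1), (0, "C", 0), (1, "S", 0), (1, "C", 1)]
model = estimate_transition_model(log)
print(model[(0, "C")])
```

A count-based estimate like this only covers state-action pairs that actually appear in the log, so unseen pairs would need a prior or more data collection in the emulation.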

What carries the argument

The dual emulation-simulation architecture that identifies a system model from virtualized measurements and then learns and refines reinforcement learning strategies iteratively between simulation and emulation.
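
As a rough illustration of learning a strategy on an identified model, the sketch below runs value iteration on a toy two-state MDP. Value iteration is swapped in here for brevity and is not claimed to be the algorithm used in the paper; the states, actions, transition probabilities, and rewards are made up for the example.

```python
def q_value(s, a, V, P, R, gamma):
    """One-step lookahead: immediate reward plus discounted expected next value."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())


def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Return the optimal value function and a greedy policy for a finite MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V, P, R, gamma) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(s, a, V, P, R, gamma))
        for s in states
    }
    return V, policy


# Hypothetical toy model: state 0 = healthy, state 1 = compromised;
# action "C" = continue, action "S" = stop (for example, block a flow).
states, actions = [0, 1], ["C", "S"]
P = {(0, "C"): {0: 0.8, 1: 0.2}, (0, "S"): {0: 1.0},
     (1, "C"): {1: 1.0}, (1, "S"): {0: 1.0}}
R = {(0, "C"): 1.0, (0, "S"): 0.0, (1, "C"): -1.0, (1, "S"): 0.0}

V, policy = value_iteration(states, actions, P, R)
print(policy)
```

With these made-up numbers the greedy policy continues in the healthy state and stops in the compromised state. In a workflow like the one described here, such a policy would then be evaluated in the emulation rather than trusted on the strength of the simulation alone.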

If this is right

Security policies for networked systems can be developed and validated without direct risk to live production environments.
Reinforcement learning becomes practical for adaptive controls such as traffic flow, data replication, network segmentation, and system recovery.
The gap between theoretical reinforcement learning performance and real-world security outcomes can be narrowed through iterative refinement in emulation.
Multiple distinct security control tasks can be addressed within the same platform using the same core workflow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The platform could reduce the cost and time needed to test new security policies by limiting live-system exposure to only the final validation stage.
Accurate emulation of network dynamics appears central to successful transfer of reinforcement learning policies in security settings.
Similar dual-system designs might apply to other reinforcement learning domains where direct interaction with the real environment is costly or risky.

Load-bearing premise

The emulation system accurately replicates the behavior and dynamics of the target operational system so that strategies learned in simulation transfer without large performance loss.

What would settle it

Deploy the strategies learned through CSLE on a live operational networked system and measure whether their security performance matches the near-optimal results obtained inside the emulation environment or drops substantially.
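
One way to make this test concrete is to compare the mean episode return of the same strategy in both environments. The sketch below is an assumption about how such a comparison could be scored; the function name `performance_gap` and the sample returns are hypothetical and are not results from the paper.

```python
from statistics import mean


def performance_gap(emulation_returns, live_returns):
    """Compare mean returns of one strategy in the emulation and on a live system."""
    emu = mean(emulation_returns)
    live = mean(live_returns)
    if emu == 0:
        raise ValueError("relative gap is undefined when the emulation mean is zero")
    return {
        "emulation_mean": emu,
        "live_mean": live,
        "absolute_gap": emu - live,
        # Fraction of the emulation performance that is lost on the live system.
        "relative_gap": (emu - live) / abs(emu),
    }


# Made-up episode returns, for illustration only.
gap = performance_gap([10.0, 12.0, 11.0], [9.0, 9.5, 10.0])
print(gap)
```

A relative gap near zero would support the transfer premise, while a large positive value would indicate that the emulation overstates operational performance.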

Figures

Figures reproduced from arXiv: 2604.15590 by Kim Hammar.

**Figure 1.** Architectural overview of CSLE: a reinforcement learning platform for autonomous security management.

**Figure 2.** Autonomous and adaptive security management of a networked system as a reinforcement learning problem.

**Figure 3.** A digital twin in CSLE is a virtual replica of a target system that runs the same software and configuration, but on virtualized hardware. Moreover, the twin controls network delays and emulates actors to replicate operational workloads. The twin is used in CSLE for strategy evaluation and system identification.

**Figure 4.** The architecture of CSLE. It is a distributed platform with N servers (N = 6 in this example), which are connected through a database (the metastore) and a virtualization layer provided by Docker Swarm. CSLE has four interfaces: a Python API, a gRPC API, a REST API, and a command-line interface.

**Figure 5.** Time to deploy and clean up a digital twin in CSLE. Deploying the twin involves creating containers, attaching them to networks, configuring them, and starting management services. Cleanup involves stopping and deleting containers and networks. The time measurements were performed for a digital twin with a single network running on a server with a 24-core Intel Xeon Gold 2.10 GHz CPU and 768 GB RAM. Number…

**Figure 6.** Resource usage of two digital twins as a function of the client arrival rate. As expected, resource usage increases with the load imposed on the twins: higher client arrival rates lead to increased CPU utilization, since the twins must process a larger number of service requests. In contrast, memory usage remains stable as the load increases.

**Figure 7.** Monitoring system of a digital twin in CSLE. Emulated devices run monitoring agents that periodically push metrics to an event bus, which is consumed by pipelines that process the data and write to storage systems; the processed data is also used as input to automated control strategies to decide on control actions.

**Figure 9.** Target systems for the use cases in the experimental evaluation. The system configurations are available in Appendix B.

**Figure 10.** Convergence curves for the different use cases. The red curves relate to the performance in the simulations and the blue curves relate to the performance when evaluating the learned strategies in the digital twins. Curves show the mean values from evaluations with 5 random seeds; shaded areas indicate standard deviations. The x-axes indicate the training times in the simulations. The top row relates to si…

**Figure 11.** Performance comparison between reinforcement learning methods and baseline strategies in the digital twin. Curves show the mean values from evaluations with 5 random seeds; shaded areas indicate standard deviations. The top row relates to the decision-theoretic models. The bottom row relates to the game-theoretic models. The acronym FP stands for fictitious play.

**Figure 12.** Analysis of the sensitivity to model misspecification in the flow control use case. Numbers and error bars indicate the mean and the standard deviation from 5 evaluations.

**Figure 13.** Screenshot of the web interface to the management system in CSLE. The figure shows the page for configuring emulations (i.e., digital twins). In addition to this page, the web interface allows viewing reinforcement learning experiments, debugging learned security strategies, real-time monitoring of digital twins, management of simulations, and integration with large language models.

**Figure 14.** Screenshot of the web interface to the management system in CSLE. The figure shows the list of available pages.

**Figure 15.** Screenshot of the web interface of the strategy debugger in CSLE, which allows the user to interactively step through a POMDP episode. The left panel shows the defender’s view, with infrastructure statistics updated in real time. The right panel shows the attacker’s view, which consists of partial knowledge of the system under attack.

**Figure 16.** Estimated (smoothed) distributions of severe IDS alerts ∆x (top row), warning IDS alerts ∆y (middle row), and login attempts ∆z (bottom row) for the flow control POMDP. The distributions are estimated based on measurements from the digital twin.

**Figure 17.** Fitted Gaussian mixture models of z when no intrusion occurs ($s_t = 0$) and during intrusion ($s_t = 1$) for the flow control Markov game.

**Figure 18.** Transition function for the replication control MDP.

**Figure 19.** Empirical observation distributions for the recovery control use case based on measurements from the digital twin.

Original abstract

Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments and it is unclear how they generalize to operational systems. In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under realistic conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. Through these use cases, we show that CSLE enables near-optimal security management in an environment that approximates an operational system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CSLE, a reinforcement learning platform for autonomous security management consisting of an emulation system that replicates key components of a target networked system to collect measurements and identify a model such as an MDP, paired with a simulation system for efficient strategy learning; learned strategies are then evaluated and refined in the emulation environment. The platform is illustrated through four use cases (flow control, replication control, segmentation control, and recovery control), with the central claim that CSLE enables near-optimal security management in an environment that approximates an operational system.

Significance. If the emulation accurately replicates operational dynamics and policies transfer with minimal performance loss, the platform could meaningfully advance practical RL applications in security by providing a structured way to move beyond pure simulation. The dual emulation-simulation design and concrete use cases offer a reusable experimental framework that addresses a recognized sim-to-real gap in the field.

major comments (2)

[Abstract] Abstract: the claim that the use cases demonstrate 'near-optimal security management' is unsupported by any quantitative results, performance metrics, error bars, or baseline comparisons, rendering it impossible to verify whether the central claim holds.
[Use cases] Use cases section: no side-by-side empirical benchmarking is reported comparing any metric (e.g., latency distributions, attack success rates, or resource usage) between the emulated environment and an actual production or testbed instance of the same network; this validation is load-bearing for the assertion that strategies learned in simulation transfer to approximate operational performance.

minor comments (2)

[Architecture] The description of how the MDP is identified from emulation logs could be expanded with a concrete example or pseudocode to improve reproducibility.
[Use cases] Clarify the specific RL algorithms and reward functions employed in each use case, as these details are essential for assessing the 'near-optimal' characterization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating where revisions will be made to improve clarity and accuracy.

Point-by-point responses

Referee: [Abstract] Abstract: the claim that the use cases demonstrate 'near-optimal security management' is unsupported by any quantitative results, performance metrics, error bars, or baseline comparisons, rendering it impossible to verify whether the central claim holds.

Authors: We agree that the abstract's use of 'near-optimal' is not supported by summarized quantitative evidence and should be revised. The use cases section reports concrete performance metrics (e.g., attack mitigation rates and resource consumption) for policies transferred from simulation to emulation, along with comparisons against static baselines, but these details are not referenced in the abstract and no error bars or formal optimality analysis appear. We will revise the abstract to state that the use cases demonstrate effective security management strategies learned and validated in the emulation environment, removing the unsubstantiated 'near-optimal' phrasing. This change will be made in the next version.

revision: yes

Referee: [Use cases] Use cases section: no side-by-side empirical benchmarking is reported comparing any metric (e.g., latency distributions, attack success rates, or resource usage) between the emulated environment and an actual production or testbed instance of the same network; this validation is load-bearing for the assertion that strategies learned in simulation transfer to approximate operational performance.

Authors: We acknowledge that the manuscript does not include direct side-by-side benchmarking against a live production or external testbed instance. The emulation system is constructed from measurements collected on real networked systems to replicate key components and dynamics, and the use cases demonstrate that simulation-learned strategies transfer to this emulated setting with limited performance loss. However, we recognize that this does not constitute full validation on an operational network. We will add an explicit limitations paragraph in the discussion section noting the reliance on emulation as a proxy and outlining future work on testbed validation. This addresses the concern without altering the core platform description.

revision: partial

Circularity Check

0 steps flagged

No circularity detected in platform description or use-case demonstrations

Full rationale

The paper presents CSLE as an experimental platform with an emulation system for data collection and MDP identification, followed by simulation-based RL training and refinement in emulation. No mathematical derivations, equations, or first-principles predictions are claimed that reduce to inputs by construction. The use cases (flow control, replication control, etc.) are internal demonstrations of the platform's operation rather than fitted predictions or self-referential results. The approximation to operational systems is stated as a design goal without self-citation load-bearing or ansatz smuggling. The derivation chain is therefore self-contained as a tool description, with no steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated premise that the emulation faithfully captures operational dynamics and that the MDP model extracted from logs is sufficient for near-optimal policy learning. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

Domain assumption: Emulation of key system components produces logs and measurements representative of the target operational environment.
Invoked when the paper states that the emulation system is used to gather data for building the system model.

Domain assumption: A Markov decision process extracted from emulation logs is an adequate model for learning security strategies that transfer back to the emulation.
Stated implicitly when the simulation system learns strategies on the identified system model.

pith-pipeline@v0.9.0 · 5477 in / 1355 out tokens · 35908 ms · 2026-05-10T10:17:08.660560+00:00 · methodology
