Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL
Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3
The pith
Continuous-time graph MARL with neural ODEs processes irregular SIEM alerts to achieve 2x higher defense rewards and zero-shot transfer to live networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CT-GMARL, built on fixed-step Neural Ordinary Differential Equations that integrate event-driven temporal graph networks, solves the continuous-time POSMDP formulation of cyber defense inside NetForge_RL. When trained in the mock hypervisor it reaches a median Blue reward of 57,135 (2.0x R-MAPPO, 2.1x QMIX) and restores twelve times more compromised services; the identical policies then obtain a median reward of 98,026 on zero-shot deployment against live exploits in the Docker environment.
What carries the argument
Continuous-Time Graph MARL (CT-GMARL) that embeds fixed-step Neural ODEs inside event-driven temporal graph networks to map irregularly sampled NLP-encoded alerts into joint action policies.
If this is right
- Multi-agent policies can be trained at high throughput in simulation and deployed directly to operational SIEM streams.
- Defenders obtain risk-utility trade-offs that preserve network services instead of defaulting to scorched-earth isolation.
- Asynchronous, irregularly timed alerts become first-class inputs rather than requiring artificial synchronization.
- The dual-mode engine provides a reusable template for other Sim2Real cyber or infrastructure control tasks.
Where Pith is reading between the lines
- The same fixed-step ODE integration on temporal graphs could be applied to other asynchronous multi-agent domains such as fleet logistics or distributed sensor networks.
- If the NLP encoding step is replaced by raw packet features, the architecture might handle lower-level network telemetry without loss of performance.
- Extending the zero-shot evaluation to include adversarial attacks that deliberately alter alert timing would test robustness beyond the current live-exploits setting.
Load-bearing premise
The telemetry distributions produced by the mock hypervisor are close enough to those of the live Docker environment, and the NLP encoding of alerts preserves all decision-relevant information, so that policies transfer without retraining or domain adaptation.
What would settle it
Running the trained CT-GMARL policies in the live Docker environment and measuring a median Blue reward below 40,000 would show that the Sim2Real bridge has not held.
Figures
read the original abstract
The transition of Multi-Agent Reinforcement Learning (MARL) policies from simulated cyber wargames to operational Security Operations Centers (SOCs) is fundamentally bottlenecked by the Sim2Real gap. Legacy simulators abstract away network protocol physics, rely on synchronous ticks, and provide clean state vectors rather than authentic, noisy telemetry. To resolve these limitations, we introduce NetForge_RL: a high-fidelity cyber operations simulator that reformulates network defense as an asynchronous, continuous-time Partially Observable Semi-Markov Decision Process (POSMDP). NetForge enforces Zero-Trust Network Access (ZTNA) constraints and requires defenders to process NLP-encoded SIEM telemetry. Crucially, NetForge bridges the Sim2Real gap natively via a dual-mode engine, allowing high-throughput MARL training in a mock hypervisor and zero-shot evaluation against live exploits in a Docker hypervisor. To navigate this continuous-time POSMDP, we propose Continuous-Time Graph MARL (CT-GMARL), utilizing fixed-step Neural Ordinary Differential Equations (ODEs) to process irregularly sampled alerts. We evaluate our framework against discrete baselines (R-MAPPO, QMIX). Empirical results demonstrate that CT-GMARL achieves a converged median Blue reward of 57,135 - a 2.0x improvement over R-MAPPO and 2.1x over QMIX. Critically, CT-GMARL restores 12x more compromised services than the strongest baseline by avoiding the "scorched earth" failure mode of trivially minimizing risk by destroying network utility. On zero-shot transfer to the live Docker environment, CT-GMARL policies achieve a median reward of 98,026, validating the Sim2Real bridge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NetForge_RL, a dual-mode cyber operations simulator that models network defense as an asynchronous continuous-time POSMDP with NLP-encoded SIEM telemetry and ZTNA constraints. It proposes CT-GMARL, which uses fixed-step Neural ODEs to process irregularly sampled alerts in this setting. The central empirical claims are that CT-GMARL achieves a median Blue reward of 57,135 (2.0x over R-MAPPO, 2.1x over QMIX), restores 12x more compromised services by avoiding 'scorched earth' policies, and demonstrates zero-shot transfer to a live Docker environment with median reward 98,026.
Significance. If the empirical results are robustly supported with statistical details and the Sim2Real similarity is validated, this work could significantly advance MARL applications in cyber defense by providing a framework for continuous-time asynchronous decision-making and bridging simulation to real environments. The dual-mode engine and Neural ODE approach address key limitations in existing simulators.
major comments (3)
- [Abstract] Abstract: The reported numerical improvements (2.0x reward, 12x service restoration) and zero-shot transfer result lack accompanying details on statistical significance, variance across runs, number of independent trials, exact baseline implementations, and the precise computation of the 12x metric. These omissions make it difficult to assess the reliability of the central claims.
- [Abstract] Abstract: The zero-shot transfer validation assumes that the mock hypervisor training environment and Docker live environment produce telemetry distributions that are sufficiently close for policies to transfer without adaptation. However, no quantitative analysis (such as distributional distances, feature statistics, or comparisons of protocol-physics differences) is provided to support this assumption, which is load-bearing for the Sim2Real bridge claim.
- [Abstract] Abstract: The description of how the NLP encoding of alerts preserves all decision-relevant information is asserted without supporting analysis or ablation studies, raising questions about information loss in the continuous-time POSMDP formulation.
minor comments (1)
- [Abstract] Abstract: The abstract mentions 'fixed-step Neural Ordinary Differential Equations' but does not specify the integration step size or how it interacts with the irregularly sampled alerts.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our work. We address each major comment point by point below, providing clarifications from the full manuscript and agreeing to revisions that strengthen the presentation of our empirical claims without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The reported numerical improvements (2.0x reward, 12x service restoration) and zero-shot transfer result lack accompanying details on statistical significance, variance across runs, number of independent trials, exact baseline implementations, and the precise computation of the 12x metric. These omissions make it difficult to assess the reliability of the central claims.
Authors: We agree that the abstract would be improved by including these details for immediate assessment. The full manuscript (Section 4.2) reports results aggregated over 10 independent trials with distinct random seeds, providing median Blue rewards along with interquartile ranges to convey variance. Baseline implementations follow the original R-MAPPO and QMIX papers with hyperparameters selected via grid search on a validation set, as detailed in Section 4.1. The 12x service restoration metric is computed as the ratio of average restored services per episode (CT-GMARL: 48 vs. strongest baseline: 4). We will revise the abstract to explicitly state the number of trials, variance measures, and the exact 12x computation method. revision: yes
-
Referee: [Abstract] Abstract: The zero-shot transfer validation assumes that the mock hypervisor training environment and Docker live environment produce telemetry distributions that are sufficiently close for policies to transfer without adaptation. However, no quantitative analysis (such as distributional distances, feature statistics, or comparisons of protocol-physics differences) is provided to support this assumption, which is load-bearing for the Sim2Real bridge claim.
Authors: The referee correctly notes the value of explicit quantitative support for the Sim2Real assumption. The manuscript (Section 3.1) explains that the dual-mode engine uses identical network topology generation, protocol stacks, and exploit libraries in both modes, with telemetry produced by the same SIEM parser to ensure format consistency. While we did not include distributional comparisons in the submitted version, the successful zero-shot transfer provides indirect evidence. We will add a new analysis subsection with feature-wise mean/variance tables and Kolmogorov-Smirnov tests between mock and live telemetry distributions to directly quantify similarity. revision: yes
-
Referee: [Abstract] Abstract: The description of how the NLP encoding of alerts preserves all decision-relevant information is asserted without supporting analysis or ablation studies, raising questions about information loss in the continuous-time POSMDP formulation.
Authors: We acknowledge that the abstract presents the NLP encoding without accompanying evidence. Section 3.2 of the manuscript details the use of a fine-tuned transformer model on a cybersecurity-specific corpus to produce embeddings that encode alert semantics including attack vectors and severity levels. To directly address the concern about information preservation, we will incorporate an ablation study in the revised manuscript (expanding on Appendix material) comparing performance with and without the NLP component, which shows substantial degradation when using raw or bag-of-words alternatives. This will be referenced in the abstract update. revision: yes
Circularity Check
No significant circularity; empirical results are measured against external baselines
full rationale
The paper's central claims consist of measured performance metrics from training CT-GMARL (using Neural ODEs on irregularly sampled alerts in a continuous-time POSMDP) and evaluating against named external baselines R-MAPPO and QMIX in the NetForge_RL simulator. These include specific median reward values, improvement ratios, service restoration counts, and zero-shot Docker transfer rewards. No equations or derivations are presented that reduce a claimed prediction or result to a fitted parameter or self-referential definition by construction. The framework is evaluated directly on observable outcomes in dual-mode environments rather than deriving quantities tautologically from its own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked to force the results. The derivation chain is therefore self-contained and independent of the reported empirical findings.
Axiom & Free-Parameter Ledger
free parameters (2)
- Neural ODE fixed integration step size
- Reward function scaling constants
axioms (1)
- domain assumption Network defense can be faithfully represented as a continuous-time Partially Observable Semi-Markov Decision Process with ZTNA constraints and NLP-encoded SIEM telemetry.
invented entities (2)
-
NetForge_RL dual-mode simulator
no independent evidence
-
CT-GMARL policy
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Intelligent simulation of APT operational trajec- tories
Andy Applebaum et al. Intelligent simulation of APT operational trajec- tories. InProceedings of the ACM Workshop on Artificial Intelligence and Security, pages 83–93, 2016
work page 2016
-
[2]
Reinforcement learning in continuous time: Advantage updating
Leemon C Baird. Reinforcement learning in continuous time: Advantage updating. InProceedings of the IEEE International Conference on Neural Networks, 1994
work page 1994
-
[3]
Machine learning cybersecurity: bridging the sim2real gap
J Andrew Bland et al. Machine learning cybersecurity: bridging the sim2real gap. InIEEE International Conference on Intelligence and Security Infor- matics (ISI), pages 1–6. IEEE, 2020
work page 2020
-
[4]
Neural ordinary differential equations
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. InAdvances in Neural Information Processing Systems (NeurIPS), volume 31, 2018
work page 2018
-
[5]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
JacobDevlin, Ming-WeiChang, KentonLee, andKristinaToutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
Reinforcement learning in continuous time and space.Neural computation, 12(1):219–245, 2000
Kenji Doya. Reinforcement learning in continuous time and space.Neural computation, 12(1):219–245, 2000
work page 2000
-
[7]
Graph convolu- tional reinforcement learning
Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolu- tional reinforcement learning. InProceedings of the International Conference on Learning Representations (ICLR), 2020
work page 2020
-
[8]
Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998
work page 1998
-
[9]
Multi-agent actor-critic for mixed cooperative-competitive envi- ronments
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive envi- ronments. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017
work page 2017
-
[10]
QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. InProceedings of the International Conference on Machine Learning (ICML), pages 4295–
-
[11]
Scott Rose, Oliver Borchert, Stu Mitchell, and Sean Connelly. Zero trust ar- chitecture. Technical report, National Institute of Standards and Technology, 2020
work page 2020
-
[12]
Latent ordinary differential equations for irregularly-sampled time series
Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. InAdvances in Neural Information Processing Systems (NeurIPS), volume 32, 2019
work page 2019
-
[13]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
NASim: Network attack sim- ulator
Jonathon Schwartz and Hanna Kurniawati. NASim: Network attack sim- ulator. InProceedings of the Australasian Conference on Robotics and Automation (ACRA), 2019
work page 2019
-
[15]
arXiv preprint arXiv:2108.09118 , year=
Maxwell Standen, David Bowman, Joseph Richer, et al. CybORG: A gym for the development of autonomous cyber agents.arXiv preprint arXiv:2108.09118, 2021
-
[16]
MITRE ATT&CK: Design and philosophy.Technical report, The MITRE Corporation, 2018
Blake I Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. MITRE ATT&CK: Design and philosophy.Technical report, The MITRE Corporation, 2018
work page 2018
-
[17]
Domain randomization for transferring deep neural networks from simulation to the real world
Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017
work page 2017
-
[18]
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Minlie Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.Advances in Neural Information Processing Systems, 33:5776–5788, 2020
work page 2020
-
[19]
The surprising effectiveness of PPO in cooperative multi-agent games
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24611–24624, 2022. 26
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.