arXiv: 2603.16969 · v2 · submitted 2026-03-17 · cs.CR · cs.AI · cs.LG

DeepStage: Learning Autonomous Defense Policies Against Multi-Stage APT Campaigns

Trung V. Phan, Tri Gia Nguyen, Thomas Bauschert

Pith reviewed 2026-05-15 10:23 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.LG

keywords: multi-stage APT, deep reinforcement learning, provenance graphs, stage estimation, autonomous cyber defense, PPO agent, MITRE ATT&CK, POMDP

The pith

DeepStage learns stage-aware defense policies by estimating APT progression from provenance graphs to guide reinforcement learning actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates enterprise cyber defense as a POMDP in which host provenance and network telemetry are fused into unified provenance graphs. A graph neural network encoder paired with an LSTM stage estimator infers probabilistic attacker stages aligned with the MITRE ATT&CK framework. These stage beliefs, together with the graph embeddings, then drive a hierarchical PPO agent that chooses actions across monitoring, access control, containment, and remediation. In an enterprise testbed running CALDERA APT playbooks, the method records an average F1-score of 0.887 and a mitigation success rate of 84.7%, exceeding a risk-aware DRL baseline. The results matter because they indicate that explicit stage tracking can produce more accurate and effective autonomous responses to phased attacks than policies that do not track stages.

Core claim

DeepStage fuses host provenance and network telemetry into unified provenance graphs and applies a GNN encoder with LSTM-based stage estimation to produce probabilistic attacker stage beliefs. These beliefs guide a hierarchical PPO agent in selecting defense actions, yielding higher F1-scores and mitigation success than a risk-aware DRL baseline in CALDERA-driven testbed experiments.

What carries the argument

The LSTM-based stage estimator that converts GNN embeddings of unified provenance graphs into probabilistic stage beliefs aligned with MITRE ATT&CK to condition the hierarchical PPO policy.

If this is right

Defense actions can be chosen dynamically according to the inferred current stage of an ongoing APT campaign.
Stage-aware policies improve both detection accuracy and overall mitigation success compared with stage-agnostic baselines.
Fusing host and network telemetry into single provenance graphs supplies the observability needed for the POMDP formulation to work.
Hierarchical PPO enables cost-efficient responses by restricting aggressive actions to later attack stages.
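
The last point, restricting aggressive actions to later stages, can be illustrated with a hand-written gate over a stage belief vector. This is only a sketch: the stage names, the `MIN_STAGE` mapping, the `allowed_actions` function, and the 0.5 threshold are all assumptions made for illustration, and the paper's hierarchical PPO agent presumably learns such behavior rather than hard-coding it. Only the four action categories come from the abstract.

```python
# Hypothetical stage list; the paper's actual stage set is not given on this page.
STAGES = ["reconnaissance", "initial_access", "lateral_movement", "exfiltration"]

# Assumed mapping: earliest stage index at which each action category is permitted.
MIN_STAGE = {
    "monitoring": 0,
    "access_control": 1,
    "containment": 2,
    "remediation": 2,
}


def allowed_actions(stage_belief, threshold=0.5):
    """Return the action categories permitted under a stage belief.

    stage_belief: probabilities over STAGES, summing to 1.
    An action is allowed when the total belief that the attack is at or
    beyond the action's minimum stage reaches `threshold`.
    """
    if len(stage_belief) != len(STAGES):
        raise ValueError("stage_belief must have one entry per stage")
    if abs(sum(stage_belief) - 1.0) > 1e-6:
        raise ValueError("stage_belief must sum to 1")
    allowed = []
    for action, min_stage in MIN_STAGE.items():
        mass_at_or_beyond = sum(stage_belief[min_stage:])
        if mass_at_or_beyond >= threshold:
            allowed.append(action)
    return allowed


# Early-stage belief: only the cheapest category passes the gate.
print(allowed_actions([0.7, 0.2, 0.05, 0.05]))  # ['monitoring']
```

A learned policy could use the same idea as an action mask, so that costly containment or remediation is only sampled once the belief has shifted toward later stages.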

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same stage-estimation approach could be tested on other sequential threats such as ransomware or supply-chain attacks.
If the estimator proves robust to unseen tactics, the framework could reduce reliance on human analysts for routine containment decisions.
Adding more telemetry modalities to the graph construction might further improve stage inference accuracy in diverse environments.

Load-bearing premise

The CALDERA-driven APT playbooks executed in the testbed produce stage progressions and provenance patterns that are representative enough of real-world multi-stage attacks for the measured performance to transfer.

What would settle it

Execute DeepStage in a live enterprise network against actual multi-stage APT campaigns and measure whether the F1-score and mitigation success remain at or above the reported testbed levels.

Figures

Figures reproduced from arXiv: 2603.16969 by Thomas Bauschert, Tri Gia Nguyen, Trung V. Phan.

**Figure 2.** Cost–effectiveness frontiers illustrating normalized security gain

**Figure 4.** Defense responsiveness over APT stage transitions.

Original abstract

This paper presents DeepStage, a deep reinforcement learning (DRL) framework for adaptive and stage-aware defense against Advanced Persistent Threats (APTs). The enterprise environment is formulated as a partially observable Markov decision process (POMDP), in which host provenance and network telemetry are fused into unified provenance graphs. Building on our prior work (StageFinder), DeepStage employs a graph neural network encoder and an LSTM-based stage estimator to infer probabilistic attacker stages aligned with the MITRE ATT&CK framework. The resulting stage beliefs, together with graph embeddings, are used to guide a hierarchical Proximal Policy Optimization (PPO) agent that selects defense actions across monitoring, access control, containment, and remediation. Experiments in a realistic enterprise testbed with CALDERA-driven APT playbooks show that DeepStage achieves an average F1-score of 0.887 and a mitigation success rate of 84.7%, outperforming a risk-aware DRL baseline by 21.8% in F1-score and 16.2% in mitigation success. The results demonstrate effective stage-aware and cost-efficient autonomous cyber defense.
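
The abstract does not say whether the 21.8% and 16.2% gains are relative improvements or absolute percentage-point differences. The short sketch below, with F1 expressed on a 0 to 100 scale for convenience, shows the baseline value each reading would imply. The function name `implied_baseline` is illustrative, and nothing here comes from the paper beyond the four reported numbers.

```python
def implied_baseline(reported_pct, gain_pct):
    """Return the baseline implied by a reported score and a stated gain.

    Both arguments are in percent. Two readings of "gain" are computed:
    relative: reported = baseline * (1 + gain/100)
    absolute: reported = baseline + gain (percentage points)
    """
    relative = reported_pct / (1.0 + gain_pct / 100.0)
    absolute = reported_pct - gain_pct
    return relative, absolute


f1_rel, f1_abs = implied_baseline(88.7, 21.8)   # F1-score of 0.887
mit_rel, mit_abs = implied_baseline(84.7, 16.2)  # mitigation success of 84.7%

print(f"F1 baseline: {f1_rel:.1f}% (relative) or {f1_abs:.1f}% (absolute)")
print(f"Mitigation baseline: {mit_rel:.1f}% (relative) or {mit_abs:.1f}% (absolute)")
```

Under the relative reading the baseline F1 would be about 0.728, and under the absolute reading about 0.669; for mitigation success the two readings give about 72.9% and 68.5%. The full text of the paper would be needed to tell which reading applies.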

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeepStage, a deep reinforcement learning framework for autonomous defense against multi-stage APT campaigns. It models the enterprise environment as a POMDP, fuses host provenance and network telemetry into graphs, uses a GNN encoder and LSTM stage estimator aligned with MITRE ATT&CK to infer attacker stages, and employs a hierarchical PPO agent for selecting defense actions. Experiments in a CALDERA-driven testbed report an average F1-score of 0.887 and 84.7% mitigation success rate, outperforming a risk-aware DRL baseline by 21.8% and 16.2% respectively.

Significance. If the empirical results hold under rigorous validation, this work would advance autonomous cyber defense by showing how stage-aware inference via GNN-LSTM can be integrated with hierarchical RL to improve mitigation in partially observable enterprise settings. The approach builds explicitly on prior StageFinder work and provides concrete performance gains in a testbed environment.

major comments (2)

[Abstract and §5 (Experiments)] The reported performance figures (F1-score 0.887, mitigation success 84.7%, +21.8% / +16.2% over baseline) are presented without any description of experimental design details such as hyperparameter search ranges, number of runs, error bars, data exclusion criteria, or statistical significance testing. This directly undermines assessment of whether the numbers support the central claim of effective stage-aware defense.
[§4 (Testbed and Evaluation)] The entire set of results derives from a single enterprise testbed driven by CALDERA APT playbooks. No cross-validation against real-world provenance traces, alternative frameworks, or sensitivity analysis to variations in stealth/lateral movement patterns is provided, leaving the representativeness assumption untested and load-bearing for the POMDP, GNN, LSTM, and PPO components.

minor comments (2)

[§3.1] Clarify the exact fusion mechanism for provenance graphs and network telemetry in §3.1; the description is high-level and could benefit from a diagram or pseudocode.
[§3.3] Add a table summarizing the action space (monitoring, access control, containment, remediation) with costs and effects to improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on experimental rigor and generalizability. We address each major comment below and will incorporate revisions to strengthen the presentation of our results.

Point-by-point responses

Referee: [Abstract and §5 (Experiments)] The reported performance figures (F1-score 0.887, mitigation success 84.7%, +21.8% / +16.2% over baseline) are presented without any description of experimental design details such as hyperparameter search ranges, number of runs, error bars, data exclusion criteria, or statistical significance testing. This directly undermines assessment of whether the numbers support the central claim of effective stage-aware defense.

Authors: We agree that the manuscript would benefit from more detailed reporting of the experimental protocol. In the revised version, we will expand §5 (and update the abstract) to specify: the hyperparameter search ranges and method (grid search with validation) for the GNN, LSTM, and hierarchical PPO; that all metrics are averaged over 10 independent random seeds with standard deviations and error bars; that no data were excluded; and the results of paired t-tests (with p-values) confirming statistical significance of the reported gains over the baseline. These additions will directly address the concern and allow readers to better assess the reliability of the 0.887 F1 and 84.7% mitigation figures. revision: yes
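
The paired test proposed in this response can be sketched as follows, assuming one score per random seed for each method, matched by seed. The per-seed values below are invented placeholders, not results from the paper, and the function name `paired_t_statistic` and the small critical-value table are illustrative. The sketch stops at the t statistic and a comparison against the standard two-sided 5% critical value; it does not compute a p-value.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t for a few degrees of freedom.
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262}


def paired_t_statistic(method_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must be matched pair by pair")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Placeholder per-seed scores for 10 seeds, invented purely for illustration.
method = [0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79, 0.80, 0.80]
baseline = [0.70, 0.71, 0.69, 0.70, 0.72, 0.70, 0.69, 0.71, 0.70, 0.68]

t_value, dof = paired_t_statistic(method, baseline)
significant = abs(t_value) > T_CRIT_95[dof]
print(f"t = {t_value:.2f}, df = {dof}, significant at 5%: {significant}")
```

Pairing by seed matters here: it removes seed-to-seed variance shared by both methods, which an unpaired comparison of the two means would leave in.
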
Referee: [§4 (Testbed and Evaluation)] The entire set of results derives from a single enterprise testbed driven by CALDERA APT playbooks. No cross-validation against real-world provenance traces, alternative frameworks, or sensitivity analysis to variations in stealth/lateral movement patterns is provided, leaving the representativeness assumption untested and load-bearing for the POMDP, GNN, LSTM, and PPO components.

Authors: We acknowledge that a single testbed constitutes a genuine limitation for claims of broad applicability. In the revision we will add (i) a sensitivity analysis subsection in §4 that systematically varies stealth probability and lateral-movement speed within the CALDERA environment and reports the resulting performance ranges, and (ii) an explicit limitations paragraph discussing the single-testbed constraint. Cross-validation against real-world provenance traces, however, cannot be performed at present because no suitably labeled public datasets exist that match the required host-plus-network granularity and MITRE ATT&CK stage annotations; we will therefore note this as future work rather than claim it has been done. revision: partial

standing simulated objections not resolved

Cross-validation against real-world provenance traces is not feasible due to the absence of publicly available, appropriately labeled datasets.

Circularity Check

1 step flagged

One minor self-citation to prior StageFinder work that is not load-bearing for the empirical claims.

specific steps

self citation load bearing [Abstract]
"Building on our prior work (StageFinder), DeepStage employs a graph neural network encoder and an LSTM-based stage estimator to infer probabilistic attacker stages aligned with the MITRE ATT&CK framework."

The stage estimator is imported from the authors' own prior paper, yet the headline performance figures are produced by running the full system against CALDERA-generated traces in the testbed; the citation does not reduce the reported F1 or mitigation gains to the citation itself.

full rationale

The paper's central claims consist of measured F1-score (0.887) and mitigation success rate (84.7%) obtained from explicit testbed experiments using CALDERA playbooks. These are not quantities derived by definition or by fitting parameters to the same data. The single self-citation to StageFinder supports the LSTM stage estimator module but does not justify the performance numbers or force them by construction. The POMDP formulation, GNN encoder, and hierarchical PPO are standard components whose outputs are evaluated externally in the testbed rather than being tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The framework relies on standard machine-learning components whose training involves many free parameters; no new physical entities are postulated.

free parameters (1)

GNN, LSTM, and PPO hyperparameters
Standard training knobs whose specific values are not reported in the abstract but are required for the reported performance.

axioms (1)

domain assumption: Enterprise networks and APT activity can be usefully modeled as a POMDP whose observations are provenance graphs.
Explicitly stated as the problem formulation in the abstract.
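
For readers unfamiliar with the POMDP view, the textbook way to track a hidden attacker stage is a discrete Bayes filter over stage beliefs. The sketch below shows one update step for a fixed defender action. It is a generic illustration and not the paper's method: DeepStage uses a learned GNN encoder and LSTM estimator in place of an explicit filter, and the `belief_update` function, the transition matrix, and the observation likelihoods here are all assumed for the example.

```python
def belief_update(belief, transition, likelihood):
    """One step of a discrete Bayes filter over hidden stages.

    belief[i]        : P(stage = i) before the new observation
    transition[i][j] : P(next stage = j | current stage = i)
    likelihood[j]    : P(observation | stage = j)
    Returns the normalized posterior belief over stages.
    """
    n = len(belief)
    # Prediction: push the belief through the stage transition model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Correction: weight each stage by how well it explains the observation.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under every stage")
    return [u / total for u in unnormalized]


# Two hypothetical stages; the attacker starts in stage 0 with certainty.
prior = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]  # assumed: stage 1 is absorbing
likelihood = [0.2, 0.8]                # assumed: observation favors stage 1

posterior = belief_update(prior, transition, likelihood)
print(posterior)
```

The posterior shifts most of the probability mass to stage 1, which is the kind of signal a stage-aware policy would condition on.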

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

The enterprise environment is formulated as a partially observable Markov decision process (POMDP), in which host provenance and network telemetry are fused into unified provenance graphs... graph neural network encoder and an LSTM-based stage estimator... hierarchical Proximal Policy Optimization (PPO) agent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
