pith. machine review for the scientific record. sign in

arxiv: 2603.16969 · v2 · submitted 2026-03-17 · 💻 cs.CR · cs.AI· cs.LG

Recognition: 1 theorem link

· Lean Theorem

DeepStage: Learning Autonomous Defense Policies Against Multi-Stage APT Campaigns

Authors on Pith no claims yet

Pith reviewed 2026-05-15 10:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG
keywords multi-stage APTdeep reinforcement learningprovenance graphsstage estimationautonomous cyber defensePPO agentMITRE ATT&CKPOMDP
0
0 comments X

The pith

DeepStage learns stage-aware defense policies by estimating APT progression from provenance graphs to guide reinforcement learning actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates enterprise cyber defense as a POMDP where host provenance and network telemetry are fused into graphs. A graph neural network encoder paired with an LSTM stage estimator infers probabilistic attacker stages aligned to the MITRE ATT&CK framework. These stage beliefs then drive a hierarchical PPO agent that chooses actions across monitoring, containment, and remediation. In a realistic testbed running CALDERA APT playbooks, the method records an average F1-score of 0.887 and 84.7 percent mitigation success, exceeding a risk-aware baseline. A reader would care because the results indicate that explicit stage tracking can produce more accurate and effective autonomous responses to phased attacks than non-stage-aware policies.

Core claim

DeepStage fuses provenance graphs and applies a GNN encoder with LSTM-based stage estimation to produce probabilistic attacker stage beliefs that guide a hierarchical PPO agent in selecting defense actions, yielding higher F1-scores and mitigation success than risk-aware baselines in CALDERA-driven testbed experiments.

What carries the argument

The LSTM-based stage estimator that converts GNN embeddings of unified provenance graphs into probabilistic stage beliefs aligned with MITRE ATT&CK to condition the hierarchical PPO policy.

If this is right

  • Defense actions can be chosen dynamically according to the inferred current stage of an ongoing APT campaign.
  • Stage-aware policies improve both detection accuracy and overall mitigation success compared with stage-agnostic baselines.
  • Fusing host and network telemetry into single provenance graphs supplies the observability needed for the POMDP formulation to work.
  • Hierarchical PPO enables cost-efficient responses by restricting aggressive actions to later attack stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stage-estimation approach could be tested on other sequential threats such as ransomware or supply-chain attacks.
  • If the estimator proves robust to unseen tactics, the framework could reduce reliance on human analysts for routine containment decisions.
  • Adding more telemetry modalities to the graph construction might further improve stage inference accuracy in diverse environments.

Load-bearing premise

The CALDERA-driven APT playbooks run in the testbed produce stage progressions and provenance patterns that are representative enough of real-world multi-stage attacks for the measured performance to transfer.

What would settle it

Execute DeepStage in a live enterprise network against actual multi-stage APT campaigns and measure whether the F1-score and mitigation success remain at or above the reported testbed levels.

Figures

Figures reproduced from arXiv: 2603.16969 by Thomas Bauschert, Tri Gia Nguyen, Trung V. Phan.

Figure 1
Figure 1. Figure 1: Data and control flow of the proposed DeepStage framework. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cost–effectiveness frontiers illustrating normalized security gain [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Defense responsiveness over APT stage transitions. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

This paper presents DeepStage, a deep reinforcement learning (DRL) framework for adaptive and stage-aware defense against Advanced Persistent Threats (APTs). The enterprise environment is formulated as a partially observable Markov decision process (POMDP), in which host provenance and network telemetry are fused into unified provenance graphs. Building on our prior work (StageFinder), DeepStage employs a graph neural network encoder and an LSTM-based stage estimator to infer probabilistic attacker stages aligned with the MITRE ATT&CK framework. The resulting stage beliefs, together with graph embeddings, are used to guide a hierarchical Proximal Policy Optimization (PPO) agent that selects defense actions across monitoring, access control, containment, and remediation. Experiments in a realistic enterprise testbed with CALDERA-driven APT playbooks show that DeepStage achieves an average F1-score of 0.887 and a mitigation success rate of 84.7%, outperforming a risk-aware DRL baseline by 21.8% in F1-score and 16.2% in mitigation success. The results demonstrate effective stage-aware and cost-efficient autonomous cyber defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeepStage, a deep reinforcement learning framework for autonomous defense against multi-stage APT campaigns. It models the enterprise environment as a POMDP, fuses host provenance and network telemetry into graphs, uses a GNN encoder and LSTM stage estimator aligned with MITRE ATT&CK to infer attacker stages, and employs a hierarchical PPO agent for selecting defense actions. Experiments in a CALDERA-driven testbed report an average F1-score of 0.887 and 84.7% mitigation success rate, outperforming a risk-aware DRL baseline by 21.8% and 16.2% respectively.

Significance. If the empirical results hold under rigorous validation, this work would advance autonomous cyber defense by showing how stage-aware inference via GNN-LSTM can be integrated with hierarchical RL to improve mitigation in partially observable enterprise settings. The approach builds explicitly on prior StageFinder work and provides concrete performance gains in a testbed environment.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Experiments): The reported performance figures (F1-score 0.887, mitigation success 84.7%, +21.8% / +16.2% over baseline) are presented without any description of experimental design details such as hyperparameter search ranges, number of runs, error bars, data exclusion criteria, or statistical significance testing. This directly undermines assessment of whether the numbers support the central claim of effective stage-aware defense.
  2. [§4] §4 (Testbed and Evaluation): The entire set of results derives from a single enterprise testbed driven by CALDERA APT playbooks. No cross-validation against real-world provenance traces, alternative frameworks, or sensitivity analysis to variations in stealth/lateral movement patterns is provided, leaving the representativeness assumption untested and load-bearing for the POMDP, GNN, LSTM, and PPO components.
minor comments (2)
  1. [§3.1] Clarify the exact fusion mechanism for provenance graphs and network telemetry in §3.1; the description is high-level and could benefit from a diagram or pseudocode.
  2. [§3.3] Add a table summarizing the action space (monitoring, access control, containment, remediation) with costs and effects to improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on experimental rigor and generalizability. We address each major comment below and will incorporate revisions to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Experiments): The reported performance figures (F1-score 0.887, mitigation success 84.7%, +21.8% / +16.2% over baseline) are presented without any description of experimental design details such as hyperparameter search ranges, number of runs, error bars, data exclusion criteria, or statistical significance testing. This directly undermines assessment of whether the numbers support the central claim of effective stage-aware defense.

    Authors: We agree that the manuscript would benefit from more detailed reporting of the experimental protocol. In the revised version, we will expand §5 (and update the abstract) to specify: the hyperparameter search ranges and method (grid search with validation) for the GNN, LSTM, and hierarchical PPO; that all metrics are averaged over 10 independent random seeds with standard deviations and error bars; that no data were excluded; and the results of paired t-tests (with p-values) confirming statistical significance of the reported gains over the baseline. These additions will directly address the concern and allow readers to better assess the reliability of the 0.887 F1 and 84.7% mitigation figures. revision: yes

  2. Referee: [§4] §4 (Testbed and Evaluation): The entire set of results derives from a single enterprise testbed driven by CALDERA APT playbooks. No cross-validation against real-world provenance traces, alternative frameworks, or sensitivity analysis to variations in stealth/lateral movement patterns is provided, leaving the representativeness assumption untested and load-bearing for the POMDP, GNN, LSTM, and PPO components.

    Authors: We acknowledge that a single testbed constitutes a genuine limitation for claims of broad applicability. In the revision we will add (i) a sensitivity analysis subsection in §4 that systematically varies stealth probability and lateral-movement speed within the CALDERA environment and reports the resulting performance ranges, and (ii) an explicit limitations paragraph discussing the single-testbed constraint. Cross-validation against real-world provenance traces, however, cannot be performed at present because no suitably labeled public datasets exist that match the required host-plus-network granularity and MITRE ATT&CK stage annotations; we will therefore note this as future work rather than claim it has been done. revision: partial

standing simulated objections not resolved
  • Cross-validation against real-world provenance traces is not feasible due to the absence of publicly available, appropriately labeled datasets.

Circularity Check

1 steps flagged

One minor self-citation to prior StageFinder work that is not load-bearing for the empirical claims.

specific steps
  1. self citation load bearing [Abstract]
    "Building on our prior work (StageFinder), DeepStage employs a graph neural network encoder and an LSTM-based stage estimator to infer probabilistic attacker stages aligned with the MITRE ATT&CK framework."

    The stage estimator is imported from the authors' own prior paper, yet the headline performance figures are produced by running the full system against CALDERA-generated traces in the testbed; the citation does not reduce the reported F1 or mitigation gains to the citation itself.

full rationale

The paper's central claims consist of measured F1-score (0.887) and mitigation success rate (84.7%) obtained from explicit testbed experiments using CALDERA playbooks. These are not quantities derived by definition or by fitting parameters to the same data. The single self-citation to StageFinder supports the LSTM stage estimator module but does not justify the performance numbers or force them by construction. The POMDP formulation, GNN encoder, and hierarchical PPO are standard components whose outputs are evaluated externally in the testbed rather than being tautological.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The framework relies on standard machine-learning components whose training involves many free parameters; no new physical entities are postulated.

free parameters (1)
  • GNN, LSTM, and PPO hyperparameters
    Standard training knobs whose specific values are not reported in the abstract but are required for the reported performance.
axioms (1)
  • domain assumption Enterprise networks and APT activity can be usefully modeled as a POMDP whose observations are provenance graphs.
    Explicitly stated as the problem formulation in the abstract.

pith-pipeline@v0.9.0 · 5498 in / 1493 out tokens · 53562 ms · 2026-05-15T10:23:34.232350+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    The enterprise environment is formulated as a partially observable Markov decision process (POMDP), in which host provenance and network telemetry are fused into unified provenance graphs... graph neural network encoder and an LSTM-based stage estimator... hierarchical Proximal Policy Optimization (PPO) agent

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Learning the apt kill chain: Temporal reasoning over provenance data for attack stage estimation,

    T. V . Phanet al., “Learning the apt kill chain: Temporal reasoning over provenance data for attack stage estimation,” inIEEE International Conference on Communications (ICC), 2026

  2. [2]

    A survey on advanced persistent threats: Tech- niques, solutions, challenges, and research opportunities,

    A. Alshamraniet al., “A survey on advanced persistent threats: Tech- niques, solutions, challenges, and research opportunities,”IEEE Com- munications Surveys & Tutorials, vol. 21, no. 2, pp. 1851–1877, 2019

  3. [3]

    Mitre att&ck®: A knowledge base of adversary tactics, techniques, and common knowledge

    T. M. Corporation, “Mitre att&ck®: A knowledge base of adversary tactics, techniques, and common knowledge.” https://attack.mitre.org,

  4. [4]

    Accessed: 2026-03-08

  5. [5]

    A survey on advanced persistent threat detection: A unified framework, challenges, and countermeasures,

    B. Zhanget al., “A survey on advanced persistent threat detection: A unified framework, challenges, and countermeasures,”ACM Comput. Surv., vol. 57, Nov. 2024

  6. [6]

    Explainable deep learning approach for advanced persistent threats (apts) detection in cybersecurity: A review,

    N. H. A. Mutalibet al., “Explainable deep learning approach for advanced persistent threats (apts) detection in cybersecurity: A review,” Artificial Intelligence Review, vol. 57, no. 11, p. 297, 2024

  7. [7]

    A survey of intrusion detection systems leveraging host data,

    R. A. Bridgeset al., “A survey of intrusion detection systems leveraging host data,”ACM Comput. Surv., vol. 52, Nov. 2019

  8. [8]

    MITRE Caldera: Automated Adversary Emulation Platform,

    T. M. Corporation, “MITRE Caldera: Automated Adversary Emulation Platform,” 2024. Open-source adversary-emulation system used for breach-and-attack simulation of APT playbooks

  9. [9]

    Deep reinforcement learning for cyber security,

    T. T. Nguyenet al., “Deep reinforcement learning for cyber security,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 3779–3795, 2021

  10. [10]

    Network intrusion response using deep reinforcement learning in an aircraft it-ot scenario,

    M. Reaneyet al., “Network intrusion response using deep reinforcement learning in an aircraft it-ot scenario,” inProceedings of the 19th International Conference on Availability, Reliability and Security, ARES ’24, Association for Computing Machinery, 2024

  11. [11]

    Automated apt defense using reinforcement learning and attack graph risk-based situation awareness,

    A. T. Leet al., “Automated apt defense using reinforcement learning and attack graph risk-based situation awareness,” inProceedings of the Workshop on Autonomous Cybersecurity, AutonomousCyber ’24, (New York, NY , USA), p. 23–33, Association for Computing Machinery, 2024

  12. [12]

    Leveraging deep reinforcement learning for cyber- attack paths prediction: Formulation, generalization, and evaluation,

    F. Terranovaet al., “Leveraging deep reinforcement learning for cyber- attack paths prediction: Formulation, generalization, and evaluation,” inProceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses, RAID ’24, (New York, NY , USA), p. 1–16, Association for Computing Machinery, 2024

  13. [13]

    Deep-shield: Multiphase mitigation of apt via hierarchical deep reinforcement learning,

    Y . Caoet al., “Deep-shield: Multiphase mitigation of apt via hierarchical deep reinforcement learning,”IEEE Internet of Things Journal, vol. 12, no. 15, pp. 30970–30982, 2025

  14. [14]

    Offline reinforcement learning for autonomous cyber de- fense agents,

    A. Weiet al., “Offline reinforcement learning for autonomous cyber de- fense agents,” in2024 Winter Simulation Conference (WSC), pp. 1978– 1989, 2024

  15. [15]

    Zeek: The Network Security Monitor (for- merly Bro)

    V . Paxson and T. Z. Project, “Zeek: The Network Security Monitor (for- merly Bro).” https://zeek.org/, 2019. Open-source network monitoring framework. Accessed: 2026-03-08

  16. [16]

    Planning and others,

    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and others,”Artificial Intelligence, vol. 101, no. 1–2, pp. 99–134, 1998

  17. [17]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal,et al., “Proximal policy optimiza- tion algorithms,”arXiv preprint arXiv:1707.06347, 2017

  18. [18]

    Practical whole-system provenance capture,

    T. Pasquieret al., “Practical whole-system provenance capture,” in Symposium on Cloud Computing (SoCC’17), ACM, 2017

  19. [19]

    Ppo hyperparameter optimization,

    J. Baptista, “Ppo hyperparameter optimization,”PhD Weekly Report,

  20. [20]

    Discusses hyperparameter ranges and effects specifically for Proximal Policy Optimization

  21. [21]

    Dynamic SDN-based radio access network slicing with deep reinforcement learning for URLLC and eMBB services,

    A. Filaliet al., “Dynamic SDN-based radio access network slicing with deep reinforcement learning for URLLC and eMBB services,” IEEE Transactions on Network Science and Engineering, vol. 9, no. 4, pp. 2174–2187, 2022