arxiv: 2605.20896 · v2 · pith:4FU7PMUWnew · submitted 2026-05-20 · 💻 cs.CR · cs.AI · cs.LG

GenAI-Driven Threat Detection with Microsoft Security Copilot

Pith reviewed 2026-05-21 04:26 UTC · model grok-4.3

keywords threat detection · autonomous agents · large language models · security incidents · malicious activity · cybersecurity · incident investigation · adaptive detection

The pith

An autonomous agent uses language-model planning to investigate security incidents and generate novel detections for hidden threats at production scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Dynamic Threat Detection Agent, an always-on system that examines security incidents in Microsoft Defender by building a unified activity timeline and running a planner-executor loop. The agent forms attack-specific hypotheses, gathers evidence, and fills gaps in existing alerts by producing new explainable detections with titles, severity ratings, and remediation steps. In a 120-day real-world deployment it reaches 80.1 percent precision according to customer feedback and surfaces novel alerts in roughly 15 percent of cases examined. Offline tests show it recovers hidden malicious activity with an F1 score of 0.78, an improvement over prior model versions and baselines. These results indicate that such agents can shift defenders from constant manual rule updates toward continuous, adaptive investigation.

Core claim

DTDA combines a unified activity timeline spanning alerts, events, user behavior, and threat intelligence with versioned LLM prompt contracts, a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, and dynamic alert generation that supplies context-relevant titles, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack descriptions. When integrated into Microsoft Security Copilot and run across tens of thousands of customers, the agent achieves 80.1 percent precision from customer feedback, produces novel alerts for approximately 15 percent of investigated incidents, and recovers hidden malicious activity with an F1 score of 0.78 in offline evaluation.
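To make the "unified activity timeline" idea concrete, here is a minimal sketch that merges per-source streams into one time-ordered list. The `TimelineEntry` type, the `build_timeline` helper, and the sample entries are hypothetical illustrations; the paper does not describe its data model at this level of detail.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    source: str   # hypothetical labels: "alert", "event", "ueba", "threat_intel"
    summary: str


def build_timeline(*streams):
    """Merge streams that are each already sorted by time into one timeline."""
    return list(merge(*streams, key=lambda entry: entry.timestamp))


# Invented sample data, for illustration only.
alerts = [TimelineEntry(datetime(2026, 1, 1, 9, 0), "alert", "Suspicious sign-in")]
events = [
    TimelineEntry(datetime(2026, 1, 1, 8, 55), "event", "New inbox rule created"),
    TimelineEntry(datetime(2026, 1, 1, 9, 5), "event", "Mass file download"),
]

timeline = build_timeline(alerts, events)
for entry in timeline:
    print(entry.timestamp.isoformat(), entry.source, entry.summary)
```

Interleaving the sources by timestamp is what lets a later investigation step see an alert next to the raw events that surround it.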

What carries the argument

The planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, guided by versioned LLM prompt contracts with schema validation and fail-closed suppression.
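The loop described above can be sketched as follows. This is an assumption-laden toy, not the authors' implementation: `plan` and `execute` stand in for LLM-backed components, and the keyword rules, hypothesis names, and `max_steps` bound are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    supporting: list = field(default_factory=list)
    refuting: list = field(default_factory=list)


def investigate(timeline, plan, execute, max_steps=5):
    """Bounded loop: the planner proposes a hypothesis, the executor tests it."""
    tested = []
    for _ in range(max_steps):
        hypothesis = plan(timeline, tested)  # next hypothesis, or None when done
        if hypothesis is None:
            break
        for entry in timeline:
            outcome = execute(hypothesis, entry)  # "supports", "refutes", or None
            if outcome == "supports":
                hypothesis.supporting.append(entry)
            elif outcome == "refutes":
                hypothesis.refuting.append(entry)
        tested.append(hypothesis)
    return tested


# Toy stand-ins for the LLM planner and executor (hypothetical rules).
def toy_plan(timeline, tested):
    done = {h.description for h in tested}
    for candidate in ("credential theft", "data exfiltration"):
        if candidate not in done:
            return Hypothesis(candidate)
    return None


def toy_execute(hypothesis, entry):
    keyword = {"credential theft": "sign-in",
               "data exfiltration": "download"}[hypothesis.description]
    if keyword not in entry:
        return None
    return "refutes" if "approved by admin" in entry else "supports"


timeline = [
    "suspicious sign-in from new country",
    "mass file download",
    "download approved by admin",
]
results = investigate(timeline, toy_plan, toy_execute)
for h in results:
    print(h.description, len(h.supporting), len(h.refuting))
```

Recording refuting evidence alongside supporting evidence matters here: a hypothesis with one supporting and one refuting entry is a weaker basis for an alert than one with only support.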

If this is right

  • DTDA can run continuously at industry scale, processing single-incident investigations in a median of 28 minutes at a median token cost of USD 2.04 with a 0.38 percent job-level failure rate.
  • The agent reduces the need for defenders to maintain constantly updated detection logic by translating evolving attacker tradecraft into new alerts autonomously.
  • Offline results show measurable gains when moving from GPT-4.1 to GPT-5.4 within the same planner-executor framework, indicating that model improvements directly translate to better recovery of hidden activity.
  • Novel alerts appear in 15 percent of investigated incidents while maintaining 80.1 percent precision, suggesting the system surfaces activity that existing detectors miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The combination of prompt contracts and bounded retries offers a practical pattern for making large-language-model agents reliable enough for high-stakes operational use.
  • If the same architecture were applied to other security products beyond Microsoft Defender, it could create a consistent layer of adaptive investigation across endpoint, network, and cloud signals.
  • Over time the agent could accumulate a growing library of attack hypotheses that become reusable building blocks for future investigations.
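The prompt-contract pattern mentioned in the first point above (schema validation, grounding, bounded retries, fail-closed suppression) can be sketched as below. The field names, allowed verdicts, and retry count are assumptions chosen for illustration, and `call_model` is a hypothetical stand-in for an LLM call.

```python
import json

REQUIRED_KEYS = {"hypothesis", "verdict", "evidence_ids"}   # assumed schema
ALLOWED_VERDICTS = {"malicious", "benign", "inconclusive"}  # assumed values


def validate(raw, known_evidence_ids):
    """Return the parsed output if it satisfies the contract, else None."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(out, dict) or set(out) != REQUIRED_KEYS:
        return None
    if not isinstance(out["verdict"], str) or out["verdict"] not in ALLOWED_VERDICTS:
        return None
    ids = out["evidence_ids"]
    if not isinstance(ids, list) or not ids or not all(isinstance(i, str) for i in ids):
        return None
    # Grounding: every cited evidence id must exist in the timeline.
    if not set(ids) <= set(known_evidence_ids):
        return None
    return out


def run_with_contract(call_model, known_evidence_ids, max_retries=2):
    """Retry a bounded number of times; suppress (return None) if all attempts fail."""
    for _ in range(max_retries + 1):
        parsed = validate(call_model(), known_evidence_ids)
        if parsed is not None:
            return parsed
    return None  # fail closed: no alert is emitted


known = {"ev-1", "ev-2"}
replies = iter([
    "not json at all",
    '{"hypothesis": "data exfiltration", "verdict": "malicious", "evidence_ids": ["ev-9"]}',
    '{"hypothesis": "data exfiltration", "verdict": "malicious", "evidence_ids": ["ev-2"]}',
])
result = run_with_contract(lambda: next(replies), known)
suppressed = run_with_contract(lambda: "{}", known)
print(result, suppressed)
```

In this sketch the first reply fails parsing, the second fails grounding because it cites an unknown evidence id, and the third passes. A model that never produces valid output yields no alert at all rather than a malformed one.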

Load-bearing premise

Customer feedback on alert precision serves as an unbiased proxy for true positive rate, and the agent loop does not systematically miss real threats or create inappropriate alerts across the full range of attack types.

What would settle it

A controlled test that injects a set of known attack scenarios into live customer environments and measures the fraction of those scenarios for which DTDA either detects the activity or correctly generates a new alert against ground-truth labels.
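The measurement proposed above reduces to recall over the injected scenarios. A minimal sketch, assuming each injected scenario has an identifier and that detections can be matched back to those identifiers (both assumptions; the scenario ids are invented):

```python
def injected_recall(injected_ids, detected_ids):
    """Fraction of injected scenarios that were detected or newly alerted on."""
    injected = set(injected_ids)
    if not injected:
        raise ValueError("at least one injected scenario is required")
    # Detections that do not correspond to an injected scenario are ignored here;
    # they would count toward precision, not recall.
    return len(injected & set(detected_ids)) / len(injected)


injected = {"s1", "s2", "s3", "s4"}
detected = {"s1", "s3", "s4", "x9"}
print(injected_recall(injected, detected))  # 0.75
```

Unlike customer feedback, this design observes false negatives directly, because the set of true attacks is known in advance.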

Figures

Figures reproduced from arXiv: 2605.20896 by Amir Gharib, Scott Freitas.

Figure 1: Overview of the DTDA architecture: an industry-scale framework for autonomous threat detection. DTDA builds incident-centered activity timelines from alerts, events, UEBA, and threat intel; runs a bounded planner-executor investigation to gather supporting and refuting evidence; and emits a dynamic alert when the investigation identifies novel malicious activity.
Figure 2: Overview of the bounded planner-executor inves…
Figure 3: Runtime scaling by number of incidents per job, line…
Original abstract

Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.
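The offline comparison in the abstract implies F1 values for the two comparison systems that are not stated explicitly. The sketch below derives them, assuming the reported gaps are absolute F1 differences on the same evaluation set, and recalls the standard F1 definition. The variable names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


DTDA_GPT_5_4_F1 = 0.78  # reported in the abstract

# Implied values, assuming the stated gaps are absolute F1 differences.
implied_gpt_4_1_f1 = round(DTDA_GPT_5_4_F1 - 0.12, 2)
implied_baseline_f1 = round(DTDA_GPT_5_4_F1 - 0.26, 2)

print(implied_gpt_4_1_f1)   # 0.66
print(implied_baseline_f1)  # 0.52
```

Note that the 80.1% online precision and the 0.78 offline F1 come from different evaluations, so they cannot be combined through `f1` to back out a recall figure.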

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent integrated into Microsoft Security Copilot that investigates security incidents in Microsoft Defender. DTDA combines a unified activity timeline, versioned LLM prompt contracts with schema validation and fail-closed suppression, a planner-executor loop for hypothesis generation and evidence gathering, and dynamic alert generation including titles, severity, MITRE mappings, and remediation guidance. The paper reports a 120-day online evaluation across tens of thousands of customers yielding 80.1% precision from customer feedback with novel alerts on approximately 15% of investigated incidents, plus offline results of 0.78 F1 using GPT-5.4 (improving 0.12 F1 over GPT-4.1 and 0.26 F1 over baseline), and operational metrics of median 28-minute end-to-end investigations at median USD 2.04 token cost with 0.38% job failure rate.

Significance. If the reported results hold under rigorous independent validation, the work would demonstrate the viability of autonomous GenAI agents for continuous threat detection at production scale, moving security operations from reactive rule maintenance toward proactive discovery of hidden malicious activity. The large-scale deployment, specific numerical outcomes from both online customer feedback and offline F1 comparisons, and concrete operational metrics (time, cost, failure rate) provide practical evidence that could inform similar LLM-agent systems in security. The use of prompt contracts with grounding and bounded retries is a concrete engineering contribution.

major comments (2)
  1. [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.
  2. [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.
minor comments (2)
  1. [Results] Clarify the exact model versions referenced as GPT-5.4 and GPT-4.1 (internal releases, dates, or public equivalents) to allow reproducibility and comparison with external work.
  2. [Related Work] Add citations to prior academic work on LLM agents for security incident investigation and prompt-engineering techniques with schema validation to better situate the contribution.
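The first major comment turns on feedback volume, which the paper does not report. As an illustration of why that number matters, the sketch below computes a Wilson score interval for an observed precision of about 80% at several hypothetical sample sizes. The sample sizes are invented, and the interval captures sampling error only, not the selection or confirmation bias the referee raises.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical feedback volumes; the true volume is not disclosed.
for n in (100, 1_000, 10_000):
    true_positives = round(0.801 * n)  # rounded to a whole count of rated alerts
    low, high = wilson_interval(true_positives, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of rated alerts grows, which is why the same headline precision carries very different evidential weight depending on how much feedback stands behind it.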

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.

    Authors: We agree that greater transparency on the online evaluation would strengthen the paper. Customer feedback is gathered through the standard voluntary rating interface in Microsoft Defender and Security Copilot, where analysts explicitly mark generated alerts as true or false positives. Due to privacy regulations and the deployment scale across tens of thousands of customers, we cannot release granular statistics such as exact response volumes or selection criteria. We have added a high-level description of the feedback mechanism and a limitations paragraph acknowledging potential biases and the proxy nature of the metric. These changes appear in the revised Evaluation section.
    revision: partial

  2. Referee: [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.

    Authors: We appreciate this observation. The offline labels were produced by expert security analysts who reviewed complete incident timelines and external threat intelligence to identify activity missed by existing detections. The evaluation incidents were sampled from a disjoint time window and customer set that was not used during prompt-contract development. The baseline is a non-agentic LLM prompt performing direct classification without hypothesis generation or evidence collection. We have expanded Section 4.2 with these methodological details, including a summary of dataset construction and baseline features, to support interpretation of the reported gains.
    revision: yes
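The disjointness claim in this response (separate time window and separate customer set) can be illustrated with a small split routine. The record fields, cutoff date, and customer labels below are hypothetical; the rebuttal does not specify how the split was implemented.

```python
from datetime import date


def split_incidents(incidents, cutoff, dev_customers):
    """Development set: before the cutoff, development customers only.
    Evaluation set: on or after the cutoff, all other customers only.
    Incidents matching neither rule are dropped to keep the sets disjoint."""
    dev = [i for i in incidents
           if i["date"] < cutoff and i["customer"] in dev_customers]
    evaluation = [i for i in incidents
                  if i["date"] >= cutoff and i["customer"] not in dev_customers]
    return dev, evaluation


# Invented sample records.
incidents = [
    {"id": 1, "date": date(2026, 1, 5), "customer": "a"},
    {"id": 2, "date": date(2026, 3, 5), "customer": "a"},  # dropped: dev customer, late
    {"id": 3, "date": date(2026, 3, 6), "customer": "b"},
    {"id": 4, "date": date(2026, 1, 7), "customer": "b"},  # dropped: eval customer, early
]
dev, evaluation = split_incidents(incidents, date(2026, 2, 1), {"a"})
print([i["id"] for i in dev], [i["id"] for i in evaluation])
```

Requiring both conditions at once guards against two leakage paths: a prompt tuned on a customer's earlier incidents being scored on that customer's later ones, and a prompt being scored on incidents from the period it was tuned on.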

Circularity Check

0 steps flagged

No circularity: empirical results presented as direct measurements

Full rationale

The paper reports empirical outcomes from a 120-day online deployment (80.1% precision via customer feedback, ~15% novel alerts) and offline evaluation (0.78 F1 with GPT-5.4). These quantities are measured directly from system operation and labeled incident data rather than derived via equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce to prior fitted values or self-citations by construction; the central claims rest on external deployment metrics and comparisons that remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the premise that carefully engineered LLM prompts plus a unified data timeline are sufficient to produce reliable attack hypotheses and evidence gathering at scale; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Large language models guided by versioned prompt contracts with schema validation and grounding requirements can generate accurate attack hypotheses and supporting evidence from security timelines.
    This assumption directly enables the planner-executor investigation loop and dynamic alert generation described in the abstract.
invented entities (1)
  • Dynamic Threat Detection Agent (DTDA): no independent evidence
    purpose: Always-on adaptive agent that investigates incidents and generates explainable detections when attack-story gaps are found
    The agent is the primary new system introduced by the paper.


