GenAI-Driven Threat Detection with Microsoft Security Copilot
Pith reviewed 2026-05-21 04:26 UTC · model grok-4.3
The pith
An autonomous agent uses language-model planning to investigate security incidents and generate novel detections for hidden threats at production scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DTDA combines a unified activity timeline spanning alerts, events, user behavior, and threat intelligence with versioned LLM prompt contracts, a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, and dynamic alert generation that supplies context-relevant titles, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack descriptions. When integrated into Microsoft Security Copilot and run across tens of thousands of customers, the agent achieves 80.1 percent precision from customer feedback, produces novel alerts for approximately 15 percent of investigated incidents, and recovers hidden malicious activity with 0.78 F1 in offline evaluation.
What carries the argument
The planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, guided by versioned LLM prompt contracts with schema validation and fail-closed suppression.
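The loop described above can be pictured with a small sketch. This is not the paper's implementation: the `Hypothesis` class, the `plan`, `execute`, and `investigate` functions, the `suspicious` flag, and the "more supporting than refuting evidence" rule are all hypothetical stand-ins, and the planner here is a plain function where the real system would call a language model.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One attack-specific hypothesis and the evidence gathered for it."""
    name: str
    supporting: list = field(default_factory=list)
    refuting: list = field(default_factory=list)


def plan(timeline):
    """Hypothetical planner: propose one hypothesis per suspicious event kind."""
    kinds = sorted({e["kind"] for e in timeline if e.get("suspicious")})
    return [Hypothesis(name=k) for k in kinds]


def execute(hypothesis, timeline):
    """Hypothetical executor: file matching events as supporting or refuting."""
    for event in timeline:
        if event["kind"] != hypothesis.name:
            continue
        if event.get("suspicious"):
            hypothesis.supporting.append(event["id"])
        else:
            hypothesis.refuting.append(event["id"])
    return hypothesis


def investigate(timeline, max_hypotheses=10):
    """Run the plan-then-execute loop; keep hypotheses with net support."""
    kept = []
    for hypothesis in plan(timeline)[:max_hypotheses]:
        hypothesis = execute(hypothesis, timeline)
        # Assumed decision rule: supporting evidence must outweigh refuting.
        if len(hypothesis.supporting) > len(hypothesis.refuting):
            kept.append(hypothesis)
    return kept


# Illustrative data only; event kinds and ids are made up.
timeline = [
    {"id": 1, "kind": "credential_access", "suspicious": True},
    {"id": 2, "kind": "credential_access", "suspicious": True},
    {"id": 3, "kind": "credential_access", "suspicious": False},
    {"id": 4, "kind": "lateral_movement", "suspicious": True},
    {"id": 5, "kind": "lateral_movement", "suspicious": False},
]
confirmed = investigate(timeline)
```

In this toy run only the `credential_access` hypothesis survives, because the `lateral_movement` hypothesis has as much refuting evidence as supporting evidence.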
If this is right
- DTDA can run continuously at industry scale, processing single-incident investigations in a median of 28 minutes at a median token cost of USD 2.04 with a 0.38 percent job-level failure rate.
- The agent reduces the need for defenders to maintain constantly updated detection logic by translating evolving attacker tradecraft into new alerts autonomously.
- Offline results show measurable gains when moving from GPT-4.1 to GPT-5.4 within the same planner-executor framework, indicating that model improvements directly translate to better recovery of hidden activity.
- Novel alerts appear in approximately 15 percent of investigated incidents while precision holds at 80.1 percent, suggesting that the system surfaces activity existing detectors miss.
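The abstract gives the GPT-5.4 F1 score and the two gains, but not the other two scores themselves. Assuming the gains are simple differences in F1, the implied scores can be worked out as below. The variable names are illustrative, and `f1` is the standard harmonic-mean definition, included only as a reminder of what the metric combines.

```python
# Figures quoted in the abstract.
gpt_5_4_f1 = 0.78
gain_over_gpt_4_1 = 0.12
gain_over_baseline = 0.26

# Implied scores, assuming each gain is a plain difference in F1.
gpt_4_1_f1 = round(gpt_5_4_f1 - gain_over_gpt_4_1, 2)
baseline_f1 = round(gpt_5_4_f1 - gain_over_baseline, 2)


def f1(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under that assumption the GPT-4.1 configuration sits at 0.66 F1 and the baseline at 0.52 F1.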
Where Pith is reading between the lines
- The combination of prompt contracts and bounded retries offers a practical pattern for making large-language-model agents reliable enough for high-stakes operational use.
- If the same architecture were applied to other security products beyond Microsoft Defender, it could create a consistent layer of adaptive investigation across endpoint, network, and cloud signals.
- Over time the agent could accumulate a growing library of attack hypotheses that become reusable building blocks for future investigations.
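The pattern of prompt contracts with bounded retries and fail-closed suppression can be sketched as follows. Everything here is an assumption made for illustration: the field names in `REQUIRED_FIELDS`, the severity values, the `call_model` callable, and the retry limit of three are hypothetical, and the paper's actual contract format is not described on this page.

```python
import json

# Hypothetical contract: required fields and their expected types.
REQUIRED_FIELDS = {"title": str, "severity": str, "evidence_ids": list}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate(raw, known_event_ids):
    """Return the parsed alert if it satisfies the contract, else None."""
    try:
        alert = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(alert, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(alert.get(name), expected_type):
            return None
    if alert["severity"] not in ALLOWED_SEVERITIES:
        return None
    # Grounding check: at least one cited id, and every id must be an
    # integer that exists in the timeline.
    ids = alert["evidence_ids"]
    if not ids or not all(isinstance(i, int) and i in known_event_ids for i in ids):
        return None
    return alert


def generate_alert(call_model, known_event_ids, max_attempts=3):
    """Retry a bounded number of times, then fail closed (no alert)."""
    for attempt in range(max_attempts):
        alert = validate(call_model(attempt), known_event_ids)
        if alert is not None:
            return alert
    return None  # Fail closed: suppress rather than emit an invalid alert.


# Stand-in for a model: malformed output, ungrounded output, then valid output.
responses = [
    "not json",
    json.dumps({"title": "Odd sign-in", "severity": "high", "evidence_ids": [99]}),
    json.dumps({"title": "Odd sign-in", "severity": "high", "evidence_ids": [1, 2]}),
]
alert = generate_alert(lambda i: responses[i], known_event_ids={1, 2, 3})
suppressed = generate_alert(lambda i: "not json", known_event_ids={1, 2, 3})
```

The design choice worth noting is the last line of `generate_alert`: when every attempt fails validation, the function returns nothing, so a malformed or ungrounded model response never reaches an analyst.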
Load-bearing premise
Customer feedback on alert precision serves as an unbiased proxy for true positive rate, and the agent loop does not systematically miss real threats or create inappropriate alerts across the full range of attack types.
What would settle it
A controlled test that injects a set of known attack scenarios into live customer environments and measures the fraction of those scenarios for which DTDA either detects the activity or correctly generates a new alert against ground-truth labels.
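The measurement such a test would produce is a simple fraction. The sketch below assumes each injected scenario ends up with one of three hypothetical outcome labels (`detected`, `new_alert`, or `missed`); the function name, the labels, and the scenario ids are illustrative and do not come from the paper.

```python
def detection_rate(injected, outcomes):
    """Fraction of injected scenarios that were detected or newly alerted.

    `injected` is the list of ground-truth scenario ids; `outcomes` maps a
    scenario id to its observed label. Scenarios without a recorded
    outcome count as misses.
    """
    if not injected:
        raise ValueError("at least one injected scenario is required")
    hits = sum(1 for s in injected if outcomes.get(s) in {"detected", "new_alert"})
    return hits / len(injected)


# Made-up example: four injected scenarios, one with no recorded outcome.
injected = ["s1", "s2", "s3", "s4"]
outcomes = {"s1": "detected", "s2": "new_alert", "s3": "missed"}
rate = detection_rate(injected, outcomes)
```

Because the denominator is the full set of injected scenarios, this figure captures misses directly, which precision computed from customer feedback cannot.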
Original abstract
Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.
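Item (1) in the abstract, the unified activity timeline, can be pictured as a merge of several time-ordered sources into one stream. The sketch below is an assumption about the general idea, not the paper's design: the tuple layout `(timestamp, source, description)`, the sample records, and the use of `heapq.merge` are all illustrative.

```python
import heapq
from operator import itemgetter

# Each source is already sorted by timestamp (first tuple element).
# All records below are made up for illustration.
alerts = [(1, "alert", "Suspicious sign-in"), (7, "alert", "Malware detected")]
events = [(2, "event", "Process created"), (5, "event", "File written")]
threat_intel = [(4, "threat_intel", "IP matched indicator")]

# heapq.merge lazily interleaves sorted inputs into one sorted stream.
timeline = list(heapq.merge(alerts, events, threat_intel, key=itemgetter(0)))
```

Merging on a shared timestamp gives an investigation loop a single ordered view in which an alert can be read alongside the events and intelligence hits around it.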
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent integrated into Microsoft Security Copilot that investigates security incidents in Microsoft Defender. DTDA combines a unified activity timeline, versioned LLM prompt contracts with schema validation and fail-closed suppression, a planner-executor loop for hypothesis generation and evidence gathering, and dynamic alert generation including titles, severity, MITRE mappings, and remediation guidance. The paper reports a 120-day online evaluation across tens of thousands of customers yielding 80.1% precision from customer feedback with novel alerts on approximately 15% of investigated incidents, plus offline results of 0.78 F1 using GPT-5.4 (improving 0.12 F1 over GPT-4.1 and 0.26 F1 over baseline), and operational metrics of median 28-minute end-to-end investigations at median USD 2.04 token cost with 0.38% job failure rate.
Significance. If the reported results hold under rigorous independent validation, the work would demonstrate the viability of autonomous GenAI agents for continuous threat detection at production scale, moving security operations from reactive rule maintenance toward proactive discovery of hidden malicious activity. The large-scale deployment, specific numerical outcomes from both online customer feedback and offline F1 comparisons, and concrete operational metrics (time, cost, failure rate) provide practical evidence that could inform similar LLM-agent systems in security. The use of prompt contracts with grounding and bounded retries is a concrete engineering contribution.
major comments (2)
- [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.
- [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.
minor comments (2)
- [Results] Clarify the exact model versions referenced as GPT-5.4 and GPT-4.1 (internal releases, dates, or public equivalents) to allow reproducibility and comparison with external work.
- [Related Work] Add citations to prior academic work on LLM agents for security incident investigation and prompt-engineering techniques with schema validation to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the corresponding revisions to the manuscript.
- Referee: [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.
  Authors: We agree that greater transparency on the online evaluation would strengthen the paper. Customer feedback is gathered through the standard voluntary rating interface in Microsoft Defender and Security Copilot, where analysts explicitly mark generated alerts as true or false positives. Due to privacy regulations and the deployment scale across tens of thousands of customers, we cannot release granular statistics such as exact response volumes or selection criteria. We have added a high-level description of the feedback mechanism and a limitations paragraph acknowledging potential biases and the proxy nature of the metric. These changes appear in the revised Evaluation section.
  Revision: partial
- Referee: [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.
  Authors: We appreciate this observation. The offline labels were produced by expert security analysts who reviewed complete incident timelines and external threat intelligence to identify activity missed by existing detections. The evaluation incidents were sampled from a disjoint time window and customer set that was not used during prompt-contract development. The baseline is a non-agentic LLM prompt performing direct classification without hypothesis generation or evidence collection. We have expanded Section 4.2 with these methodological details, including a summary of dataset construction and baseline features, to support interpretation of the reported gains.
  Revision: yes
Circularity Check
No circularity: empirical results presented as direct measurements
The paper reports empirical outcomes from a 120-day online deployment (80.1% precision via customer feedback, ~15% novel alerts) and offline evaluation (0.78 F1 with GPT-5.4). These quantities are measured directly from system operation and labeled incident data rather than derived via equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce to prior fitted values or self-citations by construction; the central claims rest on external deployment metrics and comparisons that remain independently falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models guided by versioned prompt contracts with schema validation and grounding requirements can generate accurate attack hypotheses and supporting evidence from security timelines.
invented entities (1)
- Dynamic Threat Detection Agent (DTDA): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "DTDA combines: (1) a unified activity timeline... (2) versioned LLM prompt contracts... (3) a planner-executor investigation loop... (4) dynamic alert generation... In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.