Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations
Pith reviewed 2026-05-07 09:13 UTC · model grok-4.3
The pith
An end-to-end LLM framework automates SOC threat detection, query generation, and resolution, reducing average triage time from hours to under 10 minutes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate an integrated LLM system built in three stages. First, an ensemble of the three top-performing models classifies threats in SIEM logs at 82.8 percent accuracy with a 0.120 false-positive rate. Second, the SQM architecture uses platform syntax constraints, metadata retrieval, and documentation prompting to generate executable queries that score 0.384 BLEU and 0.731 ROUGE-L, more than twice the baseline LLM results. Third, the collected evidence feeds a resolver that raises incident-resolution code prediction to 90 percent accuracy and yields a recommendation quality score of 8.70. The authors report under-10-minute triage in production deployments.
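For readers unfamiliar with the query-quality metric, ROUGE-L is an F-measure built on the longest common subsequence (LCS) between a generated and a reference token sequence. The following is a minimal sketch, assuming whitespace tokenization and the balanced F1 variant; the paper's exact tokenizer and weighting are not stated in this summary, and the two example queries are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (tokenization is an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated and reference queries, for illustration only.
generated = "SELECT sourceip FROM events WHERE username = 'admin'"
reference = "SELECT sourceip , username FROM events WHERE username = 'admin' LAST 24 HOURS"
score = rouge_l_f1(generated, reference)  # LCS = 8 tokens, so F1 = 16/21
```

A query can score well on token overlap and still fail to execute, so overlap metrics of this kind do not by themselves establish the executability claim.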
What carries the argument
The SQM (Syntax Query Metadata) architecture, which enforces platform-specific syntax, retrieves relevant metadata, and grounds prompts in documentation to produce executable queries for heterogeneous SIEM tools.
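The three ingredients named above can be pictured as a prompt-assembly step. The sketch below is not the authors' implementation: `PlatformProfile`, `retrieve_fields`, `build_prompt`, and the `ExampleSIEM` platform are all hypothetical names, and simple word overlap stands in for whatever retrieval method the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class PlatformProfile:
    """Hypothetical container for the three SQM inputs of one SIEM platform."""
    name: str
    syntax_rules: list[str]          # platform-specific syntax constraints
    field_metadata: dict[str, str]   # field name -> description
    doc_snippets: list[str]          # excerpts from platform documentation


def retrieve_fields(request: str, metadata: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank fields by word overlap with the request (a stand-in for real retrieval)."""
    words = set(request.lower().split())
    scored = []
    for field, desc in metadata.items():
        overlap = len(words & set(desc.lower().split()))
        if overlap:
            scored.append((overlap, field))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [field for _, field in scored[:top_k]]


def build_prompt(request: str, profile: PlatformProfile) -> str:
    """Assemble constraints, retrieved metadata, and documentation into one prompt."""
    fields = retrieve_fields(request, profile.field_metadata)
    parts = [
        f"Target platform: {profile.name}",
        "Syntax constraints:",
        *[f"- {rule}" for rule in profile.syntax_rules],
        "Relevant fields:",
        *[f"- {f}: {profile.field_metadata[f]}" for f in fields],
        "Documentation excerpts:",
        *[f"- {doc}" for doc in profile.doc_snippets],
        f"Analyst request: {request}",
        "Return only one executable query.",
    ]
    return "\n".join(parts)


profile = PlatformProfile(
    name="ExampleSIEM",  # hypothetical platform; not real QRadar or SecOps syntax
    syntax_rules=["Queries start with SELECT", "Time windows use LAST <n> HOURS"],
    field_metadata={
        "src_ip": "source ip address of the connection",
        "user": "account name used for the login attempt",
        "bytes_out": "outbound bytes transferred",
    },
    doc_snippets=["SELECT <fields> FROM events WHERE <condition> LAST <n> HOURS"],
)
prompt = build_prompt("failed login attempt from one source ip", profile)
```

Keeping the platform-specific material in a separate profile object is one plausible way to support heterogeneous SIEM tools: the assembly logic stays the same while only the profile changes.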
If this is right
- Ensemble detection on SIEM logs can reach 82.8 percent accuracy while holding false positives to 0.120.
- Syntax-constrained query generation more than doubles BLEU and ROUGE-L scores over unconstrained LLMs for IBM QRadar and Google SecOps.
- Adding SQM-derived evidence raises incident-resolution code prediction from 78.3 percent to 90 percent accuracy.
- The complete pipeline reduces average incident triage time in live SOC settings from several hours to under 10 minutes.
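To make the detection figures concrete, here is a minimal sketch of how three model outputs might be combined and scored. The combination rule is an assumption: this summary says only that the three best LLMs are ensembled, so majority voting is used as a plausible stand-in, and the labels and toy data are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; three voters on two labels cannot tie."""
    return Counter(predictions).most_common(1)[0][0]


def accuracy_and_fpr(y_true, y_pred, positive="threat"):
    """Accuracy and false-positive rate (false positives / actual negatives)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    false_pos = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    negatives = sum(t != positive for t in y_true)
    accuracy = correct / len(y_true)
    fpr = false_pos / negatives if negatives else 0.0
    return accuracy, fpr


# Toy data: one tuple of three model labels per log entry (hypothetical).
model_outputs = [
    ("threat", "threat", "benign"),
    ("benign", "benign", "benign"),
    ("threat", "benign", "benign"),
    ("threat", "threat", "threat"),
    ("benign", "threat", "threat"),
]
y_true = ["threat", "benign", "benign", "threat", "benign"]

ensemble = [majority_vote(votes) for votes in model_outputs]
accuracy, fpr = accuracy_and_fpr(y_true, ensemble)
```

On this toy set the ensemble gets four of five entries right and raises one false alarm among three benign entries, which shows why accuracy and false-positive rate need to be reported together.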
Where Pith is reading between the lines
- The same constrained-plus-retrieval pattern could be applied to other security platforms or to adjacent domains such as compliance checking and log forensics.
- The production deployment data already collected for the paper offers a natural starting point for measuring long-term drift as attack techniques evolve.
- If the low false-positive rate holds at scale, SOC teams could shift from reactive triage toward proactive threat hunting without adding headcount.
Load-bearing premise
The reported accuracy, query quality, and speed gains will continue when the system encounters the full variety of real-world threats, different SIEM platforms, and changing attack methods beyond the logs used in testing.
What would settle it
Running the full pipeline on a new collection of SIEM logs from additional organizations and attack types and measuring whether detection accuracy falls below 80 percent, query BLEU drops under 0.3, or average triage time rises above 15 minutes.
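The three thresholds above can be written down as an explicit check. This is a sketch only: `ReplicationResult` and `failed_criteria` are hypothetical names, and the second set of numbers is invented to show a failing case rather than taken from any study.

```python
from dataclasses import dataclass


@dataclass
class ReplicationResult:
    detection_accuracy: float   # fraction in [0, 1]
    query_bleu: float
    avg_triage_minutes: float


def failed_criteria(result):
    """Return the names of the stated thresholds that the result violates."""
    failures = []
    if result.detection_accuracy < 0.80:
        failures.append("detection_accuracy")
    if result.query_bleu < 0.3:
        failures.append("query_bleu")
    if result.avg_triage_minutes > 15:
        failures.append("avg_triage_minutes")
    return failures


# Reported figures, using 10 minutes as the upper bound of "under 10 minutes".
reported = ReplicationResult(0.828, 0.384, 10.0)
# Invented replication outcome, for illustration only.
hypothetical = ReplicationResult(0.76, 0.31, 22.0)

reported_failures = failed_criteria(reported)
hypothetical_failures = failed_criteria(hypothetical)
```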
Original abstract
Security Operations Centers (SOCs) face mounting operational challenges. These challenges come from increasing threat volumes, heterogeneous SIEM platforms, and time-consuming manual triage workflows. We present an end-to-end threat management framework that integrates ensemble-based detection, syntax-constrained query generation, and retrieval-augmented resolution support to automate critical security workflows. Our detection module evaluates both traditional machine learning classifiers and large language models (LLMs), then combines the three best-performing LLMs to create an ensemble model, achieving 82.8% accuracy while maintaining 0.120 false positive rate on SIEM logs. We introduce the SQM (Syntax Query Metadata) architecture for automated evidence collection. It uses platform-specific syntax constraints, metadata-based retrieval, and documentation-grounded prompting to generate executable queries for IBM QRadar and Google SecOps. SQM achieves a BLEU score of 0.384 and a ROUGE-L score of 0.731. These results are more than twice as good as the baseline LLM performance. For incident resolution and recommendation generation, we demonstrate that integrating SQM-derived evidence improves resolution code prediction accuracy from 78.3% to 90.0%, with an overall recommendation quality score of 8.70. In production SOC environments, our framework reduces average incident triage time from hours to under 10 minutes. This work demonstrates that domain-constrained LLM architectures with retrieval augmentation can meet the strict reliability and efficiency requirements of operational security environments at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an end-to-end LLM framework for SOC threat management that combines an ensemble detection model (82.8% accuracy, 0.120 FPR on SIEM logs), the SQM architecture for syntax-constrained query generation (BLEU 0.384, ROUGE-L 0.731 on IBM QRadar and Google SecOps), and retrieval-augmented resolution that raises accuracy from 78.3% to 90.0%. It claims that the integrated system reduces average incident triage time from hours to under 10 minutes in production SOC environments.
Significance. If the component-level metrics prove robust under proper evaluation and the end-to-end operational claim is substantiated, the work could meaningfully advance automation in security operations by showing how domain constraints and retrieval can make LLMs practical for detection, evidence collection, and resolution. The reported improvements over baselines and the focus on executable outputs for real SIEM platforms are concrete strengths that could inform follow-on research.
major comments (2)
- [Abstract] The central operational claim that the framework reduces average incident triage time from hours to under 10 minutes in production SOC environments is presented without any description of the evaluation methodology, number of incidents, before/after measurement protocol, analyst time-tracking method, controls for false-positive handling or integration overhead, or how component metrics aggregate to this figure. This is load-bearing for the paper's headline contribution.
- [Abstract] The reported metrics (82.8% ensemble accuracy, BLEU 0.384, resolution lift to 90.0%) are given without dataset sizes, train/test split details, statistical significance tests, or full baseline descriptions, leaving the empirical claims difficult to evaluate or reproduce from the provided text.
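Why the missing dataset sizes matter can be shown with a small significance calculation. The sketch below applies a pooled two-proportion z-test to the resolution lift from 78.3% to 90.0%. The sample sizes of 60 and 600 are hypothetical, with counts chosen only to approximate the reported rates, and treating the two conditions as independent samples is also an assumption; if the same incidents were scored with and without evidence, a paired test such as McNemar's would be the better fit.

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


# Hypothetical sample sizes; counts approximate 78.3% and 90.0%.
p_small = two_proportion_p_value(47, 60, 54, 60)
p_large = two_proportion_p_value(470, 600, 540, 600)
```

Under these assumptions the same lift gives a p-value of about 0.08 with 60 incidents per condition and a far smaller one with 600, so the strength of the claim cannot be judged without the sample size.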
minor comments (2)
- The SQM architecture is described at a high level; adding a diagram of the metadata-retrieval and syntax-constraint pipeline or pseudocode for the prompting steps would improve clarity.
- Consider expanding the discussion of limitations to address generalization to novel attack techniques and heterogeneous SIEM platforms beyond the evaluated logs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have revised the abstract to address the concerns about missing methodological details and empirical context, while preserving its conciseness. Our responses to the major comments are provided below.
- Referee: [Abstract] The central operational claim that the framework reduces average incident triage time from hours to under 10 minutes in production SOC environments is presented without any description of the evaluation methodology, number of incidents, before/after measurement protocol, analyst time-tracking method, controls for false-positive handling or integration overhead, or how component metrics aggregate to this figure. This is load-bearing for the paper's headline contribution.
  Authors: We agree that the abstract would be strengthened by additional context on this claim. The production deployment evaluation, including incident counts, time-tracking via SOC systems, controls for overhead, and aggregation from component metrics, is described in Section 6. We have revised the abstract to add a concise summary of the methodology and its relation to the component results.
  Revision: yes
- Referee: [Abstract] The reported metrics (82.8% ensemble accuracy, BLEU 0.384, resolution lift to 90.0%) are given without dataset sizes, train/test split details, statistical significance tests, or full baseline descriptions, leaving the empirical claims difficult to evaluate or reproduce from the provided text.
  Authors: We acknowledge this point. Dataset sizes, splits, statistical tests, and baseline details are provided in Sections 4.1–4.3 and 5. To improve self-containment, we have updated the abstract to include key parameters such as dataset sizes and split information, along with a note on significance testing.
  Revision: yes
Circularity Check
No circularity: all claims are empirical measurements on held-out data
The paper reports component-level empirical metrics (82.8% ensemble accuracy, 0.120 FPR, BLEU 0.384/ROUGE-L 0.731, 78.3% to 90.0% resolution accuracy) evaluated on SIEM logs and synthetic tasks. The end-to-end triage time claim is presented as an operational outcome without any equations, parameter fitting, or derivations that reduce to the inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to support the core results. This is a standard empirical ML application paper whose claims stand or fall on external validation rather than internal definitional loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs respond reliably to syntax constraints and documentation-grounded prompts for generating executable SIEM queries
invented entities (1)
- SQM architecture: no independent evidence