arXiv: 2604.27321 · v1 · submitted 2026-04-30 · 💻 cs.CR · cs.AI · cs.IR

Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations

Pith reviewed 2026-05-07 09:13 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.IR
keywords SOC automation · LLM ensemble · threat detection · SIEM query generation · incident resolution · security operations · retrieval augmentation

The pith

An end-to-end LLM framework automates SOC threat detection, query generation, and resolution, reducing average triage time from hours to under 10 minutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a complete pipeline that combines an ensemble of large language models for spotting threats in SIEM logs, a syntax-constrained query generator called SQM for collecting evidence from specific platforms (IBM QRadar and Google SecOps), and retrieval-augmented support for writing resolution steps. The authors report that the system reaches 82.8 percent detection accuracy at a 0.120 false-positive rate, more than doubles BLEU and ROUGE-L query scores relative to plain LLMs, and lifts resolution-code prediction accuracy from 78.3 to 90 percent. In live SOC use, they report that manual triage shrinks from hours to under ten minutes. The central point is that adding domain rules and retrieval to LLMs can produce reliable automation for security operations that normally demand constant human oversight.

Core claim

The authors demonstrate an integrated LLM system with three stages. First, an ensemble of the three best-performing models classifies threats in SIEM logs at 82.8 percent accuracy and a 0.120 false-positive rate. Second, the SQM architecture uses platform syntax constraints, metadata retrieval, and documentation-grounded prompting to generate executable queries that score 0.384 BLEU and 0.731 ROUGE-L, more than twice the baseline LLM results. Third, the collected evidence feeds a resolver that raises incident-resolution code prediction from 78.3 to 90 percent accuracy and yields an overall recommendation quality score of 8.70. The authors report under-10-minute triage in production deployments.
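
The paper describes combining the three best-performing LLMs with a majority-voting ensemble strategy. A minimal sketch of that voting step follows, assuming each model is wrapped as a function that maps a log line to a label. The stub classifiers, the label names, and the tie-break rule are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[str], str]  # hypothetical wrapper: log text -> label


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most common label; on a tie, fall back to the first voter."""
    counts = Counter(labels)
    top_count = max(counts.values())
    winners = [label for label, n in counts.items() if n == top_count]
    # Three voters over two classes always yield a strict majority;
    # the fallback only matters when more than two labels are possible.
    return winners[0] if len(winners) == 1 else labels[0]


def ensemble_classify(log_line: str, classifiers: Sequence[Classifier]) -> str:
    """Ask every classifier for a label and combine the answers by majority vote."""
    return majority_vote([clf(log_line) for clf in classifiers])


# Stand-in classifiers (keyword rules, not real LLM calls) so the sketch runs.
stub_models = [
    lambda log: "threat" if "failed login" in log else "benign",
    lambda log: "threat" if "admin" in log else "benign",
    lambda log: "benign",
]

print(ensemble_classify("failed login for admin from 10.0.0.5", stub_models))  # prints "threat"
```

With an odd number of voters and binary labels, no tie-break is ever needed, which is one plausible reason to ensemble exactly three models.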

What carries the argument

The SQM (Syntax Query Metadata) architecture, which enforces platform-specific syntax, retrieves relevant metadata, and grounds prompts in documentation to produce executable queries for heterogeneous SIEM tools.
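
One way to picture how the three SQM ingredients could come together is a prompt builder that retrieves similar stored queries by their metadata and then prepends syntax rules and documentation excerpts. This is a sketch under stated assumptions: the `Exemplar` type, the Jaccard similarity, the prompt layout, and all sample strings are hypothetical, and the stored queries are placeholders, not real AQL or YARA-L.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Exemplar:
    description: str  # metadata: what the stored query is for
    query: str        # stored expert query (placeholder text in this sketch)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for whatever similarity measure the paper uses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(request: str, pool: Sequence[Exemplar], k: int = 2) -> List[Exemplar]:
    """Return the k exemplars whose metadata best matches the request."""
    ranked = sorted(pool, key=lambda e: jaccard(request, e.description), reverse=True)
    return ranked[:k]


def build_prompt(request: str, platform: str, syntax_rules: Sequence[str],
                 doc_snippets: Sequence[str], pool: Sequence[Exemplar], k: int = 2) -> str:
    """Assemble a prompt from syntax constraints, documentation, and retrieved exemplars."""
    parts = [f"Target platform: {platform}", "Syntax constraints:"]
    parts += [f"- {rule}" for rule in syntax_rules]
    parts.append("Documentation excerpts:")
    parts += [f"- {doc}" for doc in doc_snippets]
    parts.append("Similar past queries:")
    parts += [f"- {e.description} => {e.query}" for e in retrieve(request, pool, k)]
    parts.append(f"Task: write one executable query for: {request}")
    return "\n".join(parts)


pool = [
    Exemplar("failed logins grouped by source ip", "<expert query A>"),
    Exemplar("outbound traffic to rare domains", "<expert query B>"),
]
prompt = build_prompt(
    request="failed logins from one source ip",
    platform="IBM QRadar",
    syntax_rules=["<platform syntax rule>"],
    doc_snippets=["<documentation excerpt>"],
    pool=pool,
    k=1,
)
print(prompt)
```

The design point is that the model sees constraints and close exemplars before the task, so field names and operators are anchored by retrieved material instead of being guessed.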

If this is right

  • Ensemble detection on SIEM logs can reach 82.8 percent accuracy while holding false positives to 0.120.
  • Syntax-constrained query generation more than doubles BLEU and ROUGE-L scores over unconstrained LLMs for IBM QRadar and Google SecOps.
  • Adding SQM-derived evidence raises incident-resolution code prediction from 78.3 percent to 90 percent accuracy.
  • The complete pipeline reduces average incident triage time in live SOC settings from several hours to under 10 minutes.
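
For readers unfamiliar with the query-quality metric cited above, ROUGE-L scores a generated query by the longest common subsequence (LCS) of tokens it shares with a reference query. The sketch below computes the balanced F1 form with whitespace tokenization. The paper's tokenizer and F-weighting are not stated in this summary, so both are assumptions, and the example strings are toy text, not real SIEM queries.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced ROUGE-L: harmonic mean of LCS precision and LCS recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "find failed logins grouped by source ip"
candidate = "find failed logins by source"
score = rouge_l_f1(candidate, reference)
print(round(score, 3))
```

Here all five candidate tokens appear in order in the seven-token reference, so precision is 1, recall is 5/7, and the F1 is 5/6. Because the metric rewards in-order overlap, it is sensitive to query structure but says nothing about whether a query actually executes.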

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constrained-plus-retrieval pattern could be applied to other security platforms or to adjacent domains such as compliance checking and log forensics.
  • Production deployment data already collected in the paper offers a natural next step for measuring long-term drift as attack techniques evolve.
  • If the low false-positive rate holds at scale, SOC teams could shift from reactive triage toward proactive threat hunting without adding headcount.

Load-bearing premise

The reported accuracy, query quality, and speed gains will continue when the system encounters the full variety of real-world threats, different SIEM platforms, and changing attack methods beyond the logs used in testing.

What would settle it

Running the full pipeline on a new collection of SIEM logs from additional organizations and attack types and measuring whether detection accuracy falls below 80 percent, query BLEU drops under 0.3, or average triage time rises above 15 minutes.
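
The three thresholds above can be written down as an explicit check. This is a sketch only: the `ReplicationResult` type and its field names are hypothetical, and the 9.0-minute value in the usage example is a stand-in for the paper's "under 10 minutes", not a reported number.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplicationResult:
    detection_accuracy: float   # fraction in [0, 1]
    query_bleu: float           # BLEU on generated queries
    mean_triage_minutes: float  # average triage time per incident


def failed_criteria(result: ReplicationResult) -> List[str]:
    """List which of the three stated thresholds a replication run violates."""
    failures = []
    if result.detection_accuracy < 0.80:
        failures.append("detection accuracy below 80 percent")
    if result.query_bleu < 0.30:
        failures.append("query BLEU under 0.3")
    if result.mean_triage_minutes > 15:
        failures.append("average triage time above 15 minutes")
    return failures


# Reported accuracy and BLEU, plus a stand-in triage time (assumption).
reported = ReplicationResult(detection_accuracy=0.828, query_bleu=0.384, mean_triage_minutes=9.0)
print(failed_criteria(reported))  # prints []
```

Any non-empty list from a run on new organizations and attack types would count against the load-bearing premise.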

Figures

Figures reproduced from arXiv: 2604.27321 by Akramul Azim, Md Hasan Saju.

Figure 1: End-to-End Threat Management Framework Overview
Adjacent text (truncated in extraction): "… metadata, and behavioral signals. To improve detection robustness and reduce individual model bias, the three best-performing LLMs are combined using a majority-voting ensemble strategy. Following ensemble-based classification, detected critical events are prioritized using a risk scoring mechanism derived from SIEM-generated metadata. Each log source include…"

Figure 2: Query Generation Architecture
Adjacent text (truncated in extraction): "… as high-similarity exemplars that anchor query generation and reduce ambiguity in operator selection, field usage, and aggregation patterns. To further improve correctness, SQM incorporates a documentation-based knowledge base built from official AQL and YARA-L 2.0 documentation, which provides platform rules and guidance for constructing valid queries. The final query generat…"

Figure 3: Comparison of query generation and threat resolution.
Adjacent text (truncated in extraction): "… with framework-generated outputs, showing that SQM achieves high structural and semantic alignment with expert-defined logic, but baseline LLM queries have syntax issues." The excerpt is followed by the section heading "4.3. Resolution Code Prediction with and without SQM".

Abstract

Security Operations Centers (SOCs) face mounting operational challenges. These challenges come from increasing threat volumes, heterogeneous SIEM platforms, and time-consuming manual triage workflows. We present an end-to-end threat management framework that integrates ensemble-based detection, syntax-constrained query generation, and retrieval-augmented resolution support to automate critical security workflows. Our detection module evaluates both traditional machine learning classifiers and large language models (LLMs), then combines the three best-performing LLMs to create an ensemble model, achieving 82.8% accuracy while maintaining 0.120 false positive rate on SIEM logs. We introduce the SQM (Syntax Query Metadata) architecture for automated evidence collection. It uses platform-specific syntax constraints, metadata-based retrieval, and documentation-grounded prompting to generate executable queries for IBM QRadar and Google SecOps. SQM achieves a BLEU score of 0.384 and a ROUGE-L score of 0.731. These results are more than twice as good as the baseline LLM performance. For incident resolution and recommendation generation, we demonstrate that integrating SQM-derived evidence improves resolution code prediction accuracy from 78.3% to 90.0%, with an overall recommendation quality score of 8.70. In production SOC environments, our framework reduces average incident triage time from hours to under 10 minutes. This work demonstrates that domain-constrained LLM architectures with retrieval augmentation can meet the strict reliability and efficiency requirements of operational security environments at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an end-to-end LLM framework for SOC threat management that combines an ensemble detection model (82.8% accuracy, 0.120 FPR on SIEM logs), the SQM architecture for syntax-constrained query generation (BLEU 0.384, ROUGE-L 0.731 on IBM QRadar and Google SecOps), and retrieval-augmented resolution that raises accuracy from 78.3% to 90.0%. It claims that the integrated system reduces average incident triage time from hours to under 10 minutes in production SOC environments.

Significance. If the component-level metrics prove robust under proper evaluation and the end-to-end operational claim is substantiated, the work could meaningfully advance automation in security operations by showing how domain constraints and retrieval can make LLMs practical for detection, evidence collection, and resolution. The reported improvements over baselines and the focus on executable outputs for real SIEM platforms are concrete strengths that could inform follow-on research.

major comments (2)
  1. [Abstract] The central operational claim that the framework reduces average incident triage time from hours to under 10 minutes in production SOC environments is presented without any description of the evaluation methodology, number of incidents, before/after measurement protocol, analyst time-tracking method, controls for false-positive handling or integration overhead, or how component metrics aggregate to this figure. This is load-bearing for the paper's headline contribution.
  2. [Abstract] The reported metrics (82.8% ensemble accuracy, BLEU 0.384, resolution lift to 90.0%) are given without dataset sizes, train/test split details, statistical significance tests, or full baseline descriptions, leaving the empirical claims difficult to evaluate or reproduce from the provided text.
minor comments (2)
  1. The SQM architecture is described at a high level; adding a diagram of the metadata-retrieval and syntax-constraint pipeline or pseudocode for the prompting steps would improve clarity.
  2. Consider expanding the discussion of limitations to address generalization to novel attack techniques and heterogeneous SIEM platforms beyond the evaluated logs.
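
To illustrate why major comment 2 asks for dataset sizes, the sketch below computes a Wilson score interval for the reported 82.8% accuracy at two test-set sizes. Both sizes are hypothetical because the summary does not state the real one. The formula is the standard Wilson interval for a binomial proportion, not something taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical test-set sizes; the actual size is not given in this summary.
for n in (200, 5000):
    low, high = wilson_interval(0.828, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the test set grows, so the same point estimate supports a much weaker claim on a few hundred logs than on several thousand.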

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have revised the abstract to address the concerns about missing methodological details and empirical context, while preserving its conciseness. Our responses to the major comments are provided below.

point-by-point responses
  1. Referee: [Abstract] The central operational claim that the framework reduces average incident triage time from hours to under 10 minutes in production SOC environments is presented without any description of the evaluation methodology, number of incidents, before/after measurement protocol, analyst time-tracking method, controls for false-positive handling or integration overhead, or how component metrics aggregate to this figure. This is load-bearing for the paper's headline contribution.

    Authors: We agree that the abstract would be strengthened by additional context on this claim. The production deployment evaluation, including incident counts, time-tracking via SOC systems, controls for overhead, and aggregation from component metrics, is described in Section 6. We have revised the abstract to add a concise summary of the methodology and its relation to the component results.
    revision: yes

  2. Referee: [Abstract] The reported metrics (82.8% ensemble accuracy, BLEU 0.384, resolution lift to 90.0%) are given without dataset sizes, train/test split details, statistical significance tests, or full baseline descriptions, leaving the empirical claims difficult to evaluate or reproduce from the provided text.

    Authors: We acknowledge this point. Dataset sizes, splits, statistical tests, and baseline details are provided in Sections 4.1–4.3 and 5. To improve self-containment, we have updated the abstract to include key parameters such as dataset sizes and split information, along with a note on significance testing.
    revision: yes

Circularity Check

0 steps flagged

No circularity: all claims are empirical measurements on held-out data

full rationale

The paper reports component-level empirical metrics (82.8% ensemble accuracy, 0.120 FPR, BLEU 0.384/ROUGE-L 0.731, 78.3% to 90.0% resolution accuracy) evaluated on SIEM logs and synthetic tasks. The end-to-end triage time claim is presented as an operational outcome without any equations, parameter fitting, or derivations that reduce to the inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to support the core results. This is a standard empirical ML application paper whose claims stand or fall on external validation rather than internal definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can be made reliable for executable security queries through syntax constraints and retrieval, plus the empirical claim that the chosen ensemble generalizes; no explicit free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: LLMs respond reliably to syntax constraints and documentation-grounded prompts for generating executable SIEM queries
    Central to the SQM module description and performance claims.
invented entities (1)
  • SQM architecture (no independent evidence)
    purpose: Automated generation of platform-specific executable queries using syntax constraints and metadata retrieval
    Presented as a new component that doubles baseline LLM performance on query tasks.

pith-pipeline@v0.9.0 · 5575 in / 1492 out tokens · 81195 ms · 2026-05-07T09:13:36.608957+00:00 · methodology
