Pith · machine review for the scientific record

arxiv: 2604.25846 · v1 · submitted 2026-04-28 · 💻 cs.CR · cs.AI


Towards Agentic Investigation of Security Alerts


Pith reviewed 2026-05-07 15:35 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: security alerts · agentic workflow · LLM investigation · Suricata logs · alert triage · log correlation · cybersecurity automation

The pith

An agentic workflow lets LLMs investigate security alerts with higher accuracy than direct prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a workflow that augments large language models with predefined queries and constrained tools to handle the initial review of security alerts. The model first receives an overview of available log data, then selects and runs specific queries, extracts evidence from the results, and issues a verdict on whether the alert is real. A reader would care because analysts face too many alerts with too little context, forcing slow manual checks across log sources. The reported outcome is that this structured process produces significantly more accurate verdicts than the same model given the alert without the workflow. The approach therefore combines standard analyst practices with LLMs to reduce the manual load on early-stage triage.
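
To make that sequence concrete, the loop might be wired roughly as below. This is a minimal sketch under stated assumptions: the function names (`llm`, `run_sql`, `run_grep`), prompt wording, and plan format are hypothetical stand-ins, not the authors' implementation.

```python
# Sketch of an agentic alert-investigation loop: overview -> query
# selection -> evidence extraction -> verdict. All names are illustrative.

def investigate_alert(alert: dict, llm, run_sql, run_grep) -> str:
    # 1. Overview: a predefined query summarizes what log data exists.
    overview = run_sql(
        "SELECT event_type, COUNT(*) AS n FROM suricata GROUP BY event_type"
    )

    # 2. Query selection: the LLM chooses follow-up queries from the
    #    constrained tool set, given the alert and the overview.
    plan = llm(
        "You are a security analyst. Given this alert and data overview, "
        "list (one per line) the SQL SELECTs or grep patterns to run next.\n"
        f"Alert: {alert}\nOverview: {overview}"
    )

    # 3. Execution and evidence extraction: run each chosen query and keep
    #    only the raw entries the LLM judges relevant to the alert.
    evidence = []
    for step in plan.splitlines():
        result = run_grep(step[5:]) if step.startswith("grep ") else run_sql(step)
        evidence.append(
            llm("Extract only the log entries relevant to this alert:\n"
                f"Alert: {alert}\nQuery result: {result}")
        )

    # 4. Verdict: a final, constrained judgment on the alert.
    return llm(
        "Based only on the evidence below, answer with exactly one of "
        "'Malicious', 'Benign', or 'Uncertain'.\n"
        f"Alert: {alert}\nEvidence: {evidence}"
    )
```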

Core claim

The paper claims that an agentic workflow—providing an overview of log data through predefined queries, letting the LLM choose and execute specific queries via structured SQL on Suricata logs and grep-based text search, extracting raw evidence, and delivering a final verdict—enables the LLM to investigate and judge security alerts more accurately than the identical LLM used without this workflow.

What carries the argument

The agentic workflow that sequences overview queries, LLM-driven query selection, evidence extraction from results, and verdict generation under constrained tool access.
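
One plausible reading of "constrained tool access" is that the model can call only two read-only operations: a SQL SELECT over Suricata eve.json events and a grep-style search over plain-text logs. The sketch below shows how such a constraint could be enforced; the table schema, file layout, and guard rails are assumptions for illustration, not the paper's code.

```python
import json
import re
import sqlite3
from pathlib import Path

def load_suricata(eve_path: str) -> sqlite3.Connection:
    """Load line-delimited Suricata eve.json events into in-memory SQLite.
    The column set here is an assumed, minimal schema."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE suricata (timestamp TEXT, event_type TEXT, "
        "src_ip TEXT, dest_ip TEXT, raw TEXT)"
    )
    with open(eve_path) as fh:
        for line in fh:
            ev = json.loads(line)
            con.execute(
                "INSERT INTO suricata VALUES (?, ?, ?, ?, ?)",
                (ev.get("timestamp"), ev.get("event_type"),
                 ev.get("src_ip"), ev.get("dest_ip"), line.strip()),
            )
    return con

def run_sql(con: sqlite3.Connection, query: str) -> list[tuple]:
    """Execute a single read-only SELECT; anything else is refused."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return con.execute(query).fetchall()

def run_grep(pattern: str, log_dir: str, max_lines: int = 200) -> list[str]:
    """grep-style regex search over plain-text logs, truncated so a huge
    result set cannot flood the model's context."""
    hits: list[str] = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        for line in path.read_text(errors="ignore").splitlines():
            if re.search(pattern, line):
                hits.append(f"{path.name}: {line}")
                if len(hits) >= max_lines:
                    return hits
    return hits
```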

If this is right

  • The workflow automates the first stages of alert investigation by correlating information across log sources.
  • Final verdicts on alerts reach higher accuracy when the LLM follows the structured sequence instead of answering directly.
  • Manual workload decreases because the model handles planning and evidence gathering for routine alerts.
  • LLMs can act as virtual security analysts when combined with existing investigation practices and constrained tools.
  • High volumes of alerts become more manageable once initial triage is offloaded to the agentic process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Constraining the LLM to a small set of tools and queries may reduce the chance of fabricating evidence not present in the logs.
  • The same structure could be applied to other detection systems or log formats once suitable overview queries are defined.
  • In operational settings the workflow might serve as a filter that flags only uncertain cases for human review.
  • A direct test would measure performance drop when alerts require correlation with external threat intelligence not reachable by the current tools.

Load-bearing premise

The predefined queries and limited SQL-plus-grep tools supply enough context for the LLM to choose the right steps and reach correct verdicts without overlooking key evidence or hallucinating details.
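
For a sense of what "predefined queries" could look like in practice, here are three illustrative overview queries against the assumed `suricata` table from the earlier sketch; the paper's actual query set is not reproduced here.

```python
# Hypothetical overview queries: cheap, read-only summaries meant to give
# the model enough context to plan without dumping raw logs into the prompt.
# json_extract assumes SQLite's JSON1 functions are available.
OVERVIEW_QUERIES = {
    "event_types": (
        "SELECT event_type, COUNT(*) AS n "
        "FROM suricata GROUP BY event_type ORDER BY n DESC"
    ),
    "top_talkers": (
        "SELECT src_ip, dest_ip, COUNT(*) AS n "
        "FROM suricata GROUP BY src_ip, dest_ip ORDER BY n DESC LIMIT 20"
    ),
    "alert_signatures": (
        "SELECT json_extract(raw, '$.alert.signature') AS signature, COUNT(*) AS n "
        "FROM suricata WHERE event_type = 'alert' "
        "GROUP BY signature ORDER BY n DESC LIMIT 20"
    ),
}
```

The premise fails exactly when none of the predefined summaries surfaces the evidence a given alert actually turns on, which is what the next section probes.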

What would settle it

Testing the workflow on alerts whose decisive evidence requires either a query outside the predefined set or data sources beyond Suricata logs, then checking whether verdict accuracy falls to or below the direct-LLM baseline.
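
A sketch of that settling experiment, using the paper's reported scoring (each condition run repeatedly per alert, with accuracy the share of runs whose verdict matches the ground-truth label). The callables `workflow_verdict`, `direct_llm_verdict`, and `load_out_of_scope_alerts` are hypothetical placeholders.

```python
def accuracy(condition, labeled_alerts, runs: int = 100) -> float:
    """Share of runs whose verdict matches ground truth.

    labeled_alerts: iterable of (alert_dict, truth) pairs, where truth is
    'Malicious', 'Benign', or 'Uncertain'. `condition` is any callable that
    maps an alert to a verdict string.
    """
    correct = total = 0
    for alert, truth in labeled_alerts:
        for _ in range(runs):
            correct += int(condition(alert) == truth)
            total += 1
    return correct / total

# Settling experiment (sketch): alerts whose decisive evidence needs a query
# outside the predefined set, or data beyond Suricata logs.
# out_of_scope = load_out_of_scope_alerts()
# print("workflow:", accuracy(workflow_verdict, out_of_scope))
# print("baseline:", accuracy(direct_llm_verdict, out_of_scope))
```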

Figures

Figures reproduced from arXiv: 2604.25846 by Even Eilertsen, Gudmund Grov, Vasileios Mavroeidis.

Figure 1. Overall architecture of the proposed LLM-based agentic workflow; the core of the system is the iterative agentic security investigation loop.
Figure 2. Agentic security investigation loop. (a) Investigator LLM: the loop starts by sending the alert, along with the overview query, to the Investigator LLM, the first of the three LLM components, tasked with initiating the investigation and identifying relevant data for further analysis.
Figure 3. Workflow for the baseline experiment.
Original abstract

Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation across multiple log sources, a task that is usually time-consuming. In this paper, we present an experimental, agentic workflow that leverages large language models (LLMs) augmented with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate the first stages of alert investigation. The proposed workflow integrates queries to provide an overview of the available data, and LLM components that selects which queries to use based on the overview results, extracts raw evidence from the query results, and delivers a final verdict of the alert. Our results demonstrate that the LLM-powered workflow can investigate log sources, plan an investigation, and produce a final verdict that has a significantly higher accuracy than a verdict produced by the same LLM without the proposed workflow. By recognizing the inherent limitations of directly applying LLMs to high-volume and unstructured data, we propose combining existing investigation practices of real-world analysts with a structured approach to leverage LLMs as virtual security analysts, thereby assisting and reducing the manual workload.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an experimental agentic workflow that augments LLMs with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate early-stage security alert investigation. The workflow provides data overviews, lets the LLM select queries, extract evidence, and render a final verdict. The central empirical claim is that this produces significantly higher accuracy than the same LLM applied directly to the alert without the workflow.

Significance. If the accuracy gains can be isolated to the agentic components and replicated on representative datasets, the work would offer a practical contribution to reducing analyst workload in security operations centers. It usefully combines real-world analyst practices with structured LLM tooling to mitigate hallucination risks on high-volume unstructured logs. The constrained tool design is a strength that grounds the approach in existing investigation patterns.

major comments (2)
  1. [Results] Results section: the baseline comparison supplies the LLM only the raw alert (no log data), while the workflow injects structured results from the predefined SQL/grep queries. Without an ablation that runs the full set of queries non-agentically and feeds the complete results to the LLM for verdict, the accuracy improvement cannot be attributed to query selection, planning, or extraction rather than simple data injection. This is load-bearing for the headline claim.
  2. [Evaluation] Evaluation: the abstract asserts significantly higher accuracy but the manuscript supplies no quantitative results, dataset details, statistical tests, or error analysis. The central performance claim therefore cannot be verified from the given text.
minor comments (2)
  1. [Workflow Description] The description of how the LLM selects queries from the overview could include a concrete example of an overview output and the resulting query choice.
  2. [Verdict Component] Clarify the exact criteria used for the final verdict (e.g., thresholds or decision rules) to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We appreciate the recognition of the work's potential practical value in security operations centers and the strengths of the constrained tool design. We address each major comment below and commit to revisions that strengthen the empirical claims.

Point-by-point responses
  1. Referee: [Results] Results section: the baseline comparison supplies the LLM only the raw alert (no log data), while the workflow injects structured results from the predefined SQL/grep queries. Without an ablation that runs the full set of queries non-agentically and feeds the complete results to the LLM for verdict, the accuracy improvement cannot be attributed to query selection, planning, or extraction rather than simple data injection. This is load-bearing for the headline claim.

    Authors: We agree that this is a substantive concern and that the current baseline does not fully isolate the contribution of the agentic elements. The existing comparison demonstrates the value of structured data access over raw-alert prompting, but an additional control is needed to separate the effects of data injection from those of dynamic query selection, planning, and evidence extraction. In the revised manuscript we will add a non-agentic ablation that executes the complete predefined query set (all SQL and grep queries), aggregates every result, and supplies the full evidence bundle to the LLM solely for verdict generation. Accuracy, precision, and error rates from this ablation will be reported alongside the full workflow and the original baseline, allowing readers to quantify the incremental benefit attributable to the agentic components. revision: yes

  2. Referee: [Evaluation] Evaluation: the abstract asserts significantly higher accuracy but the manuscript supplies no quantitative results, dataset details, statistical tests, or error analysis. The central performance claim therefore cannot be verified from the given text.

    Authors: We acknowledge that the current version of the manuscript does not supply the quantitative detail required to verify the central claim. Although the abstract and results narrative state that accuracy is significantly higher, explicit metrics, dataset characteristics, statistical tests, and error analysis are absent. In the revision we will expand the Evaluation section to include: (1) a complete dataset description (number of alerts, alert categories, log sources, volume of Suricata events, and the process used to obtain ground-truth verdicts); (2) concrete accuracy, precision, recall, and F1 scores for both the agentic workflow and the baseline, together with confidence intervals; (3) statistical significance testing (e.g., McNemar’s test or bootstrap resampling); and (4) a qualitative and quantitative error analysis that categorizes false-positive and false-negative cases for each condition. These additions will make the performance claims verifiable. revision: yes
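
The statistical testing the rebuttal promises is simple to run once per-alert correctness is recorded for both conditions; below is a minimal sketch of an exact McNemar test on paired verdicts (the pairing and data shapes here are assumptions, not the authors' analysis).

```python
from math import comb

def mcnemar_exact(workflow_correct: list[bool], baseline_correct: list[bool]) -> float:
    """Exact two-sided McNemar test on paired per-alert outcomes.

    b = alerts the workflow gets right and the baseline gets wrong,
    c = the reverse. Under H0 the discordant pairs split 50/50, so the
    p-value is a doubled binomial tail probability, capped at 1.
    """
    b = sum(w and not r for w, r in zip(workflow_correct, baseline_correct))
    c = sum(r and not w for w, r in zip(workflow_correct, baseline_correct))
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Usage: workflow_correct and baseline_correct are per-alert booleans for
# the same alert set, in the same order.
# p = mcnemar_exact(workflow_correct, baseline_correct)
```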

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents an empirical evaluation of an LLM-augmented agentic workflow for security alert investigation, comparing verdict accuracy against a baseline LLM without the workflow. The central claim rests on experimental results from applying the workflow (overview queries, LLM-driven query selection, evidence extraction, and final verdict) to Suricata logs, rather than any mathematical derivation, fitted parameters, or self-referential definitions. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the provided text; the accuracy improvement is reported as a direct outcome of the tested implementation. The experimental design (predefined queries plus constrained tools) supplies context but does not reduce the reported result to a tautology by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard LLM capabilities, existing log-query practices, and the assumption that constrained tools suffice; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5497 in / 1023 out tokens · 43166 ms · 2026-05-07T15:35:25.232491+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    Alert fatigue in security operations centres: Research challenges and opportunities

    S. Tariq, M. Baruwal Chhetri, S. Nepal, and C. Paris, “Alert fatigue in security operations centres: Research challenges and opportunities,” ACM Computing Surveys, vol. 57, no. 9, pp. 1–38, Apr. 2025. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3723158 [Accessed: Sep. 7, 2025]

  2. [2]

    Stress, burnout, and security fatigue in cybersecurity: A human factors problem,

    C. Nobles, “Stress, burnout, and security fatigue in cybersecurity: A human factors problem,” Holistica Journal of Business and Public Administration, vol. 13, no. 1, pp. 49–72, 2022. [Online]. Available: https://sciendo.com/pdf/10.2478/hjbpa-2022-0003 [Accessed: Oct. 10, 2025]

  3. [3]

    Combating alert fatigue in the security operations centre,

    P. Kearney, M. Abdelsamea, X. Schmoor, F. Shah, and I. Vickers, “Combating alert fatigue in the security operations centre,” SSRN Electronic Journal, Nov. 15, 2023. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4633965 [Accessed: Sep. 7, 2025]

  4. [4]

    Could SOAR save skills-short SOCs?,

    R. Brewer, “Could SOAR save skills-short SOCs?,” Computer Fraud & Security, vol. 2019, no. 10, pp. 8–11, 2019. [Online]. Available: https://doi.org/10.1016/S1361-3723(19)30106-X [Accessed: Sep. 7, 2025]

  5. [5]

    Survey of intrusion detection systems: techniques, datasets and challenges,

    A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman, “Survey of intrusion detection systems: techniques, datasets and challenges,” in Cybersecurity, vol. 2, no. 1, pp. 1–22, Springer, 2019. [Online]. Available: https://cybersecurity.springeropen.com/articles/10.1186/s42400-019-0038-7

  6. [6]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Advances in Neural Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds. Red Hook, NY, USA: Curran Associates,...

  7. [7]

    Benchmarking large language models for log analysis, security, and interpretation

    E. Karlsen, X. Luo, N. Zincir-Heywood, and M. Heywood, “Benchmarking large language models for log analysis, security, and interpretation,” Journal of Network and Systems Management, vol. 32, no. 3, p. 59, 2024. [Online]. Available: https://link.springer.com/article/10.1007/s10922-024-09831-x [Accessed: Sep. 7, 2025]

  8. [8]

    Microsoft Security Copilot,

    Microsoft, “Microsoft Security Copilot,” Microsoft Security, 2025. [Online]. Available: https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-security-copilot. [Accessed: Sep. 7, 2025]

  9. [9]

    Agentic AI as a proactive cybercrime sentinel: Detecting and deterring social engineering attacks,

    A. Mohammed, “Agentic AI as a proactive cybercrime sentinel: Detecting and deterring social engineering attacks,” Journal of Data and Digital Innovation (JDDI), vol. 2, no. 2, pp. 109–117, 2025. [Online]. Available: https://datalensjourna.com/index.php/JDDI/article/view/18/13 [Accessed: Oct. 10, 2025]

  10. [10]

    LLMs in Cyber Security: Bridging practice and education

    H. F. Atlam, “LLMs in Cyber Security: Bridging practice and education,” Big Data and Cognitive Computing, vol. 9, no. 7, p. 184, 2025. [Online]. Available: https://doi.org/10.3390/bdcc9070184 [Accessed: Sep. 7, 2025]

  11. [11]

    Generative AI and large language models for cyber security: All insights you need,

    M. A. Ferrag, F. Alwahedi, A. Battah, B. Cherif, A. Mechri, and N. Tihanyi, “Generative AI and large language models for cyber security: All insights you need,” SSRN, 2024. [Online]. Available: https://ssrn.com/abstract=4853709 [Accessed: Sep. 7, 2025]

  12. [12]

    ChatGPT prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,

    J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “ChatGPT prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,” in *Generative AI for Effective Software Development*, A. Nguyen-Duc, P. Abrahamsson, and F. Khomh, Eds. Cham: Springer Nature Switzerland, 2024, pp. 71–

  13. [13]

    Continuation of reference [12]

    [Online]. Available: https://doi.org/10.1007/978-3-031-55642-5_4 [Accessed: Oct. 28, 2025]

  14. [14]

    A Survey of Context Engineering for Large Language Models

    L. Mei, J. Yao, Y. Ge, Y. Wang, B. Bi, Y. Cai, J. Liu, M. Li, Z.-Z. Li, D. Zhang, et al., “A survey of context engineering for large language models,” arXiv preprint arXiv:2507.13334, 2025. [Online]. Available: https://arxiv.org/abs/2507.13334 [Accessed: Oct. 10, 2025]

  15. [15]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversations,” in Proceedings of the First Conference on Language Modeling, 2024. [Online]. Available: https://arxiv.org/abs/2308.08155 [Accessed: Oct. 10, 2025]

  16. [16]

    Microsoft Digital Defense Report 2024,

    Microsoft, “Microsoft Digital Defense Report 2024,” Microsoft Security Insider, 2024. [Online]. Available: https://www.microsoft.com/en-us/security/security-insider/intelligence-reports/microsoft-digital-defense-report-2024 [Accessed: Oct. 10, 2025]

  17. [17]

    The CISO Report,

    Splunk, “The CISO Report,” Splunk, 2025. [Online]. Available: https://www.splunk.com/en_us/form/ciso-report/ [Accessed: Oct. 10, 2025]

  18. [18]

    Security + AI on Google Cloud,

    Google, “Security + AI on Google Cloud,” Google Cloud, 2024. [Online]. Available: https://cloud.google.com/security/ai [Accessed: Oct. 10, 2025]

  19. [19]

    AI Copilots Simplify Cybersecurity,

    Palo Alto Networks, “AI Copilots Simplify Cybersecurity,” Palo Alto Networks, 2024. [Online]. Available: https://www.paloaltonetworks.com/precision-ai-security/copilots [Accessed: Oct. 10, 2025]

  20. [20]

    Global Cybersecurity Outlook 2024,

    World Economic Forum and Accenture, “Global Cybersecurity Outlook 2024,” World Economic Forum, Jan. 2024. [Online]. Available: https://www3.weforum.org/docs/WEF_Global_Cybersecurity_Outlook_2024.pdf [Accessed: Oct. 10, 2025]

  21. [21]

    Transforming cybersecurity with agentic AI to combat emerging cyber threats,

    N. Kshetri, “Transforming cybersecurity with agentic AI to combat emerging cyber threats,” Telecommunications Policy, p. 102976, 2025. [Online]. Available: https://doi.org/10.1016/j.telpol.2025.102976 [Accessed: Oct. 10, 2025]

  22. [22]

    AIT Log Data Set V1.1

    M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner, and A. Rauber, AIT Log Data Set V1.1, Zenodo, 2020. DOI: 10.5281/zenodo.4264796. [Online]. Available: https://zenodo.org/records/4264796 [Accessed: Oct. 11, 2025]

  23. [23]

    99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms,

    B. A. Alahmadi, L. Axon, and I. Martinovic, “99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms,” in Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 2783–2800