Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

Alex Kantchelian; Aniket Anand; Daniel Fields; David Tao; Grant Ho; Kurt Thomas; Yiwei Hou

arxiv: 2606.10281 · v1 · pith:46GNITN5new · submitted 2026-06-09 · 💻 cs.CR · cs.CL

Pith reviewed 2026-06-27 13:09 UTC · model grok-4.3

classification 💻 cs.CR cs.CL

keywords LLMs · security audit logs · attack investigation · benchmark dataset · incident response · log analysis · model evaluation · security operations

The pith

AuditBench evaluates five frontier LLMs on four security log investigation tasks across more than 50 scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AuditBench, a dataset of system audit logs from Linux and Windows covering over 50 security scenarios with both malicious and benign activity. It evaluates five frontier LLMs on four tasks that incident response teams perform, such as triaging alerts and identifying persistence mechanisms. Performance and error profiles vary with model size, data representation, prompt construction, and task type. The work also characterizes the quality of explanations produced by the models and the specific errors they make. This establishes a basis for assessing LLM use in security log analysis and points to directions for improvement.

Core claim

AuditBench consists of system audit logs collected from Linux and Windows machines and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using this benchmark, five frontier LLMs are evaluated on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. The analysis shows how LLM performance and error profiles vary with design choices such as model size, data representation, prompt construction, and the specific investigation task, while also characterizing the quality of the explanations the models produce and the types of errors they make.

What carries the argument

AuditBench, a benchmark dataset of Linux and Windows system audit logs spanning 50+ security scenarios for testing LLMs on four incident-response tasks.

If this is right

LLM performance on security log analysis depends on choices like model size, data representation, and prompt construction.
Models produce explanations of varying quality and commit distinct error types that differ by task.
Practitioners gain concrete guidance on when and how to apply LLMs in security operations.
Future work can target the specific weaknesses identified in current model behavior on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmarks of this form could be extended to other log types such as network or application logs to test broader LLM utility.
Error profiles could guide the design of hybrid systems that combine LLMs with rule-based detectors for better reliability.
Longitudinal testing on evolving attack techniques would reveal whether current performance holds as threats change.

Load-bearing premise

The 50+ scenarios and four tasks in AuditBench are representative of the actual challenges and data distributions that incident response teams encounter in production environments.

What would settle it

Measure how often the same LLMs reach correct conclusions on a fresh collection of audit logs drawn directly from real production security incidents, compared against expert human analysts.
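One way to make this test concrete is sketched below in Python. It is an illustration rather than the paper's protocol: the `Verdict` record, its field names, and the label strings are hypothetical, and the sketch assumes every fresh incident has a confirmed outcome to score against. It reports accuracy for each rater and Cohen's kappa, a chance-corrected measure of how often the LLM and the analyst agree with each other.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One incident with its confirmed outcome and two competing verdicts."""
    incident_id: str
    truth: str    # confirmed outcome, e.g. "malicious" or "benign"
    llm: str      # verdict produced by the model
    analyst: str  # verdict produced by a human expert


def accuracy(verdicts, who):
    """Fraction of incidents where `who` ("llm" or "analyst") matches the truth."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    hits = sum(1 for v in verdicts if getattr(v, who) == v.truth)
    return hits / len(verdicts)


def cohen_kappa(verdicts):
    """Chance-corrected agreement between the LLM and analyst labels."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    n = len(verdicts)
    labels = {v.llm for v in verdicts} | {v.analyst for v in verdicts}
    observed = sum(1 for v in verdicts if v.llm == v.analyst) / n
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(
        (sum(1 for v in verdicts if v.llm == label) / n)
        * (sum(1 for v in verdicts if v.analyst == label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1.0 - expected)


# Made-up sample data, for illustration only.
sample = [
    Verdict("inc-1", "malicious", "malicious", "malicious"),
    Verdict("inc-2", "benign", "malicious", "benign"),
    Verdict("inc-3", "malicious", "malicious", "malicious"),
    Verdict("inc-4", "benign", "benign", "benign"),
]
print(accuracy(sample, "llm"), accuracy(sample, "analyst"), cohen_kappa(sample))
```

Reporting kappa alongside raw accuracy matters when one class dominates, since two raters who both label nearly everything benign will agree often by chance alone.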

Figures

Figures reproduced from arXiv: 2606.10281 by Alex Kantchelian, Aniket Anand, Daniel Fields, David Tao, Grant Ho, Kurt Thomas, Yiwei Hou.

**Figure 1.** Performance of the 5 different LLMs across the benchmark’s tasks (§ 5.1); each point represents one LLM on a task, where some …

**Figure 2.** The number of distinct false positives (y-axis) that exactly N …

Original abstract

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AuditBench supplies a new public dataset and task suite for LLM log investigation, but its scenarios lack any reported check against real incident-response distributions.

The paper's main contribution is AuditBench itself: a collection of Linux and Windows audit logs across more than 50 scenarios, paired with four standard investigation tasks such as alert triage and persistence detection. That combination has not appeared in prior work, so the dataset and evaluation framework are genuinely new.

The evaluation runs five frontier models, varies prompt style and data representation, and reports error types plus explanation quality. Those breakdowns are the parts that could actually help practitioners decide when an LLM is worth trying in an SOC workflow.

The soft spot is representativeness. The abstract and stress-test note give no quantitative comparison of event distributions, alert density, or persistence mechanisms against production incident corpora, nor any expert validation step. Without that, the performance numbers and error profiles remain tied to the authors' constructed scenarios rather than to the noisier, higher-volume logs teams actually face.

The work is aimed at researchers who build or evaluate AI tools for security operations. It is narrow but concrete, and the systematic error analysis is honest enough to be worth referee time. I would send it out for review rather than desk-reject, with the expectation that the authors will need to address the realism gap in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces AuditBench, a benchmark of system audit logs from Linux and Windows machines spanning over 50 security investigation scenarios (malicious and benign), and uses it to evaluate five frontier LLMs on four incident-response tasks (alert triaging to persistence-mechanism identification). It analyzes how performance and error profiles vary with model size, data representation, prompt construction, and task, characterizes explanation quality, and claims to supply a foundation for LLM assessment in security logs plus practitioner insights.

Significance. If the scenarios prove representative of production IR distributions and ground truth is rigorously established, the benchmark and error analysis would supply a useful empirical foundation for LLM evaluation in security operations and concrete directions for prompt/model improvements. The current lack of validation evidence against real corpora substantially reduces the strength of those claims.

major comments (2)

[Abstract; benchmark design paragraph] Abstract and benchmark-design section: the central claim that AuditBench supplies 'a foundation for assessing the capabilities of LLMs for investigating security logs' and 'novel insights for practitioners' rests on the 50+ scenarios being representative of real incident-response data distributions. No quantitative comparison (event-type histograms, alert density, persistence-mechanism prevalence, etc.) to any production IR corpus, nor expert validation of realism, is described; without this the reported performance numbers and error profiles cannot be assumed to generalize.
[Abstract] Abstract: the description of the evaluation supplies no information on how the 50+ scenarios were validated, how ground truth labels were established, what statistical controls or inter-annotator agreement measures were applied, or any quantitative performance numbers. This omission makes the data-to-claim link unverifiable from the provided text.
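The kind of quantitative comparison requested in the first comment could be as simple as the following Python sketch, which scores how far a benchmark's event-type histogram sits from a production one using Jensen-Shannon divergence. The choice of metric, the function names, and the example event types are assumptions made for illustration; the paper does not report using this measure.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn raw event counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram is empty")
    return {event: n / total for event, n in counts.items()}


def js_divergence(bench_counts, prod_counts):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    p = normalize(bench_counts)
    q = normalize(prod_counts)
    events = set(p) | set(q)
    # Mixture distribution that both histograms are compared against.
    m = {e: 0.5 * (p.get(e, 0.0) + q.get(e, 0.0)) for e in events}

    def kl(a):
        # Kullback-Leibler divergence of `a` from the mixture `m`;
        # terms with zero probability contribute nothing.
        return sum(
            a[e] * math.log2(a[e] / m[e]) for e in events if a.get(e, 0.0) > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical event-type counts, for illustration only.
benchmark = Counter({"process_create": 400, "file_write": 300, "net_connect": 300})
production = Counter({"process_create": 700, "file_write": 250, "net_connect": 50})
print(round(js_divergence(benchmark, production), 4))
```

A value near 0 would suggest the benchmark's event mix resembles the production corpus on this one axis; it says nothing about alert density or attack realism, which would need their own checks.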

minor comments (1)

[Abstract] Abstract states the benchmark 'spans over 50 different security investigation scenarios' but does not indicate the exact count, task distribution, or balance between malicious and benign cases; adding these figures would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the benchmark's validation and the abstract's content. We address the major comments point by point below.

Referee: [Abstract; benchmark design paragraph] Abstract and benchmark-design section: the central claim that AuditBench supplies 'a foundation for assessing the capabilities of LLMs for investigating security logs' and 'novel insights for practitioners' rests on the 50+ scenarios being representative of real incident-response data distributions. No quantitative comparison (event-type histograms, alert density, persistence-mechanism prevalence, etc.) to any production IR corpus, nor expert validation of realism, is described; without this the reported performance numbers and error profiles cannot be assumed to generalize.

Authors: We agree that the absence of quantitative comparisons to production corpora limits the strength of generalizability claims. Production security logs are typically not publicly shareable due to privacy and operational security reasons. The scenarios in AuditBench were constructed based on publicly documented attack techniques from sources such as MITRE ATT&CK and common system behaviors, with the goal of covering a diverse set of investigation tasks. We will revise the benchmark design section to provide additional details on the scenario generation process and any internal validation steps performed. However, direct quantitative matching to real corpora is not feasible without access to such data.

revision: partial

Referee: [Abstract] Abstract: the description of the evaluation supplies no information on how the 50+ scenarios were validated, how ground truth labels were established, what statistical controls or inter-annotator agreement measures were applied, or any quantitative performance numbers. This omission makes the data-to-claim link unverifiable from the provided text.

Authors: The abstract is intended as a concise summary and does not include full methodological details, which are elaborated in the body of the paper. Ground truth labels were established by the authors through controlled execution of known malicious and benign activities in isolated environments, ensuring deterministic labels based on the scenario definitions. No inter-annotator agreement was required as the labels derive from the experimental setup rather than subjective judgment. Quantitative performance numbers are reported in the evaluation section. We will update the abstract to briefly reference the scenario validation approach and include example performance metrics to better link data to claims.

revision: yes
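When labels come from the scenario definitions, as the response above describes, scoring a model reduces to comparing its verdicts against a fixed answer key. A minimal Python sketch follows, assuming binary `malicious`/`benign` verdicts; the function name and sample data are hypothetical, and the paper's own scoring code is not shown here.

```python
def precision_recall_f1(truth, predicted, positive="malicious"):
    """Score predicted labels against a deterministic answer key."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up answer key and model verdicts, for illustration only.
answer_key = ["malicious", "malicious", "benign", "benign", "malicious"]
model_out = ["malicious", "benign", "malicious", "benign", "malicious"]
print(precision_recall_f1(answer_key, model_out))
```

Keeping precision and recall separate, rather than reporting only F1, shows whether a model errs toward false alarms or toward missed attacks, which carry different costs in triage.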

standing simulated objections not resolved

Direct quantitative comparison of AuditBench scenarios to proprietary production IR corpora remains unaddressed, because such datasets are not publicly accessible.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent dataset and measurements

The paper constructs a new benchmark (AuditBench) from Linux/Windows audit logs across 50+ scenarios and measures LLM performance on four explicit tasks. No equations, fitted parameters, derivations, or self-citation chains appear in the abstract or described structure. Central claims rest on direct empirical results from the new data rather than reducing to prior inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper; the abstract introduces no mathematical derivations, fitted constants, or new postulated entities. The central claim rests on the unstated premise that the chosen scenarios and tasks are representative, but that premise is not formalized as an axiom or free parameter in the provided text.


[1] [1]

Siraaj Akhtar, Saad Khan, and Simon Parkinson. 2025. LLM-based event log analysis techniques: A survey.arXiv preprint arXiv:2502.00677(2025)

arXiv 2025

[2] [2]

Bushra A Alahmadi, Louise Axon, and Ivan Martinovic. 2022. 99% false positives: A qualitative study of {SOC} analysts’ perspectives on security alarms. In31st USENIX Security Symposium (USENIX Security’22). 2783–2800

2022

[3] [3]

Bushra A Alahmadi, Louise Axon, and Ivan Martinovic. 2022. 99% false positives: A qualitative study of {SOC} analysts’ perspectives on security alarms. In31st USENIX Security Symposium (USENIX Security 22). 2783–2800

2022

[4] [4]

Crispin Almodovar, Fariza Sabrina, Sarvnaz Karimi, and Salahuddin Azad. 2022. Can language models help in system security? investigating log anomaly detec- tion using bert. InProceedings of the 20th Annual Workshop of the Australasian Language Technology Association. 139–147

2022

[5] [5]

Anthropic. 2025. Effective Context Engineering for AI Agents. https://www.anth ropic.com/engineering/effective-context-engineering-for-ai-agents. Accessed: November 1, 2025

2025

[6] [6]

Sidahmed Benabderrahmane, Petko Valtchev, James Cheney, and Talal Rahwan

[7] [7]

In2025 13th International Symposium on Digital Forensics and Security (ISDFS)

APT-LLM: Embedding-Based Anomaly Detection of Cyber Advanced Persistent Threats Using Large Language Models. In2025 13th International Symposium on Digital Forensics and Security (ISDFS). IEEE, 1–6

[8] [8]

Tristan Bilot, Baoxiang Jiang, Zefeng Li, Nour El Madhoun, Khaldoun Al Agha, Anis Zouaoui, and Thomas Pasquier. 2025. Sometimes Simpler is Better: A Comprehensive Analysis of {State-of-the-Art} {Provenance-Based} Intrusion Detection Systems. In34th USENIX Security Symposium (USENIX Security 25). 7193–7212

2025

[9] [9]

CrowdStrike Blog. 2023. Introducing Charlotte AI, CrowdStrike’s Generative AI Security Analyst: Ushering in the Future of AI-Powered Cybersecurity. https: //www.crowdstrike.com/en-us/blog/crowdstrike-introduces-charlotte-ai-to- deliver-generative-ai-powered-cybersecurity

2023

[10] [10]

Red Canary. 2025. Atomic Red Team. https://www.atomicredteam.io/. [Online; accessed 28-April-2025]

2025

[11] [11]

Zijun Cheng, Qiujian Lv, Jinyuan Liang, Yan Wang, Degang Sun, Thomas Pasquier, and Xueyuan Han. 2024. Kairos: Practical intrusion detection and investigation using whole-system provenance. In2024 IEEE Symposium on Secu- rity and Privacy (SP). IEEE, 3533–3551

2024

[12] [12]

CMU Information Security Office. 2025. CMU Incident Response Procedure. https://www.cmu.edu/iso/governance/procedures/IRPlan.html

2025

[13] [13]

MITRE Corporation. 2025. MITRE ATT&CK. https://attack.mitre.org/

2025

[14] [14]

Tianyu Cui, Shiyu Ma, Ziang Chen, Tong Xiao, Chenyu Zhao, Shimin Tao, Yilun Liu, Shenglin Zhang, Duoming Lin, Changchang Liu, et al . 2025. LogEval: A comprehensive benchmark suite for LLMs in log analysis.Empirical Software Engineering30, 6 (2025), 173

2025

[15]

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The pascal recognising textual entailment challenge. In Machine learning challenges workshop. Springer, 177–190

[16]

DARPA / Five Directions. 2020. Operationally Transparent Cyber (OpTC) Data Release. https://github.com/FiveDirections/OpTC-data

[17]

DARPA I2O. 2020. DARPA Transparent Computing Data Release. https://github.com/darpa-i2o/Transparent-Computing

[18]

Roman Daszczyszak, Dan Ellis, Steve Luke, and Sean Whitley. 2019. TTP-based hunting. MITRE Corp, McLean VA, Tech. Rep (2019)

[19]

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. 2024. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24). 847–864

[20]

Feng Dong, Shaofei Li, Peng Jiang, Ding Li, Haoyu Wang, Liangyi Huang, Xusheng Xiao, Jiedong Chen, Xiapu Luo, Yao Guo, et al. 2023. Are we there yet? an industrial viewpoint on provenance-based endpoint detection and response tools. In ACM Conference on Computer and Communications Security

[21]

Ashish Gehani and Dawood Tariq. 2012. SPADE: support for provenance auditing in distributed environments. In International Middleware Conference

[22]

Google Developers. 2025. Prompting Strategies. https://ai.google.dev/gemini-api/docs/prompting-strategies. Accessed: November 1, 2025

[23]

Wei Guan, Jian Cao, Shiyou Qian, Jianqi Gao, and Chun Ouyang. 2024. Logllm: Log-based anomaly detection using large language models. arXiv preprint arXiv:2411.08561 (2024)

[24]

Haixuan Guo, Shuhan Yuan, and Xintao Wu. 2021. Logbert: Log anomaly detection via bert. In 2021 international joint conference on neural networks (IJCNN). IEEE, 1–8

[25]

Xueyuan Han, Thomas Pasquier, Adam Bates, James Mickens, and Margo Seltzer. 2020. Unicorn: Runtime provenance-based detector for advanced persistent threats. In Network and Distributed System Security Symposium (NDSS’20). 1–19

[27]

Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, and Sadid Hasan. 2024. Does prompt formatting have any impact on llm performance? arXiv preprint arXiv:2411.10541 (2024)

[28]

IBM. 2025. IBM QRadar. https://www.ibm.com/products/qradar

[29]

Muhammad Adil Inam, Yinfang Chen, Akul Goyal, Jason Liu, Jaron Mink, Noor Michael, Sneha Gaur, Adam Bates, and Wajih Ul Hassan. 2023. SoK: History is a Vast Early Warning System: Auditing the Provenance of System Intrusions. In IEEE Symposium on Security and Privacy (SP)

[30]

Zian Jia, Yun Xiong, Yuhong Nan, Yao Zhang, Jinjing Zhao, and Mi Wen. 2024. MAGIC: Detecting advanced persistent threats via masked graph representation learning. In 33rd USENIX Security Symposium (USENIX Security 24). 5197–5214

[31]

Gavin Jones, Dimitrios Kasimatis, Nikolaos Pitropakis, Richard Macfarlane, and William J Buchanan. 2025. Analysing the role of LLMs in cybersecurity incident management. International Journal of Information Security 24, 6 (2025), 1–14

[32]

Diana Kramer, Lambert Rosique, Ajay Narotam, Elie Bursztein, Patrick Gage Kelley, Kurt Thomas, and Allison Woodruff. 2025. Integrating large language models into security incident response. In Twenty-First Symposium on Usable Privacy and Security (SOUPS 2025). 133–148

[33]

Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. arXiv preprint arXiv:2506.11791 (2025)

[34]

Chenyu Li, Zhengjia Zhu, Jiyan He, and Xiu Zhang. 2025. RedChronos: A Large Language Model-Based Log Analysis System for Insider Threat Detection in Enterprises. arXiv preprint arXiv:2503.02702 (2025)

[35]

Lei Lin, Jiayi Fu, Pengli Liu, Qingyang Li, Yan Gong, Junchen Wan, Fuzheng Zhang, Zhongyuan Wang, Di Zhang, and Kun Gai. 2023. Just ask one more time! self-agreement improves reasoning of language models in (almost) all scenarios. arXiv preprint arXiv:2311.08154 (2023)

[36]

Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yuhang Chen, Yanqing Zhao, Hao Yang, and Yanfei Jiang. 2024. Interpretable online log analysis using large language models with prompt strategies. In Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 35–46

[37]

Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Chen, and Shaowei Wang. 2024. Llmparser: An exploratory study on using large language models for log parsing. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

[39]

Ashish Bharadwaj Madabhushana. 2021. Configure Linux system auditing with auditd. https://www.redhat.com/en/blog/configure-linux-auditing-auditd

[40]

Microsoft. 2025. Process Monitor v4.01. https://learn.microsoft.com/en-us/sysinternals/downloads/procmon

[41]

Microsoft Learn. 2025. Prompt engineering - Azure AI Foundry. https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/prompt-engineering. Accessed: November 1, 2025

[42]

MITRE ATT&CK. 2025. eCAR - extended Cyber Analytics Repository Model. https://github.com/FiveDirections/OpTC-data/blob/master/ecar.md. [Online; accessed 28-April-2025]

[43]

MITRE ATT&CK. 2025. MITRE Cyber Analytics Repository. https://car.mitre.org/. [Online; accessed 28-April-2025]

[44]

Alex Nelson, Sanjay Rekhi, Murugiah Souppaya, and Karen Scarfone. 2025. Incident Response Recommendations and Considerations for Cybersecurity Risk Management. NIST SP 800-61r3 (2025)

[45]

NVIDIA. 2025. What is Chain of Thought (CoT) Prompting? https://www.nvidia.com/en-us/glossary/cot-prompting/

[46]

Thomas Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David Eyers, Margo Seltzer, and Jean Bacon. 2017. Practical whole-system provenance capture. In Symposium on Cloud Computing

[47]

Mati Ur Rehman, Hadi Ahmadi, and Wajih Ul Hassan. 2024. Flash: A comprehensive approach to intrusion detection via provenance graph representation learning. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 3552–3570

[48]

Microsoft Security. 2025. Microsoft Security Copilot. https://www.microsoft.com/en-us/security/business/ai-machine-learning/microsoft-security-copilot

[49]

Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, et al. 2024. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. Advances in Neural Information Processing Systems 37 (2024), 57472–57498

[50]

Chengyu Song, Linru Ma, Jianming Zheng, Jinzhi Liao, Hongyu Kuang, and Lin Yang. 2024. Audit-llm: Multi-agent collaboration for log-based insider threat detection. arXiv preprint arXiv:2408.08902 (2024)

[51]

U.S. CISA. 2025. CISA Cybersecurity Incident Response. https://www.cisa.gov/topics/cybersecurity-best-practices/organizations-and-cyber-safety/cybersecurity-incident-response

[52]

U.S. Cybersecurity & Infrastructure Security Agency. 2021. Cybersecurity Incident & Vulnerability Response Playbooks. https://www.cisa.gov/resources-tools/resources/federal-government-cybersecurity-incident-and-vulnerability-response-playbooks

[53]

U.S. NIST. 2021. Cybersecurity Framework 1.1. https://www.nist.gov/cyberframework/getting-started/online-learning/five-functions

[54]

U.S. NIST. 2025. Special Publication (SP) 800-61 Revision 3. https://csrc.nist.gov/projects/incident-response

[55]

Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, and Martin Vechev. 2025. BaxBench: Can LLMs generate correct and secure backends? arXiv preprint arXiv:2502.11844 (2025)

[56]

Manfred Vielberth, Fabian Böhm, Ines Fichtinger, and Günther Pernul. 2020. Security Operations Center: A Systematic Study and Open Challenges. IEEE Access 8 (2020)

[57]

Dawei Wang, Geng Zhou, Xianglong Li, Yu Bai, Li Chen, Ting Qin, Jian Sun, and Dan Li. 2025. The Digital Cybersecurity Expert: How Far Have We Come? In 2025 IEEE Symposium on Security and Privacy (S&P’25). IEEE, 3273–3290

[58]

Lingzhi Wang, Xiangmin Shen, Weijian Li, Zhenyuan Li, R Sekar, Han Liu, and Yan Chen. 2025. Incorporating Gradients to Rules: Towards Lightweight, Adaptive Provenance-based Intrusion Detection. (2025)

[59]

Su Wang, Zhiliang Wang, Tao Zhou, Hongbin Sun, Xia Yin, Dongqi Han, Han Zhang, Xingang Shi, and Jiahai Yang. 2022. Threatrace: Detecting and tracing host-based threats in node level through provenance graph learning. IEEE Transactions on Information Forensics and Security 17 (2022), 3972–3987

[60]

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. 2025. CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale. arXiv preprint arXiv:2506.02548 (2025)

[61]

Wikipedia. 2025. F-score. https://en.wikipedia.org/wiki/F-score

[62]

Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, et al. 2025. Excytin-bench: Evaluating llm agents on cyber threat investigation. arXiv preprint arXiv:2507.14201 (2025)

[63]

Limin Yang, Zhi Chen, Chenkai Wang, Zhenning Zhang, Sushruth Booma, Phuong Cao, Constantin Adam, Alexander Withers, Zbigniew Kalbarczyk, Ravishankar K. Iyer, and Gang Wang. 2024. True Attacks, Attack Attempts, or Benign Triggers? An Empirical Measurement of Network Alerts in a Security Operations Center. In USENIX Security Symposium (USENIX Security’24)

[64]

Andy K Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, et al. 2025. BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems. arXiv preprint arXiv:2505.15216 (2025)

[65]

Andy K Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Julian Jasper, et al. 2025. Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. In The Thirteenth International Conference on Learning Representations (ICLR’25)

[67]

Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. 2025. When llms meet cybersecurity: A systematic literature review. Cybersecurity 8, 1 (2025), 55

[68]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623

[69]

Aoxiao Zhong, Dengyao Mo, Guiyang Liu, Jinbu Liu, Qingda Lu, Qi Zhou, Jiesheng Wu, Quanzheng Li, and Qingsong Wen. 2024. Logparser-llm: Advancing efficient log parsing with large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4559–4570

[70]

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, et al. 2025. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities. In Forty-second International Conference on Machine Learning (ICML’25)

Open Science

We will release all code ...