arxiv: 2606.10281 · v1 · pith:46GNITN5new · submitted 2026-06-09 · 💻 cs.CR · cs.CL

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

Pith reviewed 2026-06-27 13:09 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords: LLMs · security audit logs · attack investigation · benchmark dataset · incident response · log analysis · model evaluation · security operations

The pith

AuditBench evaluates five frontier LLMs on four security log investigation tasks across more than 50 scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AuditBench, a dataset of system audit logs from Linux and Windows covering over 50 security scenarios with both malicious and benign activity. It evaluates five frontier LLMs on four tasks that incident response teams perform, such as triaging alerts and identifying persistence mechanisms. Performance and error profiles vary with model size, data representation, prompt construction, and task type. The work also characterizes the quality of explanations produced by the models and the specific errors they make. This establishes a basis for assessing LLM use in security log analysis and points to directions for improvement.

Core claim

AuditBench consists of system audit logs collected from Linux and Windows machines and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using this benchmark, five frontier LLMs are evaluated on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. The analysis shows how LLM performance and error profiles vary with design choices such as model size, data representation, prompt construction, and the specific investigation task, while also characterizing the quality of the explanations the models produce and the types of errors they make.

What carries the argument

AuditBench, a benchmark dataset of Linux and Windows system audit logs spanning 50+ security scenarios for testing LLMs on four incident-response tasks.

If this is right

  • LLM performance on security log analysis depends on choices like model size, data representation, and prompt construction.
  • Models produce explanations of varying quality and commit distinct error types that differ by task.
  • Practitioners gain concrete guidance on when and how to apply LLMs in security operations.
  • Future work can target the specific weaknesses identified in current model behavior on these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks of this form could be extended to other log types such as network or application logs to test broader LLM utility.
  • Error profiles could guide the design of hybrid systems that combine LLMs with rule-based detectors for better reliability.
  • Longitudinal testing on evolving attack techniques would reveal whether current performance holds as threats change.

Load-bearing premise

The 50+ scenarios and four tasks in AuditBench are representative of the actual challenges and data distributions that incident response teams encounter in production environments.

What would settle it

Measure how often the same LLMs reach correct conclusions on a fresh collection of audit logs drawn directly from real production security incidents, compared against expert human analysts.
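
One way to frame that comparison, as a minimal sketch: score the model and the analysts against the same ground truth, then count the scenarios where only one of them is correct. The function name `paired_comparison` and the verdict labels are hypothetical illustrations, not taken from the paper, and the toy verdicts are placeholders rather than results.

```python
from typing import Sequence


def paired_comparison(
    truth: Sequence[str],
    llm: Sequence[str],
    analyst: Sequence[str],
) -> dict:
    """Compare two raters on the same scenarios against ground truth.

    All names and labels are hypothetical; this is not the paper's method.
    """
    if not (len(truth) == len(llm) == len(analyst)):
        raise ValueError("all verdict lists must cover the same scenarios")
    if not truth:
        raise ValueError("need at least one scenario")

    llm_correct = [p == t for p, t in zip(llm, truth)]
    analyst_correct = [p == t for p, t in zip(analyst, truth)]
    n = len(truth)
    pairs = list(zip(llm_correct, analyst_correct))
    return {
        "llm_accuracy": sum(llm_correct) / n,
        "analyst_accuracy": sum(analyst_correct) / n,
        # Discordant pairs: scenarios that only one of the two raters got right.
        "only_llm_correct": sum(l and not a for l, a in pairs),
        "only_analyst_correct": sum(a and not l for l, a in pairs),
    }


# Placeholder verdicts for four imaginary scenarios (not real data).
truth = ["malicious", "benign", "malicious", "benign"]
llm = ["malicious", "malicious", "malicious", "benign"]
analyst = ["malicious", "benign", "benign", "benign"]
result = paired_comparison(truth, llm, analyst)
```

Reporting the discordant counts alongside the two accuracies shows whether the model and the analysts fail on the same scenarios or on different ones, which a single accuracy figure would hide.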

Figures

Figures reproduced from arXiv: 2606.10281 by Alex Kantchelian, Aniket Anand, Daniel Fields, David Tao, Grant Ho, Kurt Thomas, Yiwei Hou.

Figure 1: Performance of the 5 different LLMs across the benchmark’s tasks (§ 5.1); each point represents one LLM on a task, where some …
Figure 2: The number of distinct false positives (y-axis) that exactly N …

Original abstract

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AuditBench, a benchmark of system audit logs from Linux and Windows machines spanning over 50 security investigation scenarios (malicious and benign), and uses it to evaluate five frontier LLMs on four incident-response tasks (alert triaging to persistence-mechanism identification). It analyzes how performance and error profiles vary with model size, data representation, prompt construction, and task, characterizes explanation quality, and claims to supply a foundation for LLM assessment in security logs plus practitioner insights.

Significance. If the scenarios prove representative of production IR distributions and ground truth is rigorously established, the benchmark and error analysis would supply a useful empirical foundation for LLM evaluation in security operations and concrete directions for prompt/model improvements. The current lack of validation evidence against real corpora substantially reduces the strength of those claims.

major comments (2)
  1. [Abstract; benchmark design paragraph] Abstract and benchmark-design section: the central claim that AuditBench supplies 'a foundation for assessing the capabilities of LLMs for investigating security logs' and 'novel insights for practitioners' rests on the 50+ scenarios being representative of real incident-response data distributions. No quantitative comparison (event-type histograms, alert density, persistence-mechanism prevalence, etc.) to any production IR corpus, nor expert validation of realism, is described; without this the reported performance numbers and error profiles cannot be assumed to generalize.
  2. [Abstract] Abstract: the description of the evaluation supplies no information on how the 50+ scenarios were validated, how ground truth labels were established, what statistical controls or inter-annotator agreement measures were applied, or any quantitative performance numbers. This omission makes the data-to-claim link unverifiable from the provided text.
minor comments (1)
  1. [Abstract] Abstract states the benchmark 'spans over 50 different security investigation scenarios' but does not indicate the exact count, task distribution, or balance between malicious and benign cases; adding these figures would improve reproducibility.
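
The event-type histogram comparison requested in major comment 1 could be sketched as follows, assuming a reference corpus were available. Total variation distance is one possible choice of measure, not something the paper or the report specifies. The function names and the event-type strings are hypothetical, and the two short event lists are placeholders rather than real log data.

```python
from collections import Counter
from typing import Dict, Iterable


def event_type_distribution(event_types: Iterable[str]) -> Dict[str, float]:
    """Turn a stream of event-type names into relative frequencies."""
    counts = Counter(event_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events supplied")
    return {name: count / total for name, count in counts.items()}


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder event streams (hypothetical event-type names, not real logs).
benchmark_events = ["process_start", "file_write", "file_write", "network_connect"]
reference_events = ["process_start", "process_start", "file_write", "network_connect"]

tvd = total_variation_distance(
    event_type_distribution(benchmark_events),
    event_type_distribution(reference_events),
)
```

A value near 0 would suggest the benchmark's event mix resembles the reference corpus. A value near 1 would suggest the two share few event types.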

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the benchmark's validation and the abstract's content. We address the major comments point by point below.

point-by-point responses
  1. Referee: [Abstract; benchmark design paragraph] Abstract and benchmark-design section: the central claim that AuditBench supplies 'a foundation for assessing the capabilities of LLMs for investigating security logs' and 'novel insights for practitioners' rests on the 50+ scenarios being representative of real incident-response data distributions. No quantitative comparison (event-type histograms, alert density, persistence-mechanism prevalence, etc.) to any production IR corpus, nor expert validation of realism, is described; without this the reported performance numbers and error profiles cannot be assumed to generalize.

    Authors: We agree that the absence of quantitative comparisons to production corpora limits the strength of generalizability claims. Production security logs are typically not publicly shareable due to privacy and operational security reasons. The scenarios in AuditBench were constructed based on publicly documented attack techniques from sources such as MITRE ATT&CK and common system behaviors, with the goal of covering a diverse set of investigation tasks. We will revise the benchmark design section to provide additional details on the scenario generation process and any internal validation steps performed. However, direct quantitative matching to real corpora is not feasible without access to such data. revision: partial

  2. Referee: [Abstract] Abstract: the description of the evaluation supplies no information on how the 50+ scenarios were validated, how ground truth labels were established, what statistical controls or inter-annotator agreement measures were applied, or any quantitative performance numbers. This omission makes the data-to-claim link unverifiable from the provided text.

    Authors: The abstract is intended as a concise summary and does not include full methodological details, which are elaborated in the body of the paper. Ground truth labels were established by the authors through controlled execution of known malicious and benign activities in isolated environments, ensuring deterministic labels based on the scenario definitions. No inter-annotator agreement was required as the labels derive from the experimental setup rather than subjective judgment. Quantitative performance numbers are reported in the evaluation section. We will update the abstract to briefly reference the scenario validation approach and include example performance metrics to better link data to claims. revision: yes
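
Because the labels follow from the scenario definitions, scoring a triage task reduces to comparing model verdicts with those labels. A minimal sketch, assuming binary malicious/benign verdicts: the function name `triage_scores`, the label strings, and the toy lists are hypothetical, and the paper's own metrics may differ.

```python
from typing import Dict, Sequence


def triage_scores(
    truth: Sequence[str],
    predicted: Sequence[str],
    positive: str = "malicious",
) -> Dict[str, float]:
    """Precision, recall, and F1 for one positive class (hypothetical helper)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Fall back to 0.0 when a denominator is zero, to avoid division errors.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder labels for five imaginary alerts (not real results).
truth = ["malicious", "malicious", "benign", "benign", "malicious"]
predicted = ["malicious", "benign", "malicious", "benign", "malicious"]
scores = triage_scores(truth, predicted)
```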

standing simulated objections not resolved
  • Direct quantitative comparison of AuditBench scenarios to proprietary production IR corpora remains infeasible because such datasets are not publicly accessible.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent dataset and measurements

full rationale

The paper constructs a new benchmark (AuditBench) from Linux/Windows audit logs across 50+ scenarios and measures LLM performance on four explicit tasks. No equations, fitted parameters, derivations, or self-citation chains appear in the abstract or described structure. Central claims rest on direct empirical results from the new data rather than reducing to prior inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper; the abstract introduces no mathematical derivations, fitted constants, or new postulated entities. The central claim rests on the unstated premise that the chosen scenarios and tasks are representative, but that premise is not formalized as an axiom or free parameter in the provided text.

