arXiv: 2507.14201 · v3 · submitted 2025-07-14 · 💻 cs.CR · cs.AI · cs.CL

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Pith reviewed 2026-05-19 04:34 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL
keywords LLM agents · cyber threat investigation · security benchmarks · investigation graphs · security logs · multi-hop reasoning · Microsoft Sentinel

The pith

A new benchmark tests LLM agents on cyber threat investigation using questions from security log graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ExCyTIn-Bench to evaluate how well LLM agents can investigate cyber threats by answering questions that require tracing evidence across multiple security logs. It builds the benchmark from logs in a controlled Azure environment covering 57 tables, first creating investigation graphs with expert detection logic and then generating 7542 questions by pairing nodes so that one supplies context and the other supplies the answer. Experiments across models show the strongest result reaches only a 0.606 reward score. A sympathetic reader would care because this setup offers a scalable way to measure progress toward agents that can assist human analysts with complex, multi-hop log analysis.

Core claim

ExCyTIn-Bench is built by applying expert-crafted detection logic to security logs from Microsoft Sentinel and related services to form threat investigation graphs, then using LLMs to generate questions from node pairs on those graphs. Each question anchors the start node as background context and the end node as the verifiable answer, producing automatic and explainable ground truth while supporting 7542 questions across 57 log tables in a reusable pipeline.

What carries the argument

Threat investigation graphs built from extracted security logs: expert detection logic supplies the edges, and LLM generation converts pairs of nodes into questions with an explicit start-context and end-answer structure.
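
To make the node-pair idea concrete, here is a minimal Python sketch. It rests on several assumptions: the toy graph, its node labels and edges are invented for illustration, pairing alert nodes only is a simplification, and the helpers `shortest_path` and `make_question_skeletons` are hypothetical names. The paper uses an LLM to write the final question text; the sketch stops at a plain record holding the start context, the answer node, and the connecting path.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy graph: keys are nodes, values are neighbors.
# Alerts link to the entities they mention (edges are illustrative only).
GRAPH = {
    "alert:malicious_url_click": ["entity:recipient", "entity:url"],
    "entity:recipient": ["alert:malicious_url_click", "alert:url_email_removed"],
    "entity:url": ["alert:malicious_url_click"],
    "alert:url_email_removed": ["entity:recipient", "entity:sender_ip"],
    "entity:sender_ip": ["alert:url_email_removed"],
}

def shortest_path(graph, start, end):
    """Breadth-first search; returns the node list from start to end, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def make_question_skeletons(graph):
    """Pair each alert with every other alert: start = context, end = answer."""
    alerts = [n for n in graph if n.startswith("alert:")]
    skeletons = []
    for start, end in permutations(alerts, 2):
        path = shortest_path(graph, start, end)
        if path is not None:
            skeletons.append({
                "context": start,        # shown to the agent as background
                "answer_node": end,      # source of the ground-truth answer
                "path": path,            # explains why the answer is correct
                "hops": len(path) - 1,
            })
    return skeletons

skeletons = make_question_skeletons(GRAPH)
for s in skeletons:
    print(s["context"], "->", s["answer_node"], "hops:", s["hops"])
```

Keeping the path alongside each pair is what would make the ground truth explainable: the chain of nodes is the evidence trail an agent is expected to rediscover from the logs.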

If this is right

  • The graph-based construction makes the benchmark reusable and readily extensible to new security logs or environments.
  • Current LLM agents still face substantial difficulty on these multi-hop tasks, as shown by the highest reward of 0.606.
  • Automatic ground truth from explicit node pairs enables scalable, explainable evaluation without manual answer curation.
  • Improved performance on the benchmark would support development of agents that can handle heterogeneous logs in threat investigations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The node-pair question generation method could transfer to other domains that require chaining evidence across structured data records.
  • Practical use would need additional validation against noisier, less controlled log sources that real organizations encounter.
  • Strong results here could shorten the initial evidence-gathering phase for analysts and free time for higher-level decisions.

Load-bearing premise

Expert-crafted detection logic plus LLM-generated questions from graph node pairs accurately capture the difficulty and structure of real-world multi-hop security log analysis by human analysts.

What would settle it

Give the same set of questions to a panel of experienced human security analysts. If their accuracy or reasoning patterns diverge sharply from what the benchmark assumes, that would indicate the questions do not reflect actual investigation difficulty.

Figures

Figures reproduced from arXiv: 2507.14201 by Anand Mudgerikar, Andrew Zhao, Julia Kiseleva, Manuel Raúl Meléndez Luján, Mauricio Velazco, Michael Albada, Qingyun Wu, Quang Nguyen, Roberto Rodriguez, Srisuma Movva, Yiran Wu, Yogesh K Roy.

Figure 1: Overview of Our Benchmark Build Workflow. 1. (Left Triangle) We collect the raw logs
Figure 2: Overview of the database. We collect a total of 57 tables. The number of columns from
Figure 3: Example Question Generation from graph. The start alert and entities will be used as
Figure 4: An example trajectory of Baseline Agent (with
Figure 5: (a) Reward vs. Number of Turns. (b) Reward vs. Cost.
Figure 6: Ablation on database setup. Time Window: We also test with a full version of the database (explained in Section 3.1). Moving from the per-incident slices to the full database lowers average reward to 0.248, which is expected since a longer horizon introduces extra noise. We note that degradation from using a longer time span is mild compared to switching the DB scope. Since the questions constructed by LLM…
Figure 7: Counts and average rewards by path length. Results of questions generated from different length (
Figure 8: Reward versus different query performance metrics and submit rate for each model.
Figure 9: Average rounds and average reward with increasing trials. Tested with base agent + GPT-4o. To further explore the limits of our test-time scaling method, we apply Best-of-N sampling to the baseline GPT-4o agent over 10 independent trials (see
Figure 10: Base Prompt for Baseline Agent. BASE_PROMPT + You should only give one thought-action per response. The action from your response will be executed and the result will be shown to you. Follow the format "Thought: ....\nAction: ...." exactly. Do not include any other information in your response. Wait for the response from one action before giving the next thought-action pair. DO NOT make assumptions about …
Figure 11: Additional prompt added when testing with
Figure 12: Five example rules extracted with Expel. An Expel consists of the base prompt, all the extracted rules, and 1 demonstration trajectory.
Figure 13: Strategy Prompt.
Figure 14: The full example of agent (with GPT-4o) solving a question.
Figure 15: ReAct Example. For react prompt, we use the base prompt + 3 examples. Here we show one of the examples used.
Figure 16: Question Generation Prompt.
Figure 17: Solution Generation Prompt.
Figure 18: Graph of Incident 5. The bigger blue nodes represent alerts, and the smaller red nodes
Figure 19: Graph of Incident 34.
Figure 20: Graph of Incident 38. ID: 0 Possible use of the Rubeus kerberoasting tool; ID: 1 Name: fk8mq; ID: 2 Sid: S-1-5-21-1540151660-3530000288-105586595-1517; ID: 3 AadUserId: 97e6a954-b6bd-48a5-808c-bd8464cce677; ID: 4 ProcessId__CreatedTimeUtc__CommandLine: 5512__2024-06-27t14:3; ID: 5 ExtractedFileName: psexesvc.exe; ID: 6 ProcessId__CreatedTimeUtc__CommandLine: 7644__2024-06-27t14:3; ID: 7 ExtractedFileName: rubeus…
Figure 21: Graph of Incident 39.
Figure 22: Graph of Incident 55. ID: 0 A potentially malicious URL click was detected; ID: 1 Recipient: raphaelt@vnevado.alpineskihouse.co; ID: 2 SenderIP: 254.241.243.229; ID: 3 Sender: alyssat@vnevado.alpineskihouse.co; ID: 4 MailboxPrimaryAddress: raphaelt@vnevado.alpineskihouse.co; ID: 5 Url: http://ms175052280.orangecliff-f53f26fd.eastus.azurecont; ID: 6 Name: Nina Park; ID: 7 Email: Nina Park@vnevado.alpineskihouse.c…
Figure 23: Graph of Incident 134.
Figure 24: Graph of Incident 166. ID: 0 A potentially malicious URL click was detected; ID: 1 Recipient: alyssat@vnevado.alpineskihouse.co; ID: 2 Email messages containing malicious URL removed after deliver; ID: 3 MailboxPrimaryAddress: alyssat@vnevado.alpineskihouse.co; ID: 4 SenderIP: 202.205.215.225; ID: 5 Sender: raphaelt@vnevado.alpineskihouse.co; ID: 6 Name: Hailey Johnson; ID: 7 Email: Hailey Johnson@vnevado.alpine…
Figure 25: Graph of Incident 322.
Figure 26: Incident 5 Report.
Figure 27: Incident 5 Report (Continued.)
Figure 28: Incident 34 Report
Figure 29: Incident 34 Report (Continued.)
Figure 30: Incident 38 Report
Figure 31: Incident 39 Report
Figure 32: Incident 39 Report (Continued.)
Figure 33: Incident 55 Report
Figure 34: Incident 55 Report (Continued.)
Figure 35: Incident 134 Report
Figure 36: Incident 134 Report (Continued.) Title of the multi-stage attack: Business Email Compromise & Data Exfiltration via Inbox Rule Manipulation and SAP Access. 1. EXECUTIVE SUMMARY: Over a 36-hour period beginning July 22, 2024, an attacker leveraged anonymous IP logons and a password-spray campaign to gain initial access to the corporate Azure AD account of “Jordan P” (laylaw@vnevado.alpineskihouse.co). Once in…
Figure 37: Incident 166 Report.
Figure 38: Incident 166 Report (Continued.)
Figure 39: Incident 322 Report
Figure 40: Incident 322 Report (Continued.)
Original abstract

We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous security logs, follow multi-hop chains of evidence to investigate threats. With the developments of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. We construct a benchmark from a controlled Azure tenant including a SQL environment covering 57 log tables from Microsoft Sentinel and related services, and 7542 generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. Our comprehensive experiments on the test set with different models confirm the difficulty of the task: the best model so far can achieve a reward of 0.606, leaving much headroom for future research. The code is available at https://github.com/microsoft/SecRL
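
As a rough illustration of how an agent might interact with a SQL log environment like the one the abstract describes, the sketch below parses one `Thought: ... Action: ...` response and runs the action as a query. Everything specific here is an assumption: an in-memory `sqlite3` database stands in for the benchmark's database, the `SecurityAlert` table layout and its rows are invented and heavily simplified, and `parse_step` and `run_action` are hypothetical helpers rather than functions from the SecRL repository.

```python
import re
import sqlite3

def parse_step(response: str):
    """Split a 'Thought: ... Action: ...' response into its two parts."""
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", response, re.DOTALL)
    if match is None:
        raise ValueError("response does not follow the Thought/Action format")
    return match.group(1).strip(), match.group(2).strip()

def run_action(conn: sqlite3.Connection, action: str, max_rows: int = 5):
    """Execute the action as SQL; return rows, or the error text as feedback."""
    try:
        return conn.execute(action).fetchmany(max_rows)
    except sqlite3.Error as exc:
        return f"SQL error: {exc}"

# Hypothetical, simplified log table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SecurityAlert (AlertName TEXT, Entity TEXT)")
conn.executemany(
    "INSERT INTO SecurityAlert VALUES (?, ?)",
    [("Suspicious URL click", "user-a"), ("Password spray", "user-b")],
)

response = (
    "Thought: Find which entity is tied to the URL click alert.\n"
    "Action: SELECT Entity FROM SecurityAlert WHERE AlertName LIKE '%URL%'"
)
thought, action = parse_step(response)
observation = run_action(conn, action)
print(observation)  # [('user-a',)]
```

Returning SQL errors as text instead of raising lets a multi-turn agent see its mistake and retry, which is one plausible design for this kind of loop; capping rows with `max_rows` keeps each observation short.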

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents ExCyTIn-Bench, the first benchmark for evaluating LLM agents on cyber threat investigation. It builds investigation graphs from expert-crafted detection logic applied to 57 Azure/Sentinel log tables in a controlled tenant, then uses LLMs to generate 7542 questions by pairing start and end nodes on the graphs (start node supplies background context; end node supplies the answer). Experiments on the test set show the best model achieves a reward of 0.606, with the authors concluding substantial headroom remains for future work. The pipeline is positioned as reusable and extensible, with code released at https://github.com/microsoft/SecRL.

Significance. If the benchmark construction faithfully reproduces the difficulty of real multi-hop security log analysis, the work supplies a valuable, grounded resource for the security and AI communities. Strengths include the use of external Azure logs and expert detection rules (reducing circularity), automatic explainable ground truth via node pairs, and an open code release that supports reproducibility and extension to new logs. The reported performance gap could usefully guide development of LLM agents for practical threat investigation tasks.

major comments (2)
  1. [abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.
  2. [experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.
minor comments (3)
  1. [abstract] Abstract: the phrasing 'Evaluate an LLM agent X' appears to contain a placeholder that should be clarified.
  2. [methods] Methods: more explicit description of the 57 log tables, the precise expert detection rules, and the LLM prompts used for question generation would improve reproducibility.
  3. [abstract and conclusion] The paper states code is available but does not enumerate what artifacts (graph construction scripts, prompts, evaluation harness) are included in the repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.

    Authors: We agree that the manuscript would benefit from explicit validation of the generated questions. While the node-pair construction supplies automatic, explainable ground truth anchored to expert detection logic, we did not include controls or human evaluation in the original submission. In the revision we will add a dedicated subsection describing a human validation study in which security analysts rate a sample of questions for realism, ambiguity, and presence of unintended structural cues. We will also report quantitative artifact analysis comparing relational explicitness in generated questions versus a small set of publicly documented analyst queries. These additions will directly address the concern. revision: yes

  2. Referee: [experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.

    Authors: We accept that the current experimental reporting is insufficiently detailed. The manuscript summarizes model performance but omits the precise reward formulation, full baseline specifications, error categorization, and ablations. In the revised version we will expand the experiments section to include the exact reward metric definition and computation, descriptions of all baselines and their prompting setups, a qualitative error analysis of representative failures, and ablation results on question-generation parameters (e.g., prompt variations and graph depth). These additions will better substantiate the reported 0.606 score and the conclusion that substantial headroom remains. revision: yes
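
One simple form the proposed artifact analysis could take is a leakage check: flag any generated question whose text or context already contains its own ground-truth answer. The sketch below is an assumption about how such a check might be written, not the authors' procedure. The record fields (`question`, `context`, `answer`) and the sample items are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    context: str
    answer: str

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores trivial formatting."""
    return " ".join(text.lower().split())

def leaks_answer(item: Item) -> bool:
    """True if the answer string appears verbatim in the question or context."""
    haystack = normalize(item.question + " " + item.context)
    return normalize(item.answer) in haystack

def leakage_rate(items: list[Item]) -> float:
    """Fraction of items whose answer is visible before any log is queried."""
    if not items:
        return 0.0
    return sum(leaks_answer(i) for i in items) / len(items)

# Invented sample items for illustration only.
items = [
    Item("Which IP sent the phishing email?", "Alert: malicious URL click", "10.0.0.7"),
    Item("Which address, such as 10.0.0.9, ran the tool?", "Alert: suspicious process", "10.0.0.9"),
]
print(leakage_rate(items))  # 0.5
```

A verbatim match only catches the crudest leaks. Subtler structural cues, such as a question naming the exact table or relation to follow, would need the human rating study described in the first response.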

Circularity Check

0 steps flagged

No significant circularity in benchmark construction

Full rationale

The paper constructs ExCyTIn-Bench from external Azure/Sentinel logs processed via expert-crafted detection logic into investigation graphs, then generates questions by prompting an LLM to use start-node context and end-node answers for automatic ground truth. This yields 7542 questions whose structure and difficulty are defined independently of any model performance numbers or fitted parameters. The reported best-model reward of 0.606 is a direct empirical measurement on the resulting test set and does not reduce to the construction pipeline by definition or self-reference. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the derivation remains self-contained against the external logs and rules.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about security log analysis and the validity of the graph-to-question pipeline; no free parameters or invented entities are evident from the abstract.

axioms (2)
  • domain assumption: Expert-crafted detection logic accurately captures threat patterns and multi-hop evidence chains in the 57 Microsoft Sentinel log tables.
    Invoked to construct the investigation graphs that anchor all questions.
  • ad hoc to paper: Questions generated by LLMs from paired start/end nodes on the graphs yield automatic, explainable ground truth that measures genuine threat investigation capability.
    Central to the benchmark's claim of reusability and validity.

pith-pipeline@v0.9.0 · 5795 in / 1446 out tokens · 56024 ms · 2026-05-19T04:34:09.331425+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

    cs.CR 2026-04 conditional novelty 8.0

    A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.

  2. GenAI-Driven Threat Detection with Microsoft Security Copilot

    cs.CR 2026-05 unverdicted novelty 5.0

    DTDA is an LLM-powered autonomous agent that investigates Microsoft Defender incidents via planner-executor loops and generates novel alerts, achieving 80.1% precision in 120-day production use and 0.78 F1 offline.

  3. GenAI-Driven Threat Detection with Microsoft Security Copilot

    cs.CR 2026-05 unverdicted novelty 5.0

    DTDA is an LLM agent that produces novel security alerts at 80.1% customer-validated precision and 0.78 F1 on hidden activity while running at production scale inside Microsoft Defender.

  4. Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis

    cs.CR 2026-05 unverdicted novelty 5.0

    Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gain...

Reference graph

Works this paper leans on

141 extracted references · 141 canonical work pages · cited by 3 Pith papers · 16 internal anchors

  1. [1]

    Ctibench: A benchmark for evaluating llms in cy- ber threat intelligence,

    Md Tanvirul Alam, Dipkamal Bhushl, Le Nguyen, and Nidhi Rastogi. Ctibench: A benchmark for evaluating llms in cyber threat intelligence. arXiv preprint arXiv:2406.07599, 2024

  2. [2]

    Towards an ai-enhanced cyber threat intelligence pro- cessing pipeline

    Lampis Alevizos and Martijn Dekker. Towards an ai-enhanced cyber threat intelligence pro- cessing pipeline. Electronics, 13(11):2021, 2024

  3. [3]

    Magic: Generating self-correction guideline for in-context text-to-sql

    Arian Askari, Christian Poelitz, and Xinye Tang. Magic: Generating self-correction guideline for in-context text-to-sql. arXiv preprint arXiv:2406.12692, 2024

  4. [4]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  5. [5]

    Secure: Benchmarking generative large language models for cybersecurity advisory

    Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Lightcap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, and Nidhi Rastogi. Secure: Benchmarking generative large language models for cybersecurity advisory. arXiv preprint arXiv:2405.20441, 2024

  6. [6]

    AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2023. 10

  7. [7]

    What is cyber threat hunting?, 2023

    CrowdStrike. What is cyber threat hunting?, 2023. URL https://www.crowdstrike.com/ cybersecurity-101/threat-hunting/. Accessed: 14 March 2024

  8. [8]

    2024 global threat report, 2024

    CrowdStrike. 2024 global threat report, 2024. URL https://www.crowdstrike.com/ global-threat-report/

  9. [9]

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou

    Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306, 2023

  10. [10]

    Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yun- tao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023

  11. [11]

    Enabling efficient cyber threat hunting with cyber threat intelligence

    Peng Gao, Fei Shao, Xiaoyuan Liu, Xusheng Xiao, Zheng Qin, Fengyuan Xu, Prateek Mittal, Sanjeev R Kulkarni, and Dawn Song. Enabling efficient cyber threat hunting with cyber threat intelligence. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 193–204. IEEE, 2021

  12. [12]

    Tactical provenance analysis for endpoint detection and response systems

    Wajih Ul Hassan, Adam Bates, and Daniel Marino. Tactical provenance analysis for endpoint detection and response systems. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1172–1189. IEEE, 2020

  13. [13]

    A comprehensive overview of large language models (llms) for cyber defences: Opportunities and direc- tions,

    Mohammed Hassanin and Nour Moustafa. A comprehensive overview of large language models (llms) for cyber defences: Opportunities and directions. arXiv preprint arXiv:2405.14487, 2024

  14. [14]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  15. [15]

    {SLEUTH}: Real-time attack scenario reconstruction from {COTS} audit data

    Md Nahid Hossain, Sadegh M Milajerdi, Junao Wang, Birhanu Eshete, Rigel Gjomemo, R Sekar, Scott Stoller, and VN Venkatakrishnan. {SLEUTH}: Real-time attack scenario reconstruction from {COTS} audit data. In 26th USENIX Security Symposium (USENIX Security 17), pages 487–504, 2017

  16. [16]

    Infiagent-dabench: Evaluating agents on data analysis tasks

    Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024

  17. [17]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InForty-first International Conference on Machine Learning, 2024

  18. [18]

    What is threat hunting?, 2023

    IBM. What is threat hunting?, 2023. URL https://www.ibm.com/topics/ threat-hunting. Accessed: 1 Oct 2024

  19. [19]

    X.; and Wen, J.-R

    Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023

  20. [20]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  21. [21]

    Crimson: Empowering strategic reasoning in cybersecurity through large language models

    Jiandong Jin, Bowen Tang, Mingxuan Ma, Xiao Liu, Yunfei Wang, Qingnan Lai, Jia Yang, and Changling Zhou. Crimson: Empowering strategic reasoning in cybersecurity through large language models. arXiv preprint arXiv:2403.00878, 2024

  22. [22]

    Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity

    Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, and Xiapu Luo. Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity. arXiv preprint arXiv:2412.20787, 2024

  23. [23]

    Cyberpal

    Matan Levi, Yair Allouche, Daniel Ohayon, and Anton Puzanov. Cyberpal. ai: Empowering llms with expert-driven cybersecurity instructions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24402–24412, 2025. 11

  24. [24]

    Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models

    Guancheng Li, Yifeng Li, Wang Guannan, Haoyu Yang, and Yang Yu. Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models. https://github.com/XuanwuAI/SecEval, 2023

  25. [25]

    Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

  26. [26]

    Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

  27. [27]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

  28. [28]

    NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

    Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. arXiv preprint arXiv:1802.08979, 2018

  29. [29]

    Secqa: A concise question-answering dataset for evaluating large language models in computer security

    Zefang Liu. Secqa: A concise question-answering dataset for evaluating large language models in computer security. arXiv preprint arXiv:2312.15838, 2023

  30. [30]

    Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

    Zefang Liu, Jialei Shi, and John F Buford. Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

  31. [31]

    Evolving techniques in cyber threat hunting: A systematic review

    Arash Mahboubi, Khanh Luong, Hamed Aboutorab, Hang Thanh Bui, Geoff Jarrad, Mo- hammed Bahutair, Seyit Camtepe, Ganna Pogrebna, Ejaz Ahmed, Bazara Barry, et al. Evolving techniques in cyber threat hunting: A systematic review. Journal of Network and Computer Applications, page 104004, 2024

  32. [32]

    Mitre att&ck

    MITRE. Mitre att&ck, 2025. URL https://attack.mitre.org/. A knowledge base of adversary tactics and techniques

  33. [33]

    On evaluating the integration of reasoning and action in llm agents with database question answering

    Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, and Arman Cohan. On evaluating the integration of reasoning and action in llm agents with database question answering. arXiv preprint arXiv:2311.09721, 2023

  34. [34]

    OSINT Framework

    Joshua Nordine. OSINT Framework. https://github.com/lockfale/osint-framework (commit 68c904c), 2024. Accessed 2025-05-10

  35. [35]

    Agir: Automating cyber threat intelligence reporting with natural language generation

    Filippo Perrina, Francesco Marchiori, Mauro Conti, and Nino Vincenzo Verde. Agir: Automating cyber threat intelligence reporting with natural language generation. In 2023 IEEE International Conference on Big Data (BigData), pages 3053–3062. IEEE, 2023

  36. [36]

    Din-sql: Decomposed in-context learning of text-to-sql with self-correction

    Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36, 2024

  37. [37]

    Out of the cage: How stochastic parrots win in cyber security environments

    Maria Rigaki, Ondřej Lukáš, Carlos A Catania, and Sebastian Garcia. Out of the cage: How stochastic parrots win in cyber security environments. arXiv preprint arXiv:2308.12086, 2023

  38. [38]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  39. [39]

    Time for action: Automated analysis of cyber threat intelligence in the wild

    Giuseppe Siracusano, Davide Sanvito, Roberto Gonzalez, Manikantan Srinivasan, Sivakaman Kamatchi, Wataru Takahashi, Masaru Kawakita, Takahiro Kakumaru, and Roberto Bifulco. Time for action: Automated analysis of cyber threat intelligence in the wild. arXiv preprint arXiv:2307.10214, 2023

  40. [40]

    Towards evaluation and understanding of large language models for cyber operation automation

    Madeena Sultana, Adrian Taylor, Li Li, and Suryadipta Majumdar. Towards evaluation and understanding of large language models for cyber operation automation. In 2023 IEEE Conference on Communications and Network Security (CNS), pages 1–6. IEEE, 2023

  41. [41]

    CHESS: Contextual Harnessing for Efficient SQL Synthesis

    Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

  42. [42]

    Common Vulnerabilities and Exposures (CVE) Program

    The MITRE Corporation. Common Vulnerabilities and Exposures (CVE) Program. https://www.cve.org/, 2025. Accessed 2025-05-10

  43. [43]

    MITRE ATT&CK ® Knowledge Base

    The MITRE Corporation. MITRE ATT&CK ® Knowledge Base. https://attack.mitre.org/, 2025. Version 17.1. Accessed 2025-05-10

  44. [44]

    Mac-sql: A multi-agent collaborative framework for text-to-sql

    Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. arXiv preprint arXiv:2312.11242, 2024

  45. [45]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  46. [46]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  47. [47]

    Avalon’s game of thoughts: Battle against deception through recursive contemplation

    Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon’s game of thoughts: Battle against deception through recursive contemplation. ArXiv, abs/2310.01320, 2023. URL https://api.semanticscholar.org/CorpusID:263605971

  48. [48]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  49. [49]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023

  50. [50]

    Mathchat: Converse to tackle challenging math problems with llm agents

    Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. Mathchat: Converse to tackle challenging math problems with llm agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  51. [51]

    Stateflow: Enhancing llm task-solving through state-driven workflows

    Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows. arXiv preprint arXiv:2403.11322, 2024

  52. [52]

    Crab: Cross-environment agent benchmark for multimodal language model agents

    Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, et al. Crab: Cross-environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511, 2024

  53. [53]

    Intercode: Standardizing and benchmarking interactive coding with execution feedback

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023

  54. [54]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

  55. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  56. [56]

    Attackseqbench: Benchmarking large language models’ understanding of sequential patterns in cyber attacks

    Javier Yong, Haokai Ma, Yunshan Ma, Anis Yusof, Zhenkai Liang, and Ee-Chien Chang. Attackseqbench: Benchmarking large language models’ understanding of sequential patterns in cyber attacks. arXiv preprint arXiv:2503.03170, 2025

  57. [57]

    Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018

  58. [58]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

  59. [59]

    Autodefense: Multi-agent llm defense against jailbreak attacks

    Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024

  60. [60]

    Cybench: A framework for evaluating cybersecurity capabilities and risk of language models

    Andy K Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risk of language models. arXiv preprint arXiv:2408.08926, 2024

  61. [61]

    When llms meet cybersecurity: A systematic literature review

    Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, and Hongsong Zhu. When llms meet cybersecurity: A systematic literature review. arXiv preprint arXiv:2405.03644, 2024

  62. [62]

    When llms meet cybersecurity: A systematic literature review

    Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. When llms meet cybersecurity: A systematic literature review. Cybersecurity, 8(1):1–41, 2025

  63. [63]

    Reactable: Enhancing react for table question answering

    Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M Patel. Reactable: Enhancing react for table question answering. arXiv preprint arXiv:2310.00815, 2023

  64. [64]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  65. [65]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

    A Limitations and Broader Impacts

    Limitations. While ExCyTIn-Bench represents a significant step toward evaluating LLM agents on real...

  66. [66]

    Identification of PII Columns

    Each table is scanned column by column. For every column we draw a random sample of five values and prompt an LLM to decide whether the column contains PII. Columns provisionally flagged in this first pass are examined once more with three focused prompts:

    1. Confirm whether the column indeed holds PII.
    2. Decide whether the column stores a dictionary/JSON structure.
    3. If it does, enumerate which keys inside the structure contain PII.

    The union of both LLM passes is then reviewed by domain experts, yielding a curated list of PII-bearing columns that serves as ground truth for the remainder of the pipeline

  70. [70]

    John” → “Javier

    Creation of PII Value Mappings. For every confirmed PII column we gather its set of unique values. If the column encodes a dictionary, only the keys identified in the previous stage are considered. • Regex-based substitution. We manually go through the tables to recognize common PII patterns, and each candidate value is matched against them (IPv4/IPv6 add...

  71. [71]

    Dataset-wide Replacement

    In the final stage we stream every table in the dataset, globally replacing each source PII value with its surrogate. This guarantees referential consistency (queries that join on an anonymised IP address still succeed) and eliminates residual PII leakage while preserving analytical utility.

    D Additional Question Generation Details...
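    The dataset-wide replacement stage of the anonymisation pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy tables, and the surrogate values are all hypothetical, and the mapping is assumed to have been built already by the earlier mapping stage. The point it shows is that applying one global mapping to every table keeps joins on anonymised values working.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]


def make_replacer(mapping: Dict[str, str]) -> Callable[[str], str]:
    """Return a function that swaps every mapped PII value found in a string."""
    if not mapping:
        return lambda text: text
    # Try longer keys first so a mapping for "10.0.0.5" cannot
    # corrupt the separate value "10.0.0.50".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


def anonymise_tables(
    tables: Dict[str, List[Row]], mapping: Dict[str, str]
) -> Dict[str, List[Row]]:
    """Apply the same source-to-surrogate mapping to every cell of every table."""
    replace = make_replacer(mapping)
    return {
        name: [{col: replace(val) for col, val in row.items()} for row in rows]
        for name, rows in tables.items()
    }


# Hypothetical mapping and tables, for illustration only.
mapping = {
    "10.0.0.5": "198.51.100.7",
    "10.0.0.50": "198.51.100.8",
    "john@contoso.com": "javier@example.org",
}
tables = {
    "SigninLogs": [{"User": "john@contoso.com", "IPAddress": "10.0.0.5"}],
    "SecurityAlert": [{"Entities": '{"Address": "10.0.0.5", "Other": "10.0.0.50"}'}],
}
clean = anonymise_tables(tables, mapping)
```

    Because both toy tables are rewritten with the same mapping, the surrogate IP in `SigninLogs` still appears inside the `SecurityAlert` entity string, which is the referential-consistency property the text relies on. Matching longer keys first is a design choice of this sketch, not something the source specifies.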

  72. [72]

    master–slave

    Under these criteria and filtering after question generation, we collected a total of 589 questions as the test set (see Figure 4). We also created a strategy for sampling questions to split the training and test sets. Since we are building questions from the graph, and the train and test sets are all from one graph, we want the train samples to have less over...

  73. [73]

    Thought: .... nAction:

    This suggests that fine-tuning amplified the model’s bias toward the training incidents, degrading its ability to generalize. Given our small sample size, additional studies are needed to characterize the impact of fine-tuning more precisely. However, these initial results imply that naïve fine-tuning may be ill-suited to this task and motivate exploring ...

  74. [74]

    The question should be natural and relevant to the context, and it should be clear and have a deterministic answer

  75. [75]

    If the start and end alert are the same, you should be more careful since the given entities may have overlapping information

    But it should not leak the answer. If the start and end alert are the same, you should be more careful since the given entities may have overlapping information

  76. [76]

    powershell.exe

    The question should be specific of the answer you are looking for, and the answer should match the question. - "answer": the answer to the question. You may be given one or more entities from the end alert, select the most meaningful entity and make sure it is not leaked in the context or question. - "context": the context from the start alert. you should...

  77. [77]

    Suspicious access to LSASS service

    2024-06-20 07:36 UTC – CredentialAccess: “Suspicious access to LSASS service” on vnevado-win10v via mimikatz.exe (Account: tgs2z)

  78. [78]

    Possible attempt to access Primary Refresh Token (PRT)

    2024-06-20 08:51 – CredentialAccess: “Possible attempt to access Primary Refresh Token (PRT)” on vnevado-win10v by get-userprttoken.ps1 (tgs2z)

  79. [79]

    Mimikatz credential theft tool

    2024-06-20 08:58 – Malware: “Mimikatz credential theft tool” detected on vnevado-win10v

  80. [80]

    Malicious credential theft tool execution detected

    2024-06-20 09:00 – CredentialAccess: “Malicious credential theft tool execution detected” on vnevado-win10v
