ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Anand Mudgerikar; Andrew Zhao; Julia Kiseleva; Manuel Raúl Meléndez Luján; Mauricio Velazco; Michael Albada; Qingyun Wu; Quang Nguyen; Roberto Rodriguez; Srisuma Movva

arXiv: 2507.14201 · v3 · submitted 2025-07-14 · cs.CR · cs.AI · cs.CL

Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, Anand Mudgerikar

Pith reviewed 2026-05-19 04:34 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.CL

keywords: LLM agents, cyber threat investigation, security benchmarks, investigation graphs, security logs, multi-hop reasoning, Microsoft Sentinel

The pith

A new benchmark tests LLM agents on cyber threat investigation using questions from security log graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ExCyTIn-Bench to evaluate how well LLM agents can investigate cyber threats by answering questions that require tracing evidence across multiple security logs. It builds the benchmark from logs in a controlled Azure environment covering 57 tables, first creating investigation graphs with expert detection logic and then generating 7542 questions by pairing nodes so that one supplies context and the other supplies the answer. Experiments across models show the strongest result reaches only a 0.606 reward score. A sympathetic reader would care because this setup offers a scalable way to measure progress toward agents that can assist human analysts with complex, multi-hop log analysis.

Core claim

ExCyTIn-Bench is built by applying expert-crafted detection logic to security logs from Microsoft Sentinel and related services to form threat investigation graphs, then using LLMs to generate questions from node pairs on those graphs. Each question anchors the start node as background context and the end node as the verifiable answer, producing automatic and explainable ground truth while supporting 7542 questions across 57 log tables in a reusable pipeline.

What carries the argument

Threat investigation graphs formed from paired nodes in extracted security logs, where expert detection logic supplies the edges and LLM generation converts node pairs into questions with explicit start-context and end-answer structure.
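
The node-pairing idea can be illustrated with a minimal sketch. It assumes a toy graph in which alerts are linked through shared entities; the node names, the `question_specs` helper, and the choice to pair alert nodes only are all hypothetical, and the LLM step that would turn each pair into a natural-language question is left out. It is not the authors' pipeline.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy incident graph: alerts connect to the entities they mention.
# Node names are illustrative and are not taken from the benchmark data.
EDGES = [
    ("alert:suspicious_logon", "entity:user_a"),
    ("entity:user_a", "alert:kerberoasting_tool"),
    ("alert:kerberoasting_tool", "entity:host_1"),
    ("entity:host_1", "alert:lateral_movement"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def question_specs(edges):
    """Pair every ordered (start alert, end alert) and record context, answer, path."""
    adjacency = build_adjacency(edges)
    alerts = sorted(n for n in adjacency if n.startswith("alert:"))
    specs = []
    for start, end in permutations(alerts, 2):
        path = shortest_path(adjacency, start, end)
        if path is None:
            continue  # unreachable pairs cannot yield an explainable answer
        specs.append({
            "context": start,        # shown to the agent as background
            "answer": end,           # held out as ground truth
            "path": path,            # the evidence chain that explains the answer
            "hops": len(path) - 1,   # a rough proxy for multi-hop difficulty
        })
    return specs
```

Calling `question_specs(EDGES)` on this toy graph yields six ordered pairs. Recording the path alongside each pair is what makes the ground truth explainable in this sketch: every answer comes with the chain of nodes that connects it to the context.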

If this is right

The graph-based construction makes the benchmark reusable and readily extensible to new security logs or environments.
Current LLM agents still face substantial difficulty on these multi-hop tasks, as shown by the highest reward of 0.606.
Automatic ground truth from explicit node pairs enables scalable, explainable evaluation without manual answer curation.
Improved performance on the benchmark would support development of agents that can handle heterogeneous logs in threat investigations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The node-pair question generation method could transfer to other domains that require chaining evidence across structured data records.
Practical use would need additional validation against noisier, less controlled log sources that real organizations encounter.
Strong results here could shorten the initial evidence-gathering phase for analysts and free time for higher-level decisions.

Load-bearing premise

Expert-crafted detection logic plus LLM-generated questions from graph node pairs accurately capture the difficulty and structure of real-world multi-hop security log analysis by human analysts.

What would settle it

If a panel of experienced human security analysts answered the same set of questions and their accuracy or reasoning patterns diverged sharply from what the benchmark assumes, that would indicate the questions do not reflect actual investigation difficulty.

Figures

Figures reproduced from arXiv: 2507.14201 by Anand Mudgerikar, Andrew Zhao, Julia Kiseleva, Manuel Raúl Meléndez Luján, Mauricio Velazco, Michael Albada, Qingyun Wu, Quang Nguyen, Roberto Rodriguez, Srisuma Movva, Yiran Wu, Yogesh K Roy.

**Figure 2.** Overview of the database. We collect a total of 57 tables. The number of columns from …

**Figure 3.** Example Question Generation from graph. The start alert and entities will be used as …

**Figure 4.** An example trajectory of Baseline Agent (with …

**Figure 5.** (a) Reward vs. Number of Turns. (b) Reward vs. Cost.

**Figure 6.** Ablation on database setup. Time Window: We also test with a full version of the database (explained in Section 3.1). Moving from the per-incident slices to the full database lowers average reward to 0.248, which is expected since a longer horizon introduces extra noise. We note that degradation from using a longer time span is mild compared to switching the DB scope. Since the questions constructed by LLM…

**Figure 7.** Counts and average rewards by path length. Results of questions generated from different length ( …

**Figure 8.** Reward versus different query performance metrics and submit rate for each model.

**Figure 9.** Average rounds and average reward with increasing trials. Tested with base agent + GPT-4o. To further explore the limits of our test-time scaling method, we apply Best-of-N sampling to the baseline GPT-4o agent over 10 independent trials (see …

**Figure 10.** Base Prompt for Baseline Agent. BASE_PROMPT + You should only give one thought-action per response. The action from your response will be executed and the result will be shown to you. Follow the format "Thought: .... \nAction: ...." exactly. Do not include any other information in your response. Wait for the response from one action before giving the next thought-action pair. DO NOT make assumptions about …

**Figure 11.** Additional prompt added when testing with …

**Figure 12.** Five example rules extracted with Expel. An Expel consists of the base prompt, all the extracted rules, and 1 demonstration trajectory.

**Figure 13.** Strategy Prompt.

**Figure 14.** The full example of agent (with GPT-4o) solving a question.

**Figure 15.** ReAct Example. For react prompt, we use the base prompt + 3 examples. Here we show one of the examples used.

**Figure 16.** Question Generation Prompt.

**Figure 17.** Solution Generation Prompt.

**Figure 18.** Graph of Incident 5. The bigger blue nodes represent alerts, and the smaller red nodes …

**Figure 19.** Graph of Incident 34.

**Figure 20.** Graph of Incident 38. ID: 0 Possible use of the Rubeus kerberoasting tool ID: 1 Name: fk8mq ID: 2 Sid: S-1-5-21-1540151660-3530000288-105586595-1517 ID: 3 AadUserId: 97e6a954-b6bd-48a5-808c-bd8464cce677 ID: 4 ProcessId__CreatedTimeUtc__CommandLine: 5512__2024-06-27t14:3 ID: 5 ExtractedFileName: psexesvc.exe ID: 6 ProcessId__CreatedTimeUtc__CommandLine: 7644__2024-06-27t14:3 ID: 7 ExtractedFileName: rubeus…

**Figure 21.** Graph of Incident 39.

**Figure 22.** Graph of Incident 55. ID: 0 A potentially malicious URL click was detected ID: 1 Recipient: raphaelt@vnevado.alpineskihouse.co ID: 2 SenderIP: 254.241.243.229 ID: 3 Sender: alyssat@vnevado.alpineskihouse.co ID: 4 MailboxPrimaryAddress: raphaelt@vnevado.alpineskihouse.co ID: 5 Url: http://ms175052280.orangecliff-f53f26fd.eastus.azurecont ID: 6 Name: Nina Park ID: 7 Email: Nina Park@vnevado.alpineskihouse.c…

**Figure 23.** Graph of Incident 134.

**Figure 24.** Graph of Incident 166. ID: 0 A potentially malicious URL click was detected ID: 1 Recipient: alyssat@vnevado.alpineskihouse.co ID: 2 Email messages containing malicious URL removed after deliver ID: 3 MailboxPrimaryAddress: alyssat@vnevado.alpineskihouse.co ID: 4 SenderIP: 202.205.215.225 ID: 5 Sender: raphaelt@vnevado.alpineskihouse.co ID: 6 Name: Hailey Johnson ID: 7 Email: Hailey Johnson@vnevado.alpine…

**Figure 25.** Graph of Incident 322.

**Figure 26.** Incident 5 Report.

**Figure 27.** Incident 5 Report (Continued.)

**Figure 28.** Incident 34 Report

**Figure 29.** Incident 34 Report (Continued.)

**Figure 30.** Incident 38 Report

**Figure 31.** Incident 39 Report

**Figure 32.** Incident 39 Report (Continued.)

**Figure 33.** Incident 55 Report

**Figure 34.** Incident 55 Report (Continued.)

**Figure 35.** Incident 134 Report

**Figure 36.** Incident 134 Report (Continued.) Title of the multi-stage attack Business Email Compromise & Data Exfiltration via Inbox Rule Manipulation and SAP Access 1. EXECUTIVE SUMMARY Over a 36-hour period beginning July 22, 2024, an attacker leveraged anonymous IP logons and a password-spray campaign to gain initial access to the corporate Azure AD account of “Jordan P” (laylaw@vnevado.alpineskihouse.co). Once in…

**Figure 37.** Incident 166 Report.

**Figure 38.** Incident 166 Report (Continued.)

**Figure 39.** Incident 322 Report

**Figure 40.** Incident 322 Report (Continued.)

Original abstract

We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous security logs, follow multi-hop chains of evidence to investigate threats. With the developments of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. We construct a benchmark from a controlled Azure tenant including a SQL environment covering 57 log tables from Microsoft Sentinel and related services, and 7542 generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. Our comprehensive experiments on the test set with different models confirm the difficulty of the task: the best model so far can achieve a reward of 0.606, leaving much headroom for future research. The code is available at https://github.com/microsoft/SecRL
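
To make the agent-environment interaction concrete, here is a minimal sketch of a single turn against a SQL log store. It assumes an in-memory SQLite database as a stand-in for the benchmark's SQL environment, a hypothetical `DemoAlert` table with made-up rows, a "Thought: ... Action: ..." response format like the one quoted in the Figure 10 caption, and a hypothetical `submit:` prefix for final answers. None of these names come from the released code.

```python
import re
import sqlite3

# Hypothetical table and rows, standing in for one of the benchmark's log tables.
SCHEMA = "CREATE TABLE DemoAlert (AlertName TEXT, Entity TEXT, TimeGenerated TEXT)"
ROWS = [
    ("Suspicious logon", "user_a", "2024-06-27T14:30:00"),
    ("Kerberoasting tool", "user_a", "2024-06-27T14:35:00"),
    ("Kerberoasting tool", "host_1", "2024-06-27T14:35:00"),
]

# Assumed response format: a thought followed by exactly one action.
ACTION_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)", re.DOTALL
)


def make_environment():
    """Create the in-memory database that plays the role of the log store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO DemoAlert VALUES (?, ?, ?)", ROWS)
    return conn


def parse_response(text):
    """Split an agent response into its thought and action parts."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        raise ValueError("response does not follow 'Thought: ... Action: ...'")
    return match.group("thought").strip(), match.group("action").strip()


def step(conn, response, max_rows=5):
    """Run one thought-action turn and return what the agent would see next."""
    _, action = parse_response(response)
    # Hypothetical convention: 'submit: <answer>' ends the episode.
    if action.lower().startswith("submit:"):
        return {"done": True, "answer": action[len("submit:"):].strip()}
    try:
        rows = conn.execute(action).fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Errors go back to the agent as observations so it can repair the query.
        return {"done": False, "observation": f"SQL error: {exc}"}
    return {"done": False, "observation": rows}
```

A driver loop would call `step` repeatedly, appending each observation to the conversation until a `submit:` action arrives or a turn budget runs out. Capping the rows returned per query is a design choice in this sketch to keep observations short; the paper's own environment may handle long results differently.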

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ExCyTIn-Bench gives the first graph-based benchmark for LLM agents on cyber threat investigation from real Sentinel logs, but question generation from node pairs risks understating task difficulty.

ExCyTIn-Bench is the first benchmark that evaluates LLM agents on cyber threat investigation by building investigation graphs from Microsoft Sentinel logs and turning node pairs into questions. The authors start with logs from a controlled Azure tenant across 57 tables, apply expert-crafted detection logic to create the graphs, then prompt an LLM to generate questions where the start node supplies background and the end node supplies the answer. This produces 7542 questions with automatic ground truth, and the best model reaches a reward of 0.606. The code release on GitHub is a practical plus that lets others extend the setup to new logs. The construction is grounded in external data and expert rules rather than pure synthesis, which gives the benchmark some independent structure.

The soft spot sits in the question generation step. Because the LLM sees the explicit graph relations when writing the questions, the outputs can carry structural cues or lower ambiguity that a human analyst would not have when working directly from raw logs. This could make the measured performance look better than the real multi-hop difficulty and weaken the claim of substantial headroom. More checks against actual analyst query patterns and a clearer breakdown of the reward calculation would strengthen the evaluation.

The work targets researchers building LLM agents for security operations and incident response. It shows honest engagement with the applied problem and enough reproducibility to merit peer review, even if the evaluation needs tightening on realism.

Referee Report

2 major / 3 minor

Summary. The paper presents ExCyTIn-Bench, the first benchmark for evaluating LLM agents on cyber threat investigation. It builds investigation graphs from expert-crafted detection logic applied to 57 Azure/Sentinel log tables in a controlled tenant, then uses LLMs to generate 7542 questions by pairing start and end nodes on the graphs (start node supplies background context; end node supplies the answer). Experiments on the test set show the best model achieves a reward of 0.606, with the authors concluding substantial headroom remains for future work. The pipeline is positioned as reusable and extensible, with code released at https://github.com/microsoft/SecRL.

Significance. If the benchmark construction faithfully reproduces the difficulty of real multi-hop security log analysis, the work supplies a valuable, grounded resource for the security and AI communities. Strengths include the use of external Azure logs and expert detection rules (reducing circularity), automatic explainable ground truth via node pairs, and an open code release that supports reproducibility and extension to new logs. The reported performance gap could usefully guide development of LLM agents for practical threat investigation tasks.

major comments (2)

[abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.
[experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.

minor comments (3)

[abstract] Abstract: the phrasing 'Evaluate an LLM agent X' appears to contain a placeholder that should be clarified.
[methods] Methods: more explicit description of the 57 log tables, the precise expert detection rules, and the LLM prompts used for question generation would improve reproducibility.
[abstract and conclusion] The paper states code is available but does not enumerate what artifacts (graph construction scripts, prompts, evaluation harness) are included in the repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.

Authors: We agree that the manuscript would benefit from explicit validation of the generated questions. While the node-pair construction supplies automatic, explainable ground truth anchored to expert detection logic, we did not include controls or human evaluation in the original submission. In the revision we will add a dedicated subsection describing a human validation study in which security analysts rate a sample of questions for realism, ambiguity, and presence of unintended structural cues. We will also report quantitative artifact analysis comparing relational explicitness in generated questions versus a small set of publicly documented analyst queries. These additions will directly address the concern.

Revision: yes

Referee: [experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.

Authors: We accept that the current experimental reporting is insufficiently detailed. The manuscript summarizes model performance but omits the precise reward formulation, full baseline specifications, error categorization, and ablations. In the revised version we will expand the experiments section to include the exact reward metric definition and computation, descriptions of all baselines and their prompting setups, a qualitative error analysis of representative failures, and ablation results on question-generation parameters (e.g., prompt variations and graph depth). These additions will better substantiate the reported 0.606 score and the conclusion that substantial headroom remains.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction

The paper constructs ExCyTIn-Bench from external Azure/Sentinel logs processed via expert-crafted detection logic into investigation graphs, then generates questions by prompting an LLM to use start-node context and end-node answers for automatic ground truth. This yields 7542 questions whose structure and difficulty are defined independently of any model performance numbers or fitted parameters. The reported best-model reward of 0.606 is a direct empirical measurement on the resulting test set and does not reduce to the construction pipeline by definition or self-reference. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the derivation remains self-contained against the external logs and rules.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about security log analysis and the validity of the graph-to-question pipeline; no free parameters or invented entities are evident from the abstract.

axioms (2)

domain assumption: Expert-crafted detection logic accurately captures threat patterns and multi-hop evidence chains in the 57 Microsoft Sentinel log tables.
Invoked to construct the investigation graphs that anchor all questions.
ad hoc to paper: Questions generated by LLMs from paired start/end nodes on the graphs yield automatic, explainable ground truth that measures genuine threat investigation capability.
Central to the benchmark's claim of reusability and validity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "the best model so far can achieve a reward of 0.606"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
cs.CR · 2026-04 · conditional · novelty 8.0

A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.

GenAI-Driven Threat Detection with Microsoft Security Copilot
cs.CR · 2026-05 · unverdicted · novelty 5.0

DTDA is an LLM-powered autonomous agent that investigates Microsoft Defender incidents via planner-executor loops and generates novel alerts, achieving 80.1% precision in 120-day production use and 0.78 F1 offline.

GenAI-Driven Threat Detection with Microsoft Security Copilot
cs.CR · 2026-05 · unverdicted · novelty 5.0

DTDA is an LLM agent that produces novel security alerts at 80.1% customer-validated precision and 0.78 F1 on hidden activity while running at production scale inside Microsoft Defender.

Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis
cs.CR · 2026-05 · unverdicted · novelty 5.0

Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gain...

Reference graph

Works this paper leans on

141 extracted references · 141 canonical work pages · cited by 3 Pith papers · 16 internal anchors

[1]

Ctibench: A benchmark for evaluating llms in cy- ber threat intelligence,

Md Tanvirul Alam, Dipkamal Bhushl, Le Nguyen, and Nidhi Rastogi. Ctibench: A benchmark for evaluating llms in cyber threat intelligence. arXiv preprint arXiv:2406.07599, 2024

work page arXiv 2024
[2]

Towards an ai-enhanced cyber threat intelligence pro- cessing pipeline

Lampis Alevizos and Martijn Dekker. Towards an ai-enhanced cyber threat intelligence pro- cessing pipeline. Electronics, 13(11):2021, 2024

work page 2021
[3]

Magic: Generating self-correction guideline for in-context text-to-sql

Arian Askari, Christian Poelitz, and Xinye Tang. Magic: Generating self-correction guideline for in-context text-to-sql. arXiv preprint arXiv:2406.12692, 2024

work page arXiv 2024
[4]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Secure: Benchmarking generative large language models for cybersecurity advisory

Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Lightcap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, and Nidhi Rastogi. Secure: Benchmarking generative large language models for cybersecurity advisory. arXiv preprint arXiv:2405.20441, 2024

work page arXiv 2024
[6]

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

What is cyber threat hunting?, 2023

CrowdStrike. What is cyber threat hunting?, 2023. URL https://www.crowdstrike.com/ cybersecurity-101/threat-hunting/. Accessed: 14 March 2024

work page 2023
[8]

2024 global threat report, 2024

CrowdStrike. 2024 global threat report, 2024. URL https://www.crowdstrike.com/ global-threat-report/

work page 2024
[9]

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306, 2023

work page arXiv 2023
[10]

Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yun- tao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023

work page arXiv 2023
[11]

Enabling efficient cyber threat hunting with cyber threat intelligence

Peng Gao, Fei Shao, Xiaoyuan Liu, Xusheng Xiao, Zheng Qin, Fengyuan Xu, Prateek Mittal, Sanjeev R Kulkarni, and Dawn Song. Enabling efficient cyber threat hunting with cyber threat intelligence. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 193–204. IEEE, 2021

work page 2021
[12]

Tactical provenance analysis for endpoint detection and response systems

Wajih Ul Hassan, Adam Bates, and Daniel Marino. Tactical provenance analysis for endpoint detection and response systems. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1172–1189. IEEE, 2020

work page 2020
[13]

A comprehensive overview of large language models (llms) for cyber defences: Opportunities and direc- tions,

Mohammed Hassanin and Nour Moustafa. A comprehensive overview of large language models (llms) for cyber defences: Opportunities and directions. arXiv preprint arXiv:2405.14487, 2024

work page arXiv 2024
[14]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

{SLEUTH}: Real-time attack scenario reconstruction from {COTS} audit data

Md Nahid Hossain, Sadegh M Milajerdi, Junao Wang, Birhanu Eshete, Rigel Gjomemo, R Sekar, Scott Stoller, and VN Venkatakrishnan. {SLEUTH}: Real-time attack scenario reconstruction from {COTS} audit data. In 26th USENIX Security Symposium (USENIX Security 17), pages 487–504, 2017

work page 2017
[16]

Infiagent-dabench: Evaluating agents on data analysis tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024

work page arXiv 2024
[17]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InForty-first International Conference on Machine Learning, 2024

work page 2024
[18]

What is threat hunting?, 2023

IBM. What is threat hunting?, 2023. URL https://www.ibm.com/topics/ threat-hunting. Accessed: 1 Oct 2024

work page 2023
[19]

X.; and Wen, J.-R

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023

work page arXiv 2023
[20]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Crimson: Empowering strategic reasoning in cybersecurity through large language models

Jiandong Jin, Bowen Tang, Mingxuan Ma, Xiao Liu, Yunfei Wang, Qingnan Lai, Jia Yang, and Changling Zhou. Crimson: Empowering strategic reasoning in cybersecurity through large language models. arXiv preprint arXiv:2403.00878, 2024

work page arXiv 2024
[22]

Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity

Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, and Xiapu Luo. Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity. arXiv preprint arXiv:2412.20787, 2024

[23]

Cyberpal.ai: Empowering llms with expert-driven cybersecurity instructions

Matan Levi, Yair Allouche, Daniel Ohayon, and Anton Puzanov. Cyberpal.ai: Empowering llms with expert-driven cybersecurity instructions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24402–24412, 2025

[24]

Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models

Guancheng Li, Yifeng Li, Wang Guannan, Haoyu Yang, and Yang Yu. Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models. https://github.com/XuanwuAI/SecEval, 2023

[25]

Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

[26]

Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

[27]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

[28]

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. arXiv preprint arXiv:1802.08979, 2018

[29]

Secqa: A concise question-answering dataset for evaluating large language models in computer security

Zefang Liu. Secqa: A concise question-answering dataset for evaluating large language models in computer security. arXiv preprint arXiv:2312.15838, 2023

[30]

Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

Zefang Liu, Jialei Shi, and John F Buford. Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

[31]

Evolving techniques in cyber threat hunting: A systematic review

Arash Mahboubi, Khanh Luong, Hamed Aboutorab, Hang Thanh Bui, Geoff Jarrad, Mohammed Bahutair, Seyit Camtepe, Ganna Pogrebna, Ejaz Ahmed, Bazara Barry, et al. Evolving techniques in cyber threat hunting: A systematic review. Journal of Network and Computer Applications, page 104004, 2024

[32]

Mitre att&ck, 2025

MITRE. Mitre att&ck, 2025. URL https://attack.mitre.org/. A knowledge base of adversary tactics and techniques

[33]

On evaluating the integration of reasoning and action in llm agents with database question answering

Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, and Arman Cohan. On evaluating the integration of reasoning and action in llm agents with database question answering. arXiv preprint arXiv:2311.09721, 2023

[34]

OSINT Framework

Joshua Nordine. OSINT Framework. https://github.com/lockfale/osint-framework (commit 68c904c), 2024. Accessed 2025-05-10

[35]

Agir: Automating cyber threat intelligence reporting with natural language generation

Filippo Perrina, Francesco Marchiori, Mauro Conti, and Nino Vincenzo Verde. Agir: Automating cyber threat intelligence reporting with natural language generation. In 2023 IEEE International Conference on Big Data (BigData), pages 3053–3062. IEEE, 2023

[36]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36, 2024

[37]

Out of the cage: How stochastic parrots win in cyber security environments

Maria Rigaki, Ondřej Lukáš, Carlos A Catania, and Sebastian Garcia. Out of the cage: How stochastic parrots win in cyber security environments. arXiv preprint arXiv:2308.12086, 2023

[38]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

[39]

Time for action: Automated analysis of cyber threat intelligence in the wild

Giuseppe Siracusano, Davide Sanvito, Roberto Gonzalez, Manikantan Srinivasan, Sivakaman Kamatchi, Wataru Takahashi, Masaru Kawakita, Takahiro Kakumaru, and Roberto Bifulco. Time for action: Automated analysis of cyber threat intelligence in the wild. arXiv preprint arXiv:2307.10214, 2023

[40]

Towards evaluation and understanding of large language models for cyber operation automation

Madeena Sultana, Adrian Taylor, Li Li, and Suryadipta Majumdar. Towards evaluation and understanding of large language models for cyber operation automation. In 2023 IEEE Conference on Communications and Network Security (CNS), pages 1–6. IEEE, 2023

[41]

CHESS: Contextual Harnessing for Efficient SQL Synthesis

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

[42]

Common Vulnerabilities and Exposures (CVE) Program

The MITRE Corporation. Common Vulnerabilities and Exposures (CVE) Program. https://www.cve.org/, 2025. Accessed 2025-05-10

[43]

MITRE ATT&CK ® Knowledge Base

The MITRE Corporation. MITRE ATT&CK ® Knowledge Base. https://attack.mitre.org/, 2025. Version 17.1. Accessed 2025-05-10

[44]

Mac-sql: A multi-agent collaborative framework for text-to-sql

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. arXiv preprint arXiv:2312.11242, 2024

[45]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

[46]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

[47]

Avalon’s game of thoughts: Battle against deception through recursive contemplation, 2023

Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon’s game of thoughts: Battle against deception through recursive contemplation. ArXiv, abs/2310.01320, 2023. URL https://api.semanticscholar.org/CorpusID:263605971

[48]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[49]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023

[50]

Mathchat: Converse to tackle challenging math problems with llm agents

Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. Mathchat: Converse to tackle challenging math problems with llm agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

[51]

Stateflow: Enhancing llm task-solving through state-driven workflows

Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows. arXiv preprint arXiv:2403.11322, 2024

[52]

Crab: Cross-environment agent benchmark for multimodal language model agents

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, et al. Crab: Cross-environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511, 2024

[53]

Intercode: Standardizing and benchmarking interactive coding with execution feedback

John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023

[54]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

[55]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[56]

Attackseqbench: Benchmarking large language models’ understanding of sequential patterns in cyber attacks

Javier Yong, Haokai Ma, Yunshan Ma, Anis Yusof, Zhenkai Liang, and Ee-Chien Chang. Attackseqbench: Benchmarking large language models’ understanding of sequential patterns in cyber attacks. arXiv preprint arXiv:2503.03170, 2025

[57]

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018

[58]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

[59]

Autodefense: Multi-agent llm defense against jailbreak attacks

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024

[60]

Cybench: A framework for evaluating cybersecurity capabilities and risk of language models

Andy K Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risk of language models. arXiv preprint arXiv:2408.08926, 2024

[61]

When llms meet cybersecurity: A systematic literature review

Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, and Hongsong Zhu. When llms meet cybersecurity: A systematic literature review. arXiv preprint arXiv:2405.03644, 2024

[62]

When llms meet cybersecurity: A systematic literature review

Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. When llms meet cybersecurity: A systematic literature review. Cybersecurity, 8(1):1–41, 2025

[63]

Reactable: Enhancing react for table question answering

Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M Patel. Reactable: Enhancing react for table question answering. arXiv preprint arXiv:2310.00815, 2023

[64]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

[65]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025.

A Limitations and Broader Impacts

Limitations. While ExCyTIn-Bench represents a significant step toward evaluating LLM agents on real...

[66]

Identification of PII Columns. Each table is scanned column-by-column. For every column we draw a random sample of five values and prompt an LLM to decide whether the column contains PII. Columns provisionally flagged in this first pass are examined once more with three focused prompts:

1. Confirm whether the column indeed holds PII.
2. Decide whether the column stores a dictionary/JSON structure.
3. If it does, enumerate which keys inside the structure contain PII.

The union of both LLM passes is then reviewed by domain experts, yielding a curated list of PII-bearing columns that serves as ground truth for the remainder of the pipeline.
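The first identification pass described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `flag_pii_columns`, the `judge` callable, and `keyword_judge` are hypothetical names, and the toy judge merely stands in for the LLM prompt, whose wording and model are not reproduced here.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: given a column name and sampled values,
# return True if the column appears to contain PII.
PiiJudge = Callable[[str, Sequence[str]], bool]


def flag_pii_columns(
    table: Dict[str, List[str]],
    judge: PiiJudge,
    sample_size: int = 5,
    seed: int = 0,
) -> List[str]:
    """Scan a table column-by-column and return provisionally flagged columns."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    flagged = []
    for column, values in table.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        # Draw up to `sample_size` random values from the column.
        sample = rng.sample(non_empty, min(sample_size, len(non_empty)))
        if judge(column, sample):
            flagged.append(column)
    return flagged


def keyword_judge(column: str, sample: Sequence[str]) -> bool:
    """Toy stand-in for the LLM call (assumption): flags e-mail-like values."""
    return any("@" in value for value in sample)
```

In a real pipeline the `judge` argument would wrap an LLM request, and the flagged columns would then go through the three follow-up prompts and the expert review.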

[70]

John” → “Javier

Creation of PII Value Mappings. For every confirmed PII column we gather its set of unique values. If the column encodes a dictionary, only the keys identified in the previous stage are considered.

• Regex-based substitution. We manually go through the tables to recognize common PII patterns, and each candidate value is matched against them (IPv4/IPv6 add...

[71]

Dataset-wide Replacement. In the final stage we stream every table in the dataset, globally replacing each source PII value with its surrogate. This guarantees referential consistency (queries that join on an anonymised IP address still succeed) and eliminates residual PII leakage while preserving analytical utility.

D Additional Question Generation Details...
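The value-mapping and dataset-wide replacement stages of the PII pipeline can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the function names, the simplified IPv4 pattern, the `10.0.0.N` surrogate scheme, and the in-memory table layout are not taken from the paper.

```python
import re
from typing import Dict, List

Row = Dict[str, str]
Tables = Dict[str, List[Row]]

# Simplified IPv4 pattern (assumption); it does not validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def build_ip_mapping(tables: Tables) -> Dict[str, str]:
    """Assign one surrogate per distinct source IP across all tables."""
    mapping: Dict[str, str] = {}
    for rows in tables.values():
        for row in rows:
            for value in row.values():
                for ip in IPV4.findall(value):
                    if ip not in mapping:
                        # Toy surrogate scheme (assumption): sequential addresses.
                        # It does not guard against collisions with real values.
                        mapping[ip] = f"10.0.0.{len(mapping) + 1}"
    return mapping


def replace_everywhere(tables: Tables, mapping: Dict[str, str]) -> Tables:
    """Apply the same mapping to every table so joins on the value still match."""

    def swap(text: str) -> str:
        return IPV4.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)

    return {
        name: [{col: swap(val) for col, val in row.items()} for row in rows]
        for name, rows in tables.items()
    }
```

Because a single mapping is shared across tables, an address that appears in two tables receives the same surrogate in both, which is what keeps joins on the anonymised value working.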

[72]

master–slave

Under these criteria and filtering after question generation, we collected a total of 589 questions as the test set (see Figure 4). We also created a strategy for sampling questions to split the training and test sets. Since we are building questions from the graph, and the train and test sets are all from one graph, we want the train samples to have less over...

[73]

Thought: .... nAction:

This suggests that fine-tuning amplified the model’s bias toward the training incidents, degrading its ability to generalize. Given our small sample size, additional studies are needed to characterize the impact of fine-tuning more precisely. However, these initial results imply that naïve fine-tuning may be ill-suited to this task and motivate exploring ...

[74]

The question should be natural and relevant to the context, and it should be clear and have a deterministic answer

[75]


But it should not leak the answer. If the start and end alert are the same, you should be more careful since the given entities may have overlapping information

[76]

powershell.exe

The question should be specific of the answer you are looking for, and the answer should match the question. - "answer": the answer to the question. You may be given one or more entities from the end alert, select the most meaningful entity and make sure it is not leaked in the context or question. - "context": the context from the start alert. you should...

[77]

Suspicious access to LSASS service

2024-06-20 07:36 UTC – CredentialAccess: “Suspicious access to LSASS service” on vnevado-win10v via mimikatz.exe (Account: tgs2z)

[78]

Possible attempt to access Primary Refresh Token (PRT)

2024-06-20 08:51 – CredentialAccess: “Possible attempt to access Primary Refresh Token (PRT)” on vnevado-win10v by get-userprttoken.ps1 (tgs2z)

[79]

Mimikatz credential theft tool

2024-06-20 08:58 – Malware: “Mimikatz credential theft tool” detected on vnevado-win10v

[80]

Malicious credential theft tool execution detected

2024-06-20 09:00 – CredentialAccess: “Malicious credential theft tool execution detected” on vnevado-win10v




[16] [16]

Infiagent-dabench: Evaluating agents on data analysis tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, et al. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507, 2024

work page arXiv 2024

[17] [17]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian V ora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InForty-first International Conference on Machine Learning, 2024

work page 2024

[18] [18]

What is threat hunting?, 2023

IBM. What is threat hunting?, 2023. URL https://www.ibm.com/topics/ threat-hunting. Accessed: 1 Oct 2024

work page 2023

[19] [19]

X.; and Wen, J.-R

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645, 2023

work page arXiv 2023

[20] [20]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Crimson: Empowering strategic reasoning in cybersecurity through large language models

Jiandong Jin, Bowen Tang, Mingxuan Ma, Xiao Liu, Yunfei Wang, Qingnan Lai, Jia Yang, and Changling Zhou. Crimson: Empowering strategic reasoning in cybersecurity through large language models. arXiv preprint arXiv:2403.00878, 2024

work page arXiv 2024

[22] [22]

Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity

Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, and Xiapu Luo. Secbench: A comprehensive multi-dimensional benchmarking dataset for llms in cybersecurity. arXiv preprint arXiv:2412.20787, 2024

work page arXiv 2024

[23] [23]

Cyberpal

Matan Levi, Yair Allouche, Daniel Ohayon, and Anton Puzanov. Cyberpal. ai: Empowering llms with expert-driven cybersecurity instructions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24402–24412, 2025. 11

work page 2025

[24] [24]

Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models

Guancheng Li, Yifeng Li, Wang Guannan, Haoyu Yang, and Yang Yu. Seceval: A comprehensive benchmark for evaluating cybersecurity knowledge of foundation models. https://github.com/XuanwuAI/SecEval, 2023

work page 2023

[25] [25]

Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

work page 2023

[26] [26]

Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36:42330–42357, 2023

work page 2023

[27] [27]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. arXiv preprint arXiv:1802.08979, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

Secqa: A concise question-answering dataset for evaluating large language models in computer security

Zefang Liu. Secqa: A concise question-answering dataset for evaluating large language models in computer security. arXiv preprint arXiv:2312.15838, 2023

work page arXiv 2023

[30] [30]

Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

Zefang Liu, Jialei Shi, and John F Buford. Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

work page 2024

[31] [31]

Evolving techniques in cyber threat hunting: A systematic review

Arash Mahboubi, Khanh Luong, Hamed Aboutorab, Hang Thanh Bui, Geoff Jarrad, Mo- hammed Bahutair, Seyit Camtepe, Ganna Pogrebna, Ejaz Ahmed, Bazara Barry, et al. Evolving techniques in cyber threat hunting: A systematic review. Journal of Network and Computer Applications, page 104004, 2024

work page 2024

[32] [32]

Mitre att&ck, 2025

MITRE. Mitre att&ck, 2025. URL https://attack.mitre.org/. A knowledge base of adversary tactics and techniques

work page 2025

[33] [33]

On evaluating the integration of reasoning and action in llm agents with database question answering

Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, and Arman Cohan. On evaluating the integration of reasoning and action in llm agents with database question answering. arXiv preprint arXiv:2311.09721, 2023

work page arXiv 2023

[34] [34]

OSINT Framework

Joshua Nordine. OSINT Framework. https://github.com/lockfale/osint-framework (commit 68c904c), 2024. Accessed 2025-05-10

work page 2024

[35] [35]

Agir: Au- tomating cyber threat intelligence reporting with natural language generation

Filippo Perrina, Francesco Marchiori, Mauro Conti, and Nino Vincenzo Verde. Agir: Au- tomating cyber threat intelligence reporting with natural language generation. In 2023 IEEE International Conference on Big Data (BigData), pages 3053–3062. IEEE, 2023

work page 2023

[36] [36]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[37] [37]

Out of the cage: How stochastic parrots win in cyber security environments.arXiv preprint arXiv:2308.12086, 2023

Maria Rigaki, Ondˇrej Lukáš, Carlos A Catania, and Sebastian Garcia. Out of the cage: How stochastic parrots win in cyber security environments. arXiv preprint arXiv:2308.12086, 2023

work page arXiv 2023

[38] [38]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Time for action: Automated analysis of cyber threat intelligence in the wild

Giuseppe Siracusano, Davide Sanvito, Roberto Gonzalez, Manikantan Srinivasan, Sivakaman Kamatchi, Wataru Takahashi, Masaru Kawakita, Takahiro Kakumaru, and Roberto Bifulco. Time for action: Automated analysis of cyber threat intelligence in the wild. arXiv preprint arXiv:2307.10214, 2023

work page arXiv 2023

[40] [40]

Towards evaluation and un- derstanding of large language models for cyber operation automation

Madeena Sultana, Adrian Taylor, Li Li, and Suryadipta Majumdar. Towards evaluation and un- derstanding of large language models for cyber operation automation. In 2023 IEEE Conference on Communications and Network Security (CNS), pages 1–6. IEEE, 2023. 12

work page 2023

[41] [41]

CHESS: Contextual Harnessing for Efficient SQL Synthesis

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024

work page internal anchor Pith review arXiv 2024

[42] [42]

Common Vulnerabilities and Exposures (CVE) Program

The MITRE Corporation. Common Vulnerabilities and Exposures (CVE) Program. https: //www.cve.org/, 2025. Accessed 2025-05-10

work page 2025

[43] [43]

MITRE ATT&CK ® Knowledge Base

The MITRE Corporation. MITRE ATT&CK ® Knowledge Base. https://attack.mitre. org/, 2025. Version 17.1. Accessed 2025-05-10

work page 2025

[44] [44]

Mac- sql: A multi-agent collaborative framework for text-to-sql,

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. arXiv preprint arXiv:2312.11242, 2024

work page arXiv 2024

[45] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[46] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[47] Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon’s game of thoughts: Battle against deception through recursive contemplation. ArXiv, abs/2310.01320, 2023. URL https://api.semanticscholar.org/CorpusID:263605971

[48] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

[49] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

[50] Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. Mathchat: Converse to tackle challenging math problems with llm agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.

[51] Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows. arXiv preprint arXiv:2403.11322, 2024.

[52] Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, et al. Crab: Cross-environment agent benchmark for multimodal language model agents. arXiv preprint arXiv:2407.01511, 2024.

[53] John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023.

[54] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024.

[55] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[56] Javier Yong, Haokai Ma, Yunshan Ma, Anis Yusof, Zhenkai Liang, and Ee-Chien Chang. Attackseqbench: Benchmarking large language models’ understanding of sequential patterns in cyber attacks. arXiv preprint arXiv:2503.03170, 2025.

[57] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.

[58] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

[59] Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024.

[60] Andy K Zhang, Neil Perry, Riya Dulepet, Eliot Jones, Justin W Lin, Joey Ji, Celeste Menders, Gashon Hussein, Samantha Liu, Donovan Jasper, et al. Cybench: A framework for evaluating cybersecurity capabilities and risk of language models. arXiv preprint arXiv:2408.08926, 2024.

[61] Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, and Hongsong Zhu. When llms meet cybersecurity: A systematic literature review. arXiv preprint arXiv:2405.03644, 2024.

[62] Jie Zhang, Haoyu Bu, Hui Wen, Yongji Liu, Haiqiang Fei, Rongrong Xi, Lun Li, Yun Yang, Hongsong Zhu, and Dan Meng. When llms meet cybersecurity: A systematic literature review. Cybersecurity, 8(1):1–41, 2025.

[63] Yunjia Zhang, Jordan Henkel, Avrilia Floratou, Joyce Cahoon, Shaleen Deep, and Jignesh M Patel. Reactable: Enhancing react for table question answering. arXiv preprint arXiv:2310.00815, 2023.

[64] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024.

[65] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025.

A Limitations and Broader Impacts

Limitations. While ExCyTIn-Bench represents a significant step toward evaluating LLM agents on real...

Identification of PII Columns. Each table is scanned column by column. For every column we draw a random sample of five values and prompt an LLM to decide whether the column contains PII. Columns provisionally flagged in this first pass are examined once more with three focused prompts:

1. Confirm whether the column indeed holds PII.
2. Decide whether the column stores a dictionary/JSON structure.
3. If it does, enumerate which keys inside the structure contain PII.

The union of both LLM passes is then reviewed by domain experts, yielding a curated list of PII-bearing columns that serves as ground truth for the remainder of the pipeline.
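The two-pass column scan described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ask_llm` callable, the prompt wording, the `FIRST_PASS`/`CONFIRM`/`IS_DICT`/`KEYS` tags, and the `stub_llm` stand-in are all assumptions introduced here so the example runs without a real model.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: any function mapping a prompt string to the model's text reply.
LLMFn = Callable[[str], str]


def _sample(values: Sequence[object], k: int, rng: random.Random) -> List[object]:
    """Draw up to k values at random (five in the description above)."""
    values = list(values)
    return rng.sample(values, min(k, len(values)))


def _is_yes(reply: str) -> bool:
    return reply.strip().lower().startswith("yes")


def first_pass(table: Dict[str, Sequence[object]], ask_llm: LLMFn,
               k: int = 5, seed: int = 0) -> List[str]:
    """Provisionally flag columns whose sampled values look like PII."""
    rng = random.Random(seed)
    flagged = []
    for column, values in table.items():
        sample = _sample(values, k, rng)
        prompt = (f"FIRST_PASS: column '{column}' has sample values {sample!r}. "
                  "Does it contain PII? Answer yes or no.")
        if _is_yes(ask_llm(prompt)):
            flagged.append(column)
    return flagged


def second_pass(table: Dict[str, Sequence[object]], flagged: List[str],
                ask_llm: LLMFn, k: int = 5, seed: int = 0) -> Dict[str, dict]:
    """Re-examine flagged columns with the three focused prompts."""
    rng = random.Random(seed)
    report = {}
    for column in flagged:
        base = f"column '{column}' has sample values {_sample(table[column], k, rng)!r}."
        confirmed = _is_yes(ask_llm(f"CONFIRM: {base} Does it indeed hold PII?"))
        is_dict = _is_yes(ask_llm(f"IS_DICT: {base} Is it a dictionary/JSON structure?"))
        pii_keys: List[str] = []
        if is_dict:
            reply = ask_llm(f"KEYS: {base} List the keys holding PII, comma-separated.")
            pii_keys = [key.strip() for key in reply.split(",") if key.strip()]
        report[column] = {"confirmed": confirmed, "is_dict": is_dict, "pii_keys": pii_keys}
    return report


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model, keyed on the column name; illustration only."""
    if prompt.startswith("KEYS"):
        return "ip"
    if prompt.startswith("IS_DICT"):
        return "yes" if "column 'details'" in prompt else "no"
    return "yes" if ("column 'email'" in prompt or "column 'details'" in prompt) else "no"


table = {
    "email": ["a@example.com", "b@example.org"],
    "count": [1, 2, 3, 4, 5, 6],
    "details": ['{"ip": "10.0.0.1", "port": 443}'],
}
flagged = first_pass(table, stub_llm)
report = second_pass(table, flagged, stub_llm)
# Candidates handed to expert review: union of columns surfaced by either pass.
candidates = sorted(set(flagged) | {c for c, r in report.items() if r["confirmed"]})
```

In this sketch the second pass only revisits first-pass columns, so the union equals the first-pass set; the expert review step stays manual and is not modelled.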

John” → “Javier

Creation of PII Value Mappings. For every confirmed PII column we gather its set of unique values. If the column encodes a dictionary, only the keys identified in the previous stage are considered.

• Regex-based substitution. We manually go through the tables to recognize common PII patterns, and each candidate value is matched against them (IPv4/IPv6 add...

Dataset-wide Replacement. In the final stage we stream every table in the dataset, globally replacing each source PII value with its surrogate. This guarantees referential consistency (queries that join on an anonymised IP address still succeed) and eliminates residual PII leakage while preserving analytical utility.

D Additional Question Generation Details...

master–slave

Under these criteria and filtering after question generation, we collected a total of 589 questions as the test set (see Figure 4). We also created a strategy for sampling questions to split the training and test sets. Since we are building questions from the graph, and the train and test sets are all from one graph, we want the train samples to have less over...

Thought: .... nAction:

This suggests that fine-tuning amplified the model’s bias toward the training incidents, degrading its ability to generalize. Given our small sample size, additional studies are needed to characterize the impact of fine-tuning more precisely. However, these initial results imply that naïve fine-tuning may be ill-suited to this task and motivate exploring ...

The question should be natural and relevant to the context, and it should be clear and have a deterministic answer.

But it should not leak the answer. If the start and end alert are the same, you should be more careful since the given entities may have overlapping information.

powershell.exe

The question should be specific of the answer you are looking for, and the answer should match the question.
- "answer": the answer to the question. You may be given one or more entities from the end alert, select the most meaningful entity and make sure it is not leaked in the context or question.
- "context": the context from the start alert. you should...

2024-06-20 07:36 UTC – CredentialAccess: “Suspicious access to LSASS service” on vnevado-win10v via mimikatz.exe (Account: tgs2z)

2024-06-20 08:51 – CredentialAccess: “Possible attempt to access Primary Refresh Token (PRT)” on vnevado-win10v by get-userprttoken.ps1 (tgs2z)

2024-06-20 08:58 – Malware: “Mimikatz credential theft tool” detected on vnevado-win10v

2024-06-20 09:00 – CredentialAccess: “Malicious credential theft tool execution detected” on vnevado-win10v