pith. machine review for the scientific record.

arxiv: 2604.19533 · v3 · submitted 2026-04-21 · 💻 cs.CR · cs.AI

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Pith reviewed 2026-05-10 02:20 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI
keywords cyber defense benchmark · LLM agents · threat hunting · security operations · Windows event logs · MITRE ATT&CK · SQL querying · agent evaluation

The pith

Large language models fail to identify malicious events in raw security logs, with the best frontier model achieving only 3.8 percent average recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates the Cyber Defense Benchmark to measure LLM agents on the core task of threat hunting: examining large databases of raw Windows event logs with no hints and correctly flagging the exact timestamps of malicious activity. It wraps 106 real attack procedures from public security datasets into a Gymnasium environment where agents must use iterative SQL queries on time-shifted and obfuscated logs to discover evidence. Testing five frontier models on 26 campaigns shows that all of them perform poorly: the best model flags only 3.8 percent of malicious events on average, and no run ever recovers every flag in a campaign. This gap persists even though the models do well on standard question-answering security tests, indicating that open-ended evidence gathering remains out of reach.

Core claim

The Cyber Defense Benchmark requires LLM agents to query an in-memory SQLite database of 75,000 to 135,000 Windows event logs per episode and explicitly flag malicious timestamps. Ground truth comes from Sigma rules derived from 106 attack procedures spanning 86 MITRE ATT&CK sub-techniques. Across evaluations of Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash, the top model reaches only 3.8 percent average recall, no run finds all flags, and no model meets the 50 percent recall threshold on every tactic.
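
To make the task concrete, here is a minimal sketch of a single query-and-flag step against an in-memory SQLite log table. The paper does not publish its schema, so the table name, columns, and the encoded-PowerShell hunt below are illustrative assumptions, not the benchmark's actual interface.

```python
import sqlite3

# Illustrative only: the benchmark's real schema is not published, so the
# table and column names here are assumptions.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (timestamp TEXT, hostname TEXT, event_id INTEGER, command_line TEXT)"
)
con.execute(
    "INSERT INTO events VALUES "
    "('2026-04-01T12:00:03Z', 'host-3f9a', 1, 'powershell.exe -enc SQBFAFgA...')"
)

# One hunting-style query an agent might issue: process-creation events
# (Sysmon-like event_id = 1) whose command line carries encoded PowerShell.
rows = con.execute(
    "SELECT timestamp FROM events "
    "WHERE event_id = 1 AND command_line LIKE '%-enc %' "
    "ORDER BY timestamp"
).fetchall()

# The timestamps the agent would explicitly flag as malicious.
candidate_flags = [ts for (ts,) in rows]
print(candidate_flags)  # ['2026-04-01T12:00:03Z']
```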

What carries the argument

The Gymnasium reinforcement-learning environment that converts real attack recordings into time-shifted, entity-obfuscated SQLite databases and scores agents on iterative SQL queries against Sigma-rule ground truth.
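
A minimal sketch of how such an environment could be shaped, assuming a standard Gymnasium reset/step API. The class name, action encoding, observation format, and reward scheme below are hypothetical stand-ins, not the paper's implementation.

```python
import sqlite3
import gymnasium as gym


class ThreatHuntEnv(gym.Env):
    """Hypothetical sketch of the benchmark loop, not the paper's code.

    Actions are dicts: {"sql": "..."} runs a query; {"flag": "<timestamp>"}
    claims a malicious event. Reward is CTF-style: +1 per correct flag.
    """

    def __init__(self, db_builder, ground_truth_timestamps, max_turns=50):
        self.db_builder = db_builder                      # populates a SQLite connection
        self.ground_truth = set(ground_truth_timestamps)  # from Sigma-rule matches
        self.max_turns = max_turns

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.con = sqlite3.connect(":memory:")  # fresh time-shifted, obfuscated logs
        self.db_builder(self.con)
        self.found, self.turns = set(), 0
        return {"schema": "events(timestamp, hostname, event_id, ...)"}, {}

    def step(self, action):
        self.turns += 1
        reward, obs = 0.0, {}
        if "sql" in action:
            try:
                obs["rows"] = self.con.execute(action["sql"]).fetchmany(100)
            except sqlite3.Error as err:
                obs["error"] = str(err)  # malformed queries still cost a turn
        elif "flag" in action:
            if action["flag"] in self.ground_truth and action["flag"] not in self.found:
                self.found.add(action["flag"])
                reward = 1.0             # correct, previously unclaimed flag
        terminated = self.found == self.ground_truth
        truncated = self.turns >= self.max_turns
        info = {"recall": len(self.found) / max(len(self.ground_truth), 1)}
        return obs, reward, terminated, truncated, info
```

Tying reward exclusively to exact ground-truth timestamps mirrors the CTF-style scoring: queries are free exploration, and recall is the fraction of Sigma-derived flags correctly claimed.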

If this is right

  • No current frontier LLM meets the minimum performance bar for unsupervised threat hunting in a security operations center.
  • Results on curated security question-and-answer benchmarks do not predict success on open-ended log analysis tasks.
  • Agents must develop better strategies for evidence gathering and iterative querying to handle real security data volumes.
  • The benchmark provides a concrete, reproducible test that future models or agent designs can be measured against.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be expanded to test agents with access to additional tools such as log search interfaces or external knowledge bases.
  • Persistent memory across episodes or better long-term planning might be required before LLMs can handle extended investigations.
  • Hybrid setups that combine LLM reasoning with traditional rule-based detectors could bridge the current performance gap.

Load-bearing premise

The time-shifted and entity-obfuscated log database with SQL-only access accurately represents the main difficulties of real-world security operations center threat hunting.
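
For intuition, a minimal sketch of what deterministic time-shifting and entity obfuscation could look like. The exact transforms are not specified in this review, so the fixed-offset shift and hash-derived pseudonyms below are assumptions, chosen to preserve cross-log joins while defeating memorized indicators such as real hostnames.

```python
# Assumed transforms for illustration; the paper's simulator is not reproduced here.
import hashlib
from datetime import datetime, timedelta


def shift_timestamp(ts: str, delta_days: int) -> str:
    """Shift an ISO-8601 timestamp by a fixed, seed-derived offset so the
    campaign's internal timeline stays intact but absolute dates are useless."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (t + timedelta(days=delta_days)).isoformat()


def obfuscate_entity(name: str, seed: str) -> str:
    """Map a hostname or user to a stable pseudonym: the same input always
    yields the same alias, so SQL joins across tables still line up."""
    digest = hashlib.sha256(f"{seed}:{name}".encode()).hexdigest()[:8]
    return f"host-{digest}"


print(shift_timestamp("2020-09-20T15:04:05Z", delta_days=2040))  # shifted into 2026
print(obfuscate_entity("WORKSTATION5", seed="campaign-26"))      # e.g. host-1a2b3c4d
```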

What would settle it

An LLM agent that achieves at least 50 percent recall on malicious events for every one of the 13 tested ATT&CK tactics across multiple full campaigns in the benchmark.
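
A small sketch of that pass criterion as stated: at least 50 percent recall on every tested tactic. The per-tactic flag counts are illustrative inputs, not reported numbers.

```python
def passes(found: dict[str, int], total: dict[str, int], threshold: float = 0.5) -> bool:
    """Pass only if recall clears the threshold on every tactic with flags."""
    recalls = {t: found.get(t, 0) / n for t, n in total.items() if n > 0}
    return all(r >= threshold for r in recalls.values())


# Illustrative: a model can clear the bar on some tactics yet still fail overall.
total = {"execution": 10, "persistence": 8, "lateral-movement": 6}
found = {"execution": 7, "persistence": 2, "lateral-movement": 3}
print(passes(found, total))  # False: persistence recall is 0.25
```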

Figures

Figures reproduced from arXiv: 2604.19533 by Alankrit Chona, Ambuj Kumar, Igor Kozlov.

Figure 1. Cost-performance Pareto frontier. Each point shows mean ± … [figures/full_fig_p010_1.png]
Figure 2. Coverage Score per turn, per LLM group (sample rollouts). Claude Opus 4.6 exploration is deeper … [figures/full_fig_p011_2.png]
Figure 3. Radar chart of normalized tactic recall for all five models. High-severity (critical/high), high-relevance … [figures/full_fig_p012_3.png]
Original abstract

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Cyber Defense Benchmark, a Gymnasium RL environment that presents LLM agents with in-memory SQLite databases of 75k–135k time-shifted, entity-obfuscated Windows event logs drawn from 106 real attack procedures (86 MITRE ATT&CK sub-techniques). Agents must iteratively issue SQL queries to locate malicious timestamps and explicitly flag them; performance is scored CTF-style against Sigma-rule ground truth. On 26 campaigns covering 105 procedures, five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, Gemini 3 Flash) achieve at most 3.8% average recall (Claude), with zero full successes; no model meets the proposed passing threshold of ≥50% recall on every tactic.

Significance. If the benchmark environment is accepted as a faithful proxy for core threat-hunting reasoning, the results supply a concrete, reproducible demonstration that current LLM agents remain far from unsupervised SecOps deployment, despite strong curated-Q&A performance. The use of real attack recordings, deterministic simulation, and external Sigma ground truth constitutes a clear empirical contribution that future agent work can build upon.

major comments (2)
  1. [Abstract and §3] Abstract (final paragraph) and §3 (Benchmark Design, implied): the claim that the observed failures demonstrate LLMs are “poorly suited for open-ended, evidence-driven threat hunting” treats the Gymnasium SQL-only interface as load-bearing. Real SOC workflows routinely supply SIEM search UIs, log aggregation views, external IOC enrichment, and free-text notes; the present setup’s restriction to iterative SQL against obfuscated entities may primarily test query-generation and JOIN proficiency rather than attack-pattern recognition. Without ablation runs on non-obfuscated logs or richer tool interfaces, the generalization rests on an untested environment-fidelity assumption.
  2. [§4] §4 (Evaluation Protocol, implied): the reported 3.8% recall and “no run ever finds all flags” figures are presented without stated numbers of independent trials per campaign, temperature settings, or prompt templates. Given the stochastic nature of LLM agents and the large state space (75k–135k records), these omissions prevent assessment of whether the dramatic failure rates are statistically stable or sensitive to minor prompting variations.
minor comments (2)
  1. [Abstract] Abstract: the relationship between the 106 procedures, 26 campaigns, and 13 tactics is stated but not tabulated; a small table or sentence clarifying coverage would improve readability.
  2. [Abstract] The passing criterion (“≥50% recall on every ATT&CK tactic”) is introduced without justification against operational SOC thresholds; a brief reference or sensitivity analysis would strengthen the interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Abstract and §3] The claim that observed failures demonstrate LLMs are poorly suited for open-ended, evidence-driven threat hunting treats the Gymnasium SQL-only interface as load-bearing. Real SOC workflows supply SIEM UIs, IOC enrichment, and richer views; without ablations on non-obfuscated logs or richer tool interfaces, the generalization rests on an untested environment-fidelity assumption.

    Authors: We agree that the benchmark deliberately restricts agents to iterative SQL queries against time-shifted and entity-obfuscated logs, which foregrounds query formulation, JOIN reasoning, and evidence chaining rather than the full spectrum of SOC tooling. This design isolates a core, reproducible component of threat hunting. At the same time, we recognize that the strong wording in the abstract and §3 risks overgeneralization without supporting ablations. In the revised version we will qualify the relevant claims to specify that the results apply to this controlled SQL-only setting, add explicit discussion of the interface limitations, and outline planned extensions to richer interfaces and non-obfuscated data. This is a partial revision consisting of textual clarifications and expanded limitations text. revision: partial

  2. Referee: [§4] The reported 3.8 % recall and “no run ever finds all flags” figures are presented without stated numbers of independent trials per campaign, temperature settings, or prompt templates. Given LLM stochasticity and the large state space, these omissions prevent assessment of statistical stability.

    Authors: The referee correctly notes that the evaluation protocol section lacks these reproducibility details. The original submission omitted them primarily for length reasons. We will revise §4 to report the number of independent trials conducted per campaign, the temperature settings used, the prompt templates, and any observed variance across runs. These additions will allow readers to evaluate the stability of the reported performance figures. This is a full revision to improve methodological transparency. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on new benchmark

full rationale

The paper introduces a Gymnasium environment wrapping OTRF Security-Datasets into SQLite DBs with time-shifting and entity obfuscation, defines ground truth via Sigma rules, and reports direct performance metrics (recall, full-success rate) for five LLMs across 26 campaigns. The central claim follows immediately from these observed values (Claude Opus at 3.8% average recall, zero full successes, no model passing the 50% recall bar on all tactics) without any derivation, fitted parameter, self-referential definition, or load-bearing self-citation. The protocol is externally verifiable against the cited public datasets and rules; no step reduces the result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the OTRF datasets and the fidelity of the benchmark environment to real SOC tasks.

axioms (2)
  • domain assumption The 106 attack procedures from the OTRF Security-Datasets corpus are representative of real-world threats.
    The benchmark wraps these procedures into episodes, assuming they cover relevant attack patterns across tactics.
  • domain assumption Sigma-rule-derived ground truth accurately labels malicious events in the simulated logs.
    All scoring depends on this external rule set matching the campaign simulator output.

pith-pipeline@v0.9.0 · 5618 in / 1239 out tokens · 54396 ms · 2026-05-10T02:20:24.478935+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    Rodriguez, J. B., et al. (2020). Mordor: Pre-recorded Security Events. OTRF Security-Datasets. github.com/OTRF/Security-Datasets

  2. [2]

    Peng, B., et al. (2024). SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity. arXiv:2412.20787

  3. [3]

    Tihanyi, N., et al. (2023). CyberMetric: A Benchmark Dataset for Evaluating LLMs Knowledge in Cybersecurity. arXiv:2402.07688

  4. [4]

    Liu, Y., et al. (2024). CTI-Bench: Evaluating LLMs in Cyber Threat Intelligence. ACL Findings 2024

  5. [5]

    Tann, W., et al. (2023). Using LLMs for Cybersecurity CTF Challenges and Certification Questions. IEEE SSCI 2023

  6. [6]

    Yang, J., et al. (2023). InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. NeurIPS 2023

  7. [7]

    Shao, R., et al. (2024). NYU CTF Bench: A Scalable Open-Source Benchmark for LLMs in Offensive Security. arXiv:2406.05590

  8. [8]

    Happe, A., et al. (2023). Getting pwn'd by AI: Penetration Testing with Large Language Models. ESEC/FSE 2023

  9. [9]

    Sigma Project. (2015-2024). SigmaHQ: Generic Signature Format for SIEM Systems. github.com/SigmaHQ/sigma

  10. [10]

    WithSecureLabs. (2022). Chainsaw: Rapidly Search and Hunt Through Windows Forensic Artefacts. github.com/WithSecureLabs/chainsaw

  11. [11]

    Strom, B., et al. (2018). MITRE ATT&CK: Design and Philosophy. MITRE Technical Report MTR180110

  12. [12]

    Towers, M., et al. (2023). Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv:2407.17032

  13. [13]

    Berrevoets, J., et al. (2024). CAI-Bench: Benchmarks for AI Capabilities in Cybersecurity. Preprint

  14. [14]

    LiteLLM. (2023). Call all LLM APIs using the OpenAI format. github.com/BerriAI/litellm

  15. [15]

    Wu, Y., et al. (2025). ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation. arxiv.org/abs/2507.14201

  16. [16]

    Begimher, D., et al. (2026). SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents. arxiv.org/abs/2604.12040

  17. [17]

    Simbian AI. (2025). AI SOC LLM Leaderboard: Benchmarking LLMs on End-to-End Agentic Alert Investigation. simbian.ai/best-ai-for-cybersecurity

  18. [18]

    Deason, L., et al. (2025). CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning. arXiv:2509.20166