
arxiv: 2605.21773 · v1 · pith:5P2OU43Gnew · submitted 2026-05-20 · 💻 cs.CR · cs.LG

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords large language models · host-based intrusion detection · system logs · benchmark · cybersecurity · performance evaluation · intrusion detection

The pith

LLMs achieve high precision on simple host logs for intrusion detection but degrade sharply as logs grow noisier and more complex.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HIDBench to test large language models on detecting intrusions from system logs, a task that demands fine-grained reasoning over large, noisy, and imbalanced data. It unifies three public datasets and supplies a pipeline to prepare raw telemetry for LLM input. In the evaluation, frontier models often reach precision above 0.8 on the easier collections, yet on the harder ones the Matthews correlation coefficient (MCC) often falls below 0.5 and false-positive rates climb. This matters because reliable host-based intrusion detection underpins everyday cybersecurity, and knowing where current LLMs break helps decide when they can be trusted in practice.

Core claim

The paper claims that frontier LLMs exhibit substantial performance gaps across the unified datasets. While many models achieve high precision (often above 0.8) on simpler datasets, performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. Models fall into distinct regimes, such as conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. The results indicate that LLMs hold strong potential for host-based intrusion detection, yet their effectiveness remains highly sensitive to data complexity, making robust system design essential for reliable deployment.
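
To make the three metrics in this claim concrete, here is a minimal sketch that computes precision, false positive rate, and MCC from confusion counts. The counts are made up for illustration and are not taken from the paper, and the function name `detection_metrics` is hypothetical. It shows one way precision can sit at 0.8 while MCC stays below 0.5: under heavy class imbalance, a detector that misses most malicious entities still looks precise.

```python
import math


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return precision, false positive rate, and MCC from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "fpr": fpr, "mcc": mcc}


# Illustrative counts only: 10,000 entities, 100 of them malicious.
# The detector raises 25 alerts, 20 of which are correct.
metrics = detection_metrics(tp=20, fp=5, tn=9895, fn=80)
print(metrics)  # precision is 0.8, MCC is roughly 0.40
```

Because MCC uses all four cells of the confusion matrix, it penalizes the 80 missed entities that precision ignores.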

What carries the argument

The HIDBench benchmark and its data construction pipeline, which unifies DARPA-E3, DARPA-E5, and NodLink datasets and converts raw host telemetry into LLM-compatible inputs for systematic evaluation under realistic settings.
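
The page does not describe the pipeline's internals, so the following is only a rough sketch of what segmenting host telemetry into LLM-sized text inputs could look like. The `Event` fields, the `segment` function, and the fixed event budget per segment are all assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Hypothetical host event: timestamp, acting process, action, target."""
    ts: int
    subject: str
    action: str
    obj: str


def segment(events: Iterable[Event], max_events: int) -> Iterator[str]:
    """Yield text segments holding at most max_events time-ordered events."""
    if max_events <= 0:
        raise ValueError("max_events must be positive")
    ordered = sorted(events, key=lambda e: e.ts)
    for start in range(0, len(ordered), max_events):
        chunk = ordered[start:start + max_events]
        # One event per line, so each segment can be pasted into a prompt.
        yield "\n".join(f"{e.ts} {e.subject} {e.action} {e.obj}" for e in chunk)
```

A fixed event budget is the simplest possible choice; it can split a single attack chain across segments, which is exactly the kind of construction artifact the load-bearing premise further down is concerned with.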

If this is right

  • LLMs can support host-based intrusion detection but need robust surrounding systems to handle varying data complexity.
  • Model behavior splits into conservative low-alert regimes and over-sensitive high-alert regimes depending on the input logs.
  • High precision on clean datasets does not predict success when logs contain more overlapping benign and malicious activity.
  • Deployment decisions for LLMs in intrusion detection must explicitly account for expected noise and imbalance levels.
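
The split into conservative and over-sensitive regimes can be sketched as a simple rule on a model's average false positive rate. The thresholds below (0.01 and 0.10) and the names `classify_regime`, `model_a`, and `model_b` are hypothetical; the page does not report the cut-offs the paper uses.

```python
from statistics import mean


def classify_regime(fpr: float, low: float = 0.01, high: float = 0.10) -> str:
    """Label a detector by its false positive rate (thresholds are assumed)."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must lie in [0, 1]")
    if fpr <= low:
        return "conservative"
    if fpr >= high:
        return "over-sensitive"
    return "intermediate"


# Illustrative per-sub-dataset FPR values, not results from the paper.
fpr_by_model = {
    "model_a": [0.002, 0.004, 0.009],
    "model_b": [0.08, 0.21, 0.35],
}
regimes = {name: classify_regime(mean(v)) for name, v in fpr_by_model.items()}
print(regimes)
```

Averaging hides per-dataset swings, so a deployment check would also want the worst-case FPR for each model.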

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid detectors that route simple logs to LLMs and complex logs to traditional rule-based methods could mitigate the observed drops.
  • The benchmark setup could be reused to compare fine-tuned versus zero-shot LLMs on the same log collections.
  • Similar evaluation pipelines might reveal whether the same complexity sensitivity appears in other log-driven security tasks such as anomaly detection.
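
As a sketch of the hybrid idea in the first bullet, the router below sends small, low-diversity log segments to an LLM-backed detector and everything else to a rule-based one. The complexity proxy (event count and number of distinct processes), the default limits, and the `process` key are assumptions chosen for illustration; neither the paper nor this page proposes a specific routing rule.

```python
from typing import Callable, Sequence

Detector = Callable[[Sequence[dict]], bool]


def make_router(
    llm_detector: Detector,
    rule_detector: Detector,
    max_events: int = 200,
    max_processes: int = 20,
) -> Callable[[Sequence[dict]], tuple]:
    """Build a router that picks a detector from a crude complexity proxy."""

    def route(events: Sequence[dict]) -> tuple:
        distinct = len({e["process"] for e in events})
        if len(events) <= max_events and distinct <= max_processes:
            return "llm", llm_detector(events)
        return "rules", rule_detector(events)

    return route


# Stub detectors stand in for real ones in this example.
route = make_router(lambda ev: True, lambda ev: False, max_events=3)
simple = [{"process": "sshd"}, {"process": "bash"}]
print(route(simple))  # ('llm', True)
```

Whether such a proxy actually tracks the difficulty the benchmark measures would itself need to be tested.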

Load-bearing premise

The data construction pipeline that transforms raw host telemetry into LLM-compatible inputs preserves the complex interactions between benign and malicious activities and does not introduce artifacts that artificially change detection difficulty.

What would settle it

If the same models were tested on a deliberately more complex and noisier variant of the same log collections and still maintained MCC above 0.5 with stable false-positive rates, the claimed sensitivity to data complexity would be directly challenged.

Figures

Figures reproduced from arXiv: 2605.21773 by Danyu Sun, Jinghuai Zhang, Yuan Tian, Zhou Li.

Figure 1. Workflow of our benchmark HIDBENCH. It consists of four components: (i) Data Construction & Segmentation, which transforms high-volume, highly-imbalanced raw host logs into structured inputs suitable for evaluation by LLMs; (ii) Malicious Evidence Identification, which extracts attack evidence using the LLM; (iii) Attack Graph Expansion, which augments the identified evidence with its surro…
Figure 2. Per-sub-dataset performance averaged across all LLMs. Higher MCC and precision indicate better detection quality and alert accuracy, while lower FPR indicates fewer false alarms.
Figure 3. Precision–FPR tradeoff across models and dataset families. Each point represents a model–…
Figure 4. Per-model FPR across all sub-datasets, with models sorted by average FPR. Color intensity…
Figure 5. Effect of test-time scaling on detection performance. (a) Precision and (b) MCC versus the…
Original abstract

Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings. Our evaluation of frontier LLMs reveals substantial performance gaps across datasets. While many models achieve high precision (often above 0.8) on simpler datasets, their performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. We further analyze model behavior and identify distinct regimes, including conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. Overall, our results highlight that while LLMs show strong potential for HIDS, their effectiveness is highly sensitive to data complexity, and robust system design is essential for reliable deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces HIDBench, a new benchmark for assessing large language models (LLMs) on host-based intrusion detection (HIDS) from system logs. It unifies three public datasets (DARPA-E3, DARPA-E5, and NodLink), presents a data construction pipeline to convert raw host telemetry into LLM-compatible inputs, and evaluates frontier LLMs. The central empirical finding is that many models achieve high precision (often >0.8) on simpler datasets but degrade sharply on noisier and more complex logs, with MCC frequently dropping below 0.5 and false positive rates rising; the work also identifies distinct behavioral regimes such as conservative low-FPR detectors versus over-sensitive models.

Significance. If the results hold after addressing pipeline validation, the benchmark fills an important gap in LLM evaluation for cybersecurity by focusing on fine-grained reasoning over noisy, imbalanced logs. The unification of multiple datasets and the identification of performance sensitivity to complexity provide actionable insights for deploying LLMs in HIDS, highlighting the need for robust system design rather than direct model use.

major comments (1)
  1. [Abstract; data construction pipeline] Abstract and data construction pipeline description: The central claim that performance degrades due to increasing log complexity (MCC <0.5, rising FPR) depends on the pipeline faithfully preserving benign-malicious interactions without introducing artifacts via truncation, event selection, windowing, or formatting to fit context limits. No explicit validation (e.g., raw-input baselines or comparisons to traditional detectors) is described to rule out construction choices as the source of cross-dataset differences; this assumption is load-bearing for attributing results to inherent data properties rather than pipeline design.
minor comments (1)
  1. [Abstract] The abstract mentions 'distinct regimes' of model behavior but does not specify the exact metrics or thresholds used to classify conservative vs. over-sensitive detectors; adding a brief definition or reference to the relevant results table would improve clarity.
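
One inexpensive check in the spirit of the major comment is to measure how many ground-truth malicious entities still appear anywhere in the constructed inputs after truncation and windowing. The sketch below assumes each segment can be reduced to a set of entity identifiers; the function name `malicious_retention` and the identifiers are hypothetical.

```python
from typing import Iterable, Set


def malicious_retention(
    ground_truth_ids: Iterable[str],
    segments: Iterable[Set[str]],
) -> float:
    """Fraction of ground-truth malicious entities present in any segment."""
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("ground truth must not be empty")
    seen = set().union(*segments)
    return len(truth & seen) / len(truth)


# Illustrative identifiers only.
truth = ["p1", "p2", "p3", "p4"]
segments = [{"p1", "benign_a"}, {"p2", "benign_b"}]
print(malicious_retention(truth, segments))  # 0.5
```

A retention rate well below 1.0 on the harder datasets would suggest that part of the measured difficulty comes from the pipeline rather than from the logs themselves. It does not cover ordering or context loss, which would need a separate check.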

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback on our manuscript. We have carefully considered the referee's major comment regarding the data construction pipeline and provide our response below, along with planned revisions to address the concerns.

Point-by-point responses
  1. Referee: [Abstract; data construction pipeline] Abstract and data construction pipeline description: The central claim that performance degrades due to increasing log complexity (MCC <0.5, rising FPR) depends on the pipeline faithfully preserving benign-malicious interactions without introducing artifacts via truncation, event selection, windowing, or formatting to fit context limits. No explicit validation (e.g., raw-input baselines or comparisons to traditional detectors) is described to rule out construction choices as the source of cross-dataset differences; this assumption is load-bearing for attributing results to inherent data properties rather than pipeline design.

    Authors: We agree that validating the pipeline is crucial to ensure that the observed performance differences across datasets can be attributed to variations in log complexity rather than artifacts introduced during data construction. Our pipeline applies consistent processing steps to all datasets to enable fair comparison, and the datasets themselves are established in the literature with known differences in noise and complexity levels. To address this point directly, we will revise the manuscript to include additional validation experiments. Specifically, we will report results from traditional HIDS approaches, such as signature-based or anomaly detection methods, applied to the same processed inputs. This will help demonstrate that the degradation in LLM performance on more complex datasets aligns with the inherent challenges of those datasets. We will also expand the description of the pipeline to detail how truncation and windowing were chosen to minimize information loss while respecting context constraints. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark on public datasets

full rationale

The paper presents an empirical evaluation of LLMs on host-based intrusion detection using three public datasets (DARPA-E3, DARPA-E5, NodLink) transformed via a described pipeline. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes appear in the abstract or described structure. Performance metrics (precision, MCC, FPR) are computed directly from model outputs on the constructed inputs, with no reduction of results to inputs by construction. The central claim of performance degradation on noisier data rests on observable differences across datasets rather than any self-referential loop. This is a standard benchmark study whose results are falsifiable against the same public data and independent of any internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the chosen datasets and the fidelity of the transformation pipeline; no free parameters or new entities are introduced beyond standard benchmark construction choices.

axioms (1)
  • domain assumption: The three public datasets (DARPA-E3, DARPA-E5, NodLink) together capture realistic levels of noise, imbalance, and benign-malicious interactions for host-based intrusion detection.
    The benchmark's conclusions about performance sensitivity rest on these datasets being sufficiently representative.

pith-pipeline@v0.9.0 · 5782 in / 1329 out tokens · 52741 ms · 2026-05-22T08:48:55.701990+00:00 · methodology

