HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
Pith reviewed 2026-05-22 08:48 UTC · model grok-4.3
The pith
LLMs achieve high precision on simple host logs for intrusion detection but degrade sharply as logs grow noisier and more complex.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that frontier LLMs exhibit substantial performance gaps across the unified datasets. While many models achieve high precision (often above 0.8) on simpler datasets, performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. Models fall into distinct regimes, such as conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. The results indicate that LLMs hold strong potential for host-based intrusion detection, yet their effectiveness remains highly sensitive to data complexity, making robust system design essential for reliable deployment.
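To make the relationship between these metrics concrete, the following sketch computes precision, false positive rate (FPR), and the Matthews correlation coefficient (MCC) from confusion-matrix counts. The counts are invented for illustration and are not results from the paper. They show how, on a heavily imbalanced log set, a detector can reach a precision of 0.9 while its MCC stays well below 0.5 because most malicious events are missed.

```python
import math


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, FPR, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # MCC uses all four cells, so it penalizes missed attacks even when
    # the few alerts that are raised are mostly correct.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "fpr": fpr, "mcc": mcc}


# Hypothetical counts: 10 alerts raised, 9 correct, but 100 attacks missed.
example = detection_metrics(tp=9, fp=1, tn=9890, fn=100)
print(example)
```

Because MCC accounts for false negatives and true negatives as well, it is a stricter summary than precision alone on imbalanced data, which is consistent with the paper reporting both.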
What carries the argument
The HIDBench benchmark and its data construction pipeline, which unifies DARPA-E3, DARPA-E5, and NodLink datasets and converts raw host telemetry into LLM-compatible inputs for systematic evaluation under realistic settings.
If this is right
- LLMs can support host-based intrusion detection but need robust surrounding systems to handle varying data complexity.
- Model behavior splits into conservative low-alert regimes and over-sensitive high-alert regimes depending on the input logs.
- High precision on clean datasets does not predict success when logs contain more overlapping benign and malicious activity.
- Deployment decisions for LLMs in intrusion detection must explicitly account for expected noise and imbalance levels.
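The split into conservative and over-sensitive regimes can be illustrated with a simple FPR-based rule. This is a minimal sketch only: the abstract does not state the metrics or thresholds the authors use to assign regimes, so the function name `classify_regime` and both cutoffs (`low_fpr`, `high_fpr`) are assumptions chosen for illustration.

```python
def classify_regime(fpr: float, low_fpr: float = 0.01, high_fpr: float = 0.10) -> str:
    """Label a detector by its false positive rate.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must be in [0, 1]")
    if fpr <= low_fpr:
        return "conservative"      # few alerts, low false positive rate
    if fpr >= high_fpr:
        return "over-sensitive"    # excessive alerts
    return "intermediate"


# Hypothetical per-model FPR values on one dataset.
for name, fpr in {"model-a": 0.002, "model-b": 0.05, "model-c": 0.31}.items():
    print(name, classify_regime(fpr))
```

In practice the cutoffs would need to be set per deployment, since an acceptable alert volume depends on how many events an analyst team can triage.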
Where Pith is reading between the lines
- Hybrid detectors that route simple logs to LLMs and complex logs to traditional rule-based methods could mitigate the observed drops.
- The benchmark setup could be reused to compare fine-tuned versus zero-shot LLMs on the same log collections.
- Similar evaluation pipelines might reveal whether the same complexity sensitivity appears in other log-driven security tasks such as anomaly detection.
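The hybrid-routing idea in the first bullet could look roughly like the sketch below. It is speculative in the same way the bullet is: the paper does not propose this design, and the choice of Shannon entropy over event types as a complexity proxy, the `max_entropy` threshold, and the names `event_entropy` and `route` are all assumptions made for illustration.

```python
import math
from collections import Counter


def event_entropy(event_types: list[str]) -> float:
    """Shannon entropy (bits) of the event-type distribution in a log window."""
    if not event_types:
        return 0.0
    total = len(event_types)
    counts = Counter(event_types)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def route(event_types: list[str], max_entropy: float = 1.5) -> str:
    """Send low-entropy windows to the LLM and the rest to rule-based detection.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return "llm" if event_entropy(event_types) <= max_entropy else "rules"


print(route(["read", "read", "read", "read"]))     # uniform window
print(route(["read", "write", "exec", "connect"]))  # mixed window
```

Entropy is only one possible proxy. Window length or the number of distinct processes would be equally plausible signals, and any of them would need validation against labelled data before use.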
Load-bearing premise
The data construction pipeline that transforms raw host telemetry into LLM-compatible inputs preserves the complex interactions between benign and malicious activities and does not introduce artifacts that artificially change detection difficulty.
What would settle it
If the same models were tested on a deliberately more complex and noisier variant of the same log collections and still maintained MCC above 0.5 with stable false-positive rates, the claimed sensitivity to data complexity would be directly challenged.
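The falsification test described here can be written as a small check over per-model results. This is a sketch under stated assumptions: the MCC floor of 0.5 comes from the text, but "stable false-positive rates" is not quantified in the source, so `max_fpr_increase` is a hypothetical tolerance, and the result keys (`mcc_noisy`, `fpr_base`, `fpr_noisy`) are illustrative names.

```python
def sensitivity_claim_challenged(
    results: dict[str, dict[str, float]],
    mcc_floor: float = 0.5,
    max_fpr_increase: float = 0.02,  # hypothetical definition of "stable"
) -> bool:
    """Return True if every model stays robust on the noisier log variant."""
    if not results:
        return False  # no evidence either way
    return all(
        r["mcc_noisy"] > mcc_floor
        and (r["fpr_noisy"] - r["fpr_base"]) <= max_fpr_increase
        for r in results.values()
    )


# Hypothetical measurements, not taken from the paper.
robust = {"model-a": {"mcc_noisy": 0.71, "fpr_base": 0.01, "fpr_noisy": 0.02}}
fragile = {"model-b": {"mcc_noisy": 0.34, "fpr_base": 0.01, "fpr_noisy": 0.15}}
print(sensitivity_claim_challenged(robust), sensitivity_claim_challenged(fragile))
```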
Original abstract
Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings. Our evaluation of frontier LLMs reveals substantial performance gaps across datasets. While many models achieve high precision (often above 0.8) on simpler datasets, their performance degrades significantly as system logs become noisier and more complex, with MCC frequently dropping below 0.5 and false positive rates increasing sharply. We further analyze model behavior and identify distinct regimes, including conservative detectors with low false positive rates and over-sensitive models that generate excessive alerts. Overall, our results highlight that while LLMs show strong potential for HIDS, their effectiveness is highly sensitive to data complexity, and robust system design is essential for reliable deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HIDBench, a new benchmark for assessing large language models (LLMs) on host-based intrusion detection (HIDS) from system logs. It unifies three public datasets (DARPA-E3, DARPA-E5, and NodLink), presents a data construction pipeline to convert raw host telemetry into LLM-compatible inputs, and evaluates frontier LLMs. The central empirical finding is that many models achieve high precision (often >0.8) on simpler datasets but degrade sharply on noisier and more complex logs, with MCC frequently dropping below 0.5 and false positive rates rising; the work also identifies distinct behavioral regimes such as conservative low-FPR detectors versus over-sensitive models.
Significance. If the results hold after addressing pipeline validation, the benchmark fills an important gap in LLM evaluation for cybersecurity by focusing on fine-grained reasoning over noisy, imbalanced logs. The unification of multiple datasets and the identification of performance sensitivity to complexity provide actionable insights for deploying LLMs in HIDS, highlighting the need for robust system design rather than direct model use.
major comments (1)
- [Abstract; data construction pipeline] Abstract and data construction pipeline description: The central claim that performance degrades due to increasing log complexity (MCC <0.5, rising FPR) depends on the pipeline faithfully preserving benign-malicious interactions without introducing artifacts via truncation, event selection, windowing, or formatting to fit context limits. No explicit validation (e.g., raw-input baselines or comparisons to traditional detectors) is described to rule out construction choices as the source of cross-dataset differences; this assumption is load-bearing for attributing results to inherent data properties rather than pipeline design.
minor comments (1)
- [Abstract] The abstract mentions 'distinct regimes' of model behavior but does not specify the exact metrics or thresholds used to classify conservative vs. over-sensitive detectors; adding a brief definition or reference to the relevant results table would improve clarity.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We have carefully considered the referee's major comment regarding the data construction pipeline and provide our response below, along with planned revisions to address the concerns.
Point-by-point responses
- Referee: [Abstract; data construction pipeline] Abstract and data construction pipeline description: The central claim that performance degrades due to increasing log complexity (MCC <0.5, rising FPR) depends on the pipeline faithfully preserving benign-malicious interactions without introducing artifacts via truncation, event selection, windowing, or formatting to fit context limits. No explicit validation (e.g., raw-input baselines or comparisons to traditional detectors) is described to rule out construction choices as the source of cross-dataset differences; this assumption is load-bearing for attributing results to inherent data properties rather than pipeline design.
Authors: We agree that validating the pipeline is crucial to ensure that the observed performance differences across datasets can be attributed to variations in log complexity rather than artifacts introduced during data construction. Our pipeline applies consistent processing steps to all datasets to enable fair comparison, and the datasets themselves are established in the literature with known differences in noise and complexity levels. To address this point directly, we will revise the manuscript to include additional validation experiments. Specifically, we will report results from traditional HIDS approaches, such as signature-based or anomaly detection methods, applied to the same processed inputs. This will help demonstrate that the degradation in LLM performance on more complex datasets aligns with the inherent challenges of those datasets. We will also expand the description of the pipeline to detail how truncation and windowing were chosen to minimize information loss while respecting context constraints. revision: yes
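The referee's concern about truncation artifacts can be illustrated with a toy check that measures how many labelled malicious events survive a given construction choice. This is not the paper's pipeline, which is not described in enough detail here to reproduce. The event representation (a text field plus a boolean label) and the helper names `truncate`, `chunk`, and `malicious_retained` are assumptions for illustration.

```python
Event = tuple[str, bool]  # (event text, is_malicious); hypothetical schema


def truncate(events: list[Event], max_events: int) -> list[Event]:
    """Keep only the first max_events events to fit a context budget."""
    return events[:max_events]


def chunk(events: list[Event], size: int) -> list[list[Event]]:
    """Split the log into consecutive windows so that no event is dropped."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [events[i:i + size] for i in range(0, len(events), size)]


def malicious_retained(original: list[Event], kept: list[Event]) -> float:
    """Fraction of malicious events in the original log that remain."""
    total = sum(1 for _, bad in original if bad)
    if total == 0:
        return 1.0
    return sum(1 for _, bad in kept if bad) / total


# Toy log: ten events with the attack activity late in the sequence.
log = [(f"event-{i}", i in (7, 8)) for i in range(10)]
after_truncation = truncate(log, 5)
after_chunking = [e for window in chunk(log, 5) for e in window]
print(malicious_retained(log, after_truncation))  # attack falls outside the cut
print(malicious_retained(log, after_chunking))    # all events preserved
```

A retention figure of this kind, reported per dataset, is one way a pipeline description could show that cross-dataset differences are not driven by where the context limit happens to cut.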
Circularity Check
No circularity: direct empirical benchmark on public datasets
Full rationale
The paper presents an empirical evaluation of LLMs on host-based intrusion detection using three public datasets (DARPA-E3, DARPA-E5, NodLink) transformed via a described pipeline. No mathematical derivations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes appear in the abstract or described structure. Performance metrics (precision, MCC, FPR) are computed directly from model outputs on the constructed inputs, with no reduction of results to inputs by construction. The central claim of performance degradation on noisier data rests on observable differences across datasets rather than any self-referential loop. This is a standard benchmark study whose results are falsifiable against the same public data and independent of any internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three public datasets (DARPA-E3, DARPA-E5, NodLink) together capture realistic levels of noise, imbalance, and benign-malicious interactions for host-based intrusion detection.