LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics
Pith reviewed 2026-05-10 15:36 UTC · model grok-4.3
The pith
Prompt-based large language models can detect anomalies in system logs with competitive accuracy without any labeled training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through evaluation on the HDFS, BGL, Thunderbird, and Spirit log datasets, the work shows that prompt-based LLM approaches in zero-shot and few-shot settings achieve F1 scores from 0.82 to 0.91, close to the 0.96-0.99 of fine-tuned transformers like BERT and RoBERTa, yet without requiring any labeled training data for the anomaly detection task.
What carries the argument
Prompt-based application of LLMs such as GPT-3.5, GPT-4, and LLaMA-3 to classify parsed log sequences as anomalous or normal.
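To make the mechanism concrete, here is a minimal sketch of zero-shot prompt classification. It is an illustration under stated assumptions, not the paper's method: the text above does not show the authors' prompt templates, so `PROMPT_TEMPLATE`, `build_prompt`, `parse_label`, `classify`, and `fake_llm` are all hypothetical names, and `fake_llm` is a keyword stub standing in for a real model call such as GPT-4 or LLaMA-3.

```python
from typing import Callable, List

# Hypothetical template for illustration only; the paper's actual prompts
# are not reproduced in this text.
PROMPT_TEMPLATE = (
    "You are a system reliability assistant.\n"
    "Classify the following log sequence as NORMAL or ANOMALOUS.\n"
    "Answer with a single word.\n\n"
    "Log sequence:\n{logs}\n\nAnswer:"
)


def build_prompt(log_lines: List[str]) -> str:
    """Join parsed log lines into one zero-shot classification prompt."""
    return PROMPT_TEMPLATE.format(logs="\n".join(log_lines))


def parse_label(reply: str) -> int:
    """Map free-text model output to 1 (anomalous) or 0 (normal).

    Replies that mention neither label fall back to 0 (normal).
    """
    return 1 if "ANOMAL" in reply.strip().upper() else 0


def classify(log_lines: List[str], llm: Callable[[str], str]) -> int:
    """Classify one log sequence with any callable that maps prompt -> reply."""
    return parse_label(llm(build_prompt(log_lines)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: flags sequences mentioning an exception."""
    logs = prompt.split("Log sequence:\n", 1)[1]
    return "ANOMALOUS" if "Exception" in logs else "NORMAL"


if __name__ == "__main__":
    sequence = ["Receiving block blk_1", "java.io.IOException: Connection reset"]
    print(classify(sequence, fake_llm))
```

Passing the model as a plain callable keeps the prompt construction and the label parsing testable without network access, and makes it easy to swap in different models for a comparison like the one the paper reports.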
If this is right
- Zero-shot LLM methods become practical for environments where collecting labeled anomaly data is difficult or expensive.
- Fine-tuned models remain preferable when high accuracy is critical and sufficient labels exist.
- Practitioners gain guidelines to select methods based on accuracy needs, latency limits, cost budgets, and label availability.
- Understanding failure modes allows better mitigation strategies for each detection approach.
Where Pith is reading between the lines
- The zero-shot results suggest LLMs may handle evolving log formats better than trained models, since no retraining is needed.
- Testing on live production logs with real-time constraints could reveal additional performance aspects.
- Hybrid systems that combine LLM prompts with traditional parsers might yield even stronger results.
- Broader adoption could reduce reliance on domain-specific feature engineering in log analysis.
Load-bearing premise
That the performance observed on the four chosen public datasets with the specific prompts used will generalize to other log data sources and prompt variations in actual deployments.
What would settle it
Evaluating the zero-shot LLM prompts on a different set of system logs not included in the benchmark and finding substantially lower F1 scores would indicate the results may not hold more broadly.
Original abstract
System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkablezero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data -- a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a benchmark study of log anomaly detection methods on four public datasets (HDFS, BGL, Thunderbird, Spirit). It evaluates three categories: classical log parsers (Drain, Spell, AEL) paired with ML classifiers, fine-tuned transformers (BERT, RoBERTa), and prompt-based LLMs (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. The central claims are that fine-tuned transformers achieve the highest F1 scores (0.96-0.99), while prompt-based LLMs deliver strong zero-shot performance (F1 0.82-0.91) without any labeled training data, offering practical advantages where labels are scarce; the work also analyzes cost-accuracy trade-offs, latency, and failure modes, supplies practitioner guidelines, and releases code for reproducibility.
Significance. If the experimental claims hold under scrutiny, the paper makes a useful contribution by filling the gap in systematic comparisons of LLM-based log anomaly detection against established baselines. The emphasis on zero-shot LLM performance in label-scarce regimes is practically relevant for real-world system diagnostics, and the cost/latency analysis plus public code release strengthen its utility for practitioners. The benchmark framing could inform method selection under different constraints, though its impact depends on the robustness and transparency of the prompt and evaluation protocols.
major comments (2)
- [§4 (Experiments)] Experiments and associated prompt templates: The headline claim that prompt-based LLMs achieve F1 0.82-0.91 in true zero-shot settings without labeled data is load-bearing for the paper's practical advantage argument, yet the manuscript provides no explicit prompt templates, selection procedure, or confirmation that no iterative refinement occurred on the evaluation splits of HDFS/BGL/Thunderbird/Spirit. Without this disclosure, the zero-shot characterization cannot be verified and the reported advantage over label-dependent methods is not yet defensible.
- [§4.1 (Datasets and splits)] Datasets and splits: No information is given on train/test splits, whether any validation set was used for prompt or hyperparameter choices, or statistical significance testing of F1 differences across methods. These details are required to establish that the LLM zero-shot results are not artifacts of particular data partitions or post-hoc selection, directly affecting the reliability of the cross-method comparison.
minor comments (2)
- [Abstract] Typo in 'remarkablezero-shot' (missing space).
- [§5 (Analysis)] The failure-mode analysis is mentioned but lacks concrete examples or a quantitative breakdown in the provided text; adding a dedicated subsection or table would improve clarity.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback, which highlights important aspects for ensuring the transparency and verifiability of our benchmark results. We have carefully considered each comment and provide point-by-point responses below. We will incorporate revisions to address the concerns regarding experimental details and reproducibility.
- Referee: [§4 (Experiments)] Experiments and associated prompt templates: The headline claim that prompt-based LLMs achieve F1 0.82-0.91 in true zero-shot settings without labeled data is load-bearing for the paper's practical advantage argument, yet the manuscript provides no explicit prompt templates, selection procedure, or confirmation that no iterative refinement occurred on the evaluation splits of HDFS/BGL/Thunderbird/Spirit. Without this disclosure, the zero-shot characterization cannot be verified and the reported advantage over label-dependent methods is not yet defensible.
Authors: We agree that explicit disclosure of the prompt templates and selection procedure is necessary to substantiate the zero-shot performance claims. In the submitted manuscript, we provided a high-level overview of the prompting approach in §4 and made the full code, including exact prompt templates, publicly available as noted in the abstract. No iterative refinement on the evaluation splits was conducted; the prompts were developed based on general principles of log parsing and anomaly indicators without any access to or tuning on the test data. To enhance verifiability, we will include the complete prompt templates for all LLMs in an appendix of the revised version, along with a step-by-step description of how they were constructed and a clear statement confirming the absence of any post-hoc adjustments on the evaluation sets. This revision will directly address the concern and allow independent verification of the zero-shot setting.
Revision: yes
- Referee: [§4.1 (Datasets and splits)] Datasets and splits: No information is given on train/test splits, whether any validation set was used for prompt or hyperparameter choices, or statistical significance testing of F1 differences across methods. These details are required to establish that the LLM zero-shot results are not artifacts of particular data partitions or post-hoc selection, directly affecting the reliability of the cross-method comparison.
Authors: We acknowledge that the manuscript could have been more explicit about the data splits and related procedures. The experiments adhere to the standard splits established in the original dataset papers and widely used in prior log anomaly detection studies (e.g., specific file-based divisions for HDFS and chronological splits for BGL, Thunderbird, and Spirit). No validation set was employed for prompt selection or hyperparameter tuning in the LLM experiments, consistent with the zero-shot paradigm. Hyperparameters for classical parsers and fine-tuned transformers were taken from the values reported in their source publications. In the revised manuscript, we will add a detailed subsection in §4.1 explicitly describing the train/test splits, preprocessing, and confirmation that no validation-based selection occurred for the LLM methods. Additionally, we will incorporate statistical significance testing, such as reporting p-values from appropriate tests (e.g., McNemar's test for paired comparisons) or confidence intervals based on multiple runs where feasible, to quantify the reliability of the F1 differences. These additions will improve the robustness of the comparisons.
Revision: yes
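The paired comparison named in the response can be sketched as follows. This is a minimal illustration of an exact (binomial) McNemar test, assuming both detectors are scored on the same test sequences; the function name `mcnemar_exact` and the toy data are hypothetical, and nothing here comes from the paper's code. Note that McNemar's test compares per-example correctness of two classifiers, so it complements an F1 comparison rather than testing the F1 gap directly.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> float:
    """Two-sided exact McNemar p-value for two classifiers on one test set."""
    # b: examples where A is right and B is wrong; c: the reverse.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return 1.0  # the two classifiers never disagree on correctness
    k = min(b, c)
    # Under H0 each discordant example favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data: detector A is correct on all 10 sequences, B misses the first 5.
    y_true = [1] * 10
    pred_a = [1] * 10
    pred_b = [0] * 5 + [1] * 5
    print(mcnemar_exact(y_true, pred_a, pred_b))
```

Only the discordant pairs enter the statistic, which is why the test needs per-sequence predictions from both methods on identical test splits.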
Circularity Check
No circularity: purely empirical benchmark on public datasets
This is an empirical benchmark study that directly measures F1 scores of LLM prompt-based methods (zero-shot/few-shot) and baselines on four fixed public datasets (HDFS, BGL, Thunderbird, Spirit) using standard metrics. No mathematical derivations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text. The central claim of zero-shot capability is an observed performance number on held-out test splits, not a quantity derived from or equivalent to any input by construction. All code is stated to be public, making the results externally reproducible without reliance on self-citation chains or internal fits.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: F1-score is a suitable primary metric for evaluating log anomaly detection on imbalanced data.
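A small worked example shows why this assumption matters on imbalanced data. The sketch below uses invented toy labels (5 anomalous sequences out of 100), not numbers from the paper, and the helper names `precision_recall_f1` and `accuracy` are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (anomalous = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of sequences labelled correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


if __name__ == "__main__":
    # Toy data: 5 anomalous sequences followed by 95 normal ones.
    y_true = [1] * 5 + [0] * 95
    always_normal = [0] * 100  # a detector that never raises an alert
    print(accuracy(y_true, always_normal))
    print(precision_recall_f1(y_true, always_normal))
```

On this toy data a detector that never raises an alert reaches 95% accuracy while its F1 is zero, which is the usual argument for reporting F1 on the anomalous class when anomalies are rare.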