arxiv: 2604.12218 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.SE

LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics

Pith reviewed 2026-05-10 15:36 UTC · model grok-4.3

keywords log anomaly detection · large language models · zero-shot · benchmark · system logs · transformers · few-shot learning

The pith

Prompt-based large language models can detect anomalies in system logs with competitive accuracy without any labeled training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a benchmark to compare methods for finding unusual patterns in software system logs. It tests classical parsing plus machine learning, fine-tuned language models, and prompt-based large language models on four standard datasets. The key result is that prompt-based methods using GPT and LLaMA models reach F1 scores of 0.82 to 0.91 in zero-shot mode, meaning they work without training on labeled examples. This is valuable because many real systems lack such labeled anomaly data, making deployment easier. The study also examines trade-offs in accuracy, speed, and cost to guide practical choices.

Core claim

Through evaluation on the HDFS, BGL, Thunderbird, and Spirit log datasets, the work shows that prompt-based LLM approaches achieve zero-shot F1 scores from 0.82 to 0.91. That is below the 0.96-0.99 of fine-tuned transformers like BERT and RoBERTa, but it requires no labeled training data for the anomaly detection task.

What carries the argument

Prompt-based application of LLMs such as GPT-3.5, GPT-4, and LLaMA-3 to classify parsed log sequences as anomalous or normal.
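A minimal sketch of what such prompt-based classification can look like follows. The prompt wording, the function names, and the `fake_llm` stand-in are all assumptions made for illustration; the paper's actual prompt templates are not reproduced on this page, and a real run would replace `fake_llm` with a call to a hosted or local model.

```python
from typing import Callable, Sequence

# Hypothetical prompt wording, NOT the paper's template (which is not shown here).
PROMPT_TEMPLATE = (
    "You are a system reliability assistant.\n"
    "Below is a sequence of parsed log events from one session.\n"
    "Answer with exactly one word: NORMAL or ANOMALOUS.\n\n"
    "{events}\n\n"
    "Answer:"
)


def build_prompt(events: Sequence[str]) -> str:
    """Number the parsed log events and place them in the zero-shot prompt."""
    numbered = "\n".join(f"{i + 1}. {event}" for i, event in enumerate(events))
    return PROMPT_TEMPLATE.format(events=numbered)


def parse_label(reply: str) -> bool:
    """Return True if the reply's first word says anomalous.

    Empty or unparseable replies are treated as normal (an assumption;
    a stricter pipeline might retry or count them separately).
    """
    words = reply.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    return first in {"ANOMALOUS", "ANOMALY"}


def classify(events: Sequence[str], llm: Callable[[str], str]) -> bool:
    """Zero-shot classification: no labeled examples appear in the prompt."""
    return parse_label(llm(build_prompt(events)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "ANOMALOUS" if "Exception" in prompt else "NORMAL"


if __name__ == "__main__":
    session = ["Receiving block <*>", "Exception in receiveBlock <*>"]
    print(classify(session, fake_llm))
```

A few-shot variant would prepend a handful of labeled example sessions to the prompt, which is why it no longer counts as label-free.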

If this is right

  • Zero-shot LLM methods become practical for environments where collecting labeled anomaly data is difficult or expensive.
  • Fine-tuned models remain preferable when high accuracy is critical and sufficient labels exist.
  • Practitioners gain guidelines to select methods based on accuracy needs, latency limits, cost budgets, and label availability.
  • Understanding failure modes allows better mitigation strategies for each detection approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The zero-shot results suggest LLMs may handle evolving log formats better, since no retraining is needed.
  • Testing on live production logs with real-time constraints could reveal additional performance aspects.
  • Hybrid systems that combine LLM prompts with traditional parsers might yield even stronger results.
  • Broader adoption could reduce reliance on domain-specific feature engineering in log analysis.

Load-bearing premise

That the performance observed on the four chosen public datasets with the specific prompts used will generalize to other log data sources and prompt variations in actual deployments.

What would settle it

Evaluating the zero-shot LLM prompts on a different set of system logs not included in the benchmark and finding substantially lower F1 scores would indicate the results may not hold more broadly.

Original abstract

System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkablezero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data -- a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a benchmark study of log anomaly detection methods on four public datasets (HDFS, BGL, Thunderbird, Spirit). It evaluates three categories: classical log parsers (Drain, Spell, AEL) paired with ML classifiers, fine-tuned transformers (BERT, RoBERTa), and prompt-based LLMs (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. The central claims are that fine-tuned transformers achieve the highest F1 scores (0.96-0.99), while prompt-based LLMs deliver strong zero-shot performance (F1 0.82-0.91) without any labeled training data, offering practical advantages where labels are scarce; the work also analyzes cost-accuracy trade-offs, latency, and failure modes, supplies practitioner guidelines, and releases code for reproducibility.

Significance. If the experimental claims hold under scrutiny, the paper makes a useful contribution by filling the gap in systematic comparisons of LLM-based log anomaly detection against established baselines. The emphasis on zero-shot LLM performance in label-scarce regimes is practically relevant for real-world system diagnostics, and the cost/latency analysis plus public code release strengthen its utility for practitioners. The benchmark framing could inform method selection under different constraints, though its impact depends on the robustness and transparency of the prompt and evaluation protocols.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and associated prompt templates: The headline claim that prompt-based LLMs achieve F1 0.82-0.91 in true zero-shot settings without labeled data is load-bearing for the paper's practical advantage argument, yet the manuscript provides no explicit prompt templates, selection procedure, or confirmation that no iterative refinement occurred on the evaluation splits of HDFS/BGL/Thunderbird/Spirit. Without this disclosure, the zero-shot characterization cannot be verified and the reported advantage over label-dependent methods is not yet defensible.
  2. [§4.1 (Datasets and splits)] §4.1 (Datasets and splits): No information is given on train/test splits, whether any validation set was used for prompt or hyperparameter choices, or statistical significance testing of F1 differences across methods. These details are required to establish that the LLM zero-shot results are not artifacts of particular data partitions or post-hoc selection, directly affecting the reliability of the cross-method comparison.
minor comments (2)
  1. [Abstract] Abstract: Typo in 'remarkablezero-shot' (missing space).
  2. [§5 (Analysis)] The description of failure-mode analysis is mentioned but lacks concrete examples or quantitative breakdown in the provided text; adding a dedicated subsection or table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback, which highlights important aspects for ensuring the transparency and verifiability of our benchmark results. We have carefully considered each comment and provide point-by-point responses below. We will incorporate revisions to address the concerns regarding experimental details and reproducibility.

  1. Referee: [§4 (Experiments)] §4 (Experiments) and associated prompt templates: The headline claim that prompt-based LLMs achieve F1 0.82-0.91 in true zero-shot settings without labeled data is load-bearing for the paper's practical advantage argument, yet the manuscript provides no explicit prompt templates, selection procedure, or confirmation that no iterative refinement occurred on the evaluation splits of HDFS/BGL/Thunderbird/Spirit. Without this disclosure, the zero-shot characterization cannot be verified and the reported advantage over label-dependent methods is not yet defensible.

    Authors: We agree that explicit disclosure of the prompt templates and selection procedure is necessary to substantiate the zero-shot performance claims. In the submitted manuscript, we provided a high-level overview of the prompting approach in §4 and made the full code, including exact prompt templates, publicly available as noted in the abstract. No iterative refinement on the evaluation splits was conducted; the prompts were developed based on general principles of log parsing and anomaly indicators without any access to or tuning on the test data. To enhance verifiability, we will include the complete prompt templates for all LLMs in an appendix of the revised version, along with a step-by-step description of how they were constructed and a clear statement confirming the absence of any post-hoc adjustments on the evaluation sets. This revision will directly address the concern and allow independent verification of the zero-shot setting. revision: yes

  2. Referee: [§4.1 (Datasets and splits)] §4.1 (Datasets and splits): No information is given on train/test splits, whether any validation set was used for prompt or hyperparameter choices, or statistical significance testing of F1 differences across methods. These details are required to establish that the LLM zero-shot results are not artifacts of particular data partitions or post-hoc selection, directly affecting the reliability of the cross-method comparison.

    Authors: We acknowledge that the manuscript could have been more explicit about the data splits and related procedures. The experiments adhere to the standard splits established in the original dataset papers and widely used in prior log anomaly detection studies (e.g., specific file-based divisions for HDFS and chronological splits for BGL, Thunderbird, and Spirit). No validation set was employed for prompt selection or hyperparameter tuning in the LLM experiments, consistent with the zero-shot paradigm. Hyperparameters for classical parsers and fine-tuned transformers were taken from the values reported in their source publications. In the revised manuscript, we will add a detailed subsection in §4.1 explicitly describing the train/test splits, preprocessing, and confirmation that no validation-based selection occurred for the LLM methods. Additionally, we will incorporate statistical significance testing, such as reporting p-values from appropriate tests (e.g., McNemar's test for paired comparisons) or confidence intervals based on multiple runs where feasible, to quantify the reliability of the F1 differences. These additions will improve the robustness of the comparisons. revision: yes
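As a rough illustration of the paired comparison mentioned in the second response, the sketch below runs an exact (binomial) McNemar's test on two detectors' predictions over the same test items. The function name and toy data are assumptions, not taken from the paper. Note that McNemar's test compares per-item correctness, so it tests for a difference in error rates rather than in F1 directly.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar's test on paired predictions.

    Returns (b, c, p_value), where b counts items only method A gets right
    and c counts items only method B gets right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favors A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    y_true = [1, 0, 1, 0, 1, 0, 1, 0]
    pred_a = [1, 0, 1, 0, 1, 0, 1, 0]   # toy detector A: all correct
    pred_b = [0, 1, 0, 1, 0, 1, 1, 0]   # toy detector B: wrong on first six
    print(mcnemar_exact(y_true, pred_a, pred_b))
```

The exact form is used here because discordant counts can be small; with many discordant pairs the chi-squared approximation is a common alternative.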

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark on public datasets

This is an empirical benchmark study that directly measures F1 scores of LLM prompt-based methods (zero-shot/few-shot) and baselines on four fixed public datasets (HDFS, BGL, Thunderbird, Spirit) using standard metrics. No mathematical derivations, fitted parameters, ansatzes, or uniqueness theorems appear in the provided text. The central claim of zero-shot capability is an observed performance number on held-out test splits, not a quantity derived from or equivalent to any input by construction. All code is stated to be public, making the results externally reproducible without reliance on self-citation chains or internal fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmark relying on standard machine-learning evaluation practices and publicly available datasets; no new theoretical constructs or fitted parameters are introduced.

axioms (1)
  • domain assumption: F1-score is a suitable primary metric for evaluating log anomaly detection on imbalanced data
    Common practice in the field but assumes equal weighting of precision and recall is appropriate for all operational contexts.
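To make the equal-weighting assumption concrete, the sketch below computes precision, recall, and the general F-beta score from confusion counts; F1 is the special case beta = 1. The counts are made up for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily, beta < 1 precision."""
    p, r = precision_recall(tp, fp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Illustrative counts only: 80 anomalies caught, 20 false alarms, 40 missed.
    tp, fp, fn = 80, 20, 40
    print(round(f_beta(tp, fp, fn), 4))            # F1
    print(round(f_beta(tp, fp, fn, beta=2.0), 4))  # F2, recall-weighted
```

With these toy counts the detector misses more anomalies than it raises false alarms, so the recall-weighted F2 comes out lower than F1. An operator who cares most about missed incidents would rank methods differently than F1 alone suggests.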
