FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Alberto Leon-Garcia; Hans-Arno Jacobsen; Huanchi Wang; Kristina Dzeparoska; Yifang Tian; Zihang Huang

arxiv: 2605.22779 · v1 · pith:NIJ7IMECnew · submitted 2026-05-21 · 💻 cs.SE · cs.LG


Pith reviewed 2026-05-22 03:34 UTC · model grok-4.3


keywords: log anomaly detection · mixture of experts · message-level detection · label-efficient learning · failure domains · LLM-assisted partitioning · production system logs


The pith

FAME trains a router and domain experts on at most K labels per log template plus one LLM-proposed failure-domain partition to detect anomalies at the individual message level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FAME as a way to move log anomaly detection from coarse session or window flags down to precise message-level calls. It keeps an LLM out of the runtime loop by using it once offline to suggest and certify a grouping of templates into failure domains, then annotates a small number of examples per template. A lightweight router learns to send each new message to the right expert, which outputs both an anomaly score and a domain label. This setup targets the practical problems of heterogeneous subsystems and templates that sometimes produce normal output and sometimes failures. If the approach holds, operators would receive alerts that point to the exact problematic line instead of having to scan many routine ones.

Core claim

FAME is a label-efficient, message-level mixture-of-experts framework. It annotates at most K lines per template to derive binary normal/anomaly indicators and representative examples, has an LLM propose a partition of templates into failure domains that is then certified, and trains a lightweight router plus domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL it reaches F1 = 98.16 at K = 100, a 76x reduction in annotation effort, while detecting 86.3 percent of anomalies from unseen EventIDs. On Thunderbird it reaches F1 = 99.95 with perfect recall.

What carries the argument

A router that directs each incoming log message to one of several domain-specific expert models, where the domains come from an LLM-proposed and certified partition of log templates into failure categories, trained from binary labels on at most K examples per template.
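The routing idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `route_and_score`, `toy_router`, and `toy_experts`, the keyword rules, and the 0.5 threshold are all hypothetical stand-ins for the trained router and experts.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Prediction:
    domain: str           # failure-domain label chosen by the router
    anomaly_score: float  # score produced by that domain's expert
    is_anomaly: bool      # score compared against a threshold


def route_and_score(
    message: str,
    router: Callable[[str], str],
    experts: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Prediction:
    """Send one log message to exactly one expert and return its verdict."""
    domain = router(message)
    score = experts[domain](message)
    return Prediction(domain, score, score >= threshold)


# Toy stand-ins: keyword rules in place of learned models.
def toy_router(message: str) -> str:
    return "memory" if "memory" in message.lower() else "network"


toy_experts = {
    "memory": lambda m: 0.9 if "uncorrectable" in m.lower() else 0.1,
    "network": lambda m: 0.8 if "timeout" in m.lower() else 0.2,
}

if __name__ == "__main__":
    print(route_and_score("Uncorrectable memory error on node 7", toy_router, toy_experts))
```

Each message reaches exactly one expert, so the output carries both the anomaly decision and the failure-domain label in a single pass.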

If this is right

Message-level predictions would reduce the number of routine log lines an operator must inspect per alert.
The model would continue to flag anomalies even when they appear under previously unseen EventIDs.
Annotation budgets could drop by roughly 76x while still producing F1 scores above 98 on standard benchmarks.
Failure-domain labels would accompany each detection, giving operators immediate context about the subsystem involved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same router-plus-experts structure could be tested on other heterogeneous log sources such as network device logs or application traces without changing the core training procedure.
If the certification step for the LLM partition is replaced by a simple majority vote from a small set of human reviewers, the framework might still retain most of its accuracy gain.
Running the experts in parallel on a multi-core server would allow real-time scoring of high-volume streams while keeping per-message latency low.

Load-bearing premise

Annotating at most K lines per template plus an LLM-proposed and certified partition of templates into failure domains supplies enough signal to train a router and experts that generalize to message-level detection across heterogeneous subsystems and unseen EventIDs.
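The "at most K lines per template" budget can be made concrete with a small sampling sketch. The function name `sample_k_per_template`, the `(template_id, message)` input shape, and uniform random sampling are assumptions for illustration; the abstract does not say how the K lines are chosen.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_k_per_template(
    lines: Iterable[Tuple[str, str]],
    k: int,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Group parsed lines by template and keep at most k per template.

    `lines` yields (template_id, message) pairs, e.g. from a log parser.
    Templates with k or fewer messages are kept whole; larger ones are
    subsampled uniformly at random (an assumed strategy).
    """
    rng = random.Random(seed)
    by_template: Dict[str, List[str]] = defaultdict(list)
    for template_id, message in lines:
        by_template[template_id].append(message)
    return {
        template_id: msgs if len(msgs) <= k else rng.sample(msgs, k)
        for template_id, msgs in by_template.items()
    }


if __name__ == "__main__":
    parsed = [("E1", f"msg {i}") for i in range(5)] + [("E2", "a"), ("E2", "b")]
    budget = sample_k_per_template(parsed, k=3)
    print({t: len(m) for t, m in budget.items()})
```

The annotation cost is then bounded by K times the number of templates, regardless of how many raw lines the system emits.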

What would settle it

A new log dataset in which the trained router and experts miss more than 20 percent of anomalies from previously unseen EventIDs even after using the stated K annotations per template would falsify the generalization claim.
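The falsification test above reduces to a recall computation restricted to anomalies whose EventID was absent from training. The sketch below assumes records arrive as `(event_id, true_label, predicted_label)` tuples; the function names are hypothetical, and the 20 percent bound is the one stated in the paragraph above.

```python
from typing import Iterable, Optional, Set, Tuple


def unseen_eventid_recall(
    records: Iterable[Tuple[str, bool, bool]],
    seen_event_ids: Set[str],
) -> Optional[float]:
    """Recall over true anomalies whose EventID was not seen in training.

    Returns None when the evaluation set holds no such anomalies.
    """
    hits = total = 0
    for event_id, is_anomaly_true, is_anomaly_pred in records:
        if is_anomaly_true and event_id not in seen_event_ids:
            total += 1
            hits += int(is_anomaly_pred)
    return hits / total if total else None


def generalization_claim_fails(recall: Optional[float], max_miss_rate: float = 0.20) -> bool:
    """True when more than max_miss_rate of unseen-EventID anomalies are missed."""
    return recall is not None and (1.0 - recall) > max_miss_rate


if __name__ == "__main__":
    records = [
        ("E9", True, True),    # unseen EventID, detected
        ("E9", True, False),   # unseen EventID, missed
        ("E1", True, False),   # seen EventID, ignored by this metric
        ("E8", False, True),   # normal line, ignored by this metric
    ]
    recall = unseen_eventid_recall(records, seen_event_ids={"E1"})
    print(recall, generalization_claim_fails(recall))
```

Under this definition the reported 86.3 percent detection rate corresponds to a miss rate of 13.7 percent, which sits below the 20 percent bound.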

Figures

Figures reproduced from arXiv: 2605.22779 by Alberto Leon-Garcia, Hans-Arno Jacobsen, Huanchi Wang, Kristina Dzeparoska, Yifang Tian, Zihang Huang.

**Figure 2.** FAME system architecture. (a) Offline setup: raw logs are parsed by Drain3, K-shot labels are sampled, an LLM proposes a failure-domain partition that is then certified, and two-phase BERT experts are trained alongside a DistilBERT gate and selector. This stage is executed once. (b) Online inference: the trained router directs each incoming log line to the appropriate expert. All inference runs on-premise …

**Figure 3.** K-sensitivity on BGL and Thunderbird with best result at each K.

**Figure 4.** LLM grouping sensitivity on BGL and Thunderbird ( …

**Figure 5.** Case study with ambiguous keyword ’FATAL’

Original abstract

Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window-level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging. A single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100 reducing annotation effort by 76x and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.
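The Thunderbird figures in the abstract pin down the precision as well. Assuming F1 is the usual harmonic mean of precision P and recall R on the anomaly class, F1 = 2PR / (P + R) rearranges to P = F1 · R / (2R − F1). The helper below is an illustrative sketch with a hypothetical name; it ignores rounding in the reported numbers.

```python
def precision_from_f1_and_recall(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, with all values as fractions in [0, 1]."""
    denominator = 2.0 * recall - f1
    if denominator <= 0.0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denominator


if __name__ == "__main__":
    # Thunderbird: F1 = 99.95 with perfect recall, as stated in the abstract.
    print(f"{precision_from_f1_and_recall(0.9995, 1.0):.4f}")
```

With R = 1 and F1 = 0.9995 this gives P = 0.9995 / 1.0005, or roughly 99.90 percent, so about one alert in a thousand would be a false positive under these assumptions.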

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FAME gives a practical label-efficient pipeline for message-level log anomaly detection by routing through LLM-derived failure domains, but the abstract leaves the robustness of that routing unproven.


The main thing to know is that this paper builds a mixture-of-experts model that does message-level anomaly detection on logs after one offline LLM step. At most K lines per template are annotated as normal or anomalous, and the LLM proposes a partition of templates into failure domains. A certification step follows, then a router and per-domain experts are trained to run on-premise. On BGL the authors report F1 of 98.16 at K=100 with a claimed 76x drop in labeling and 86.3% detection of anomalies from unseen EventIDs. On Thunderbird the numbers are higher still: F1 of 99.95 with perfect recall. That combination of reduced labeling and finer-grained output is the concrete advance over prior session-level work.

Referee Report

2 major / 2 minor

Summary. The paper introduces FAME, a label-efficient mixture-of-experts framework for message-level log anomaly detection. An LLM is used once offline to propose a partition of log templates into failure domains after annotating at most K lines per template for binary labels and examples. A router and per-domain experts are then trained to produce anomaly predictions and failure-domain labels at inference time. On the BGL dataset, FAME reports F1=98.16 at K=100 (76x annotation reduction) and detects 86.3% of anomalies from unseen EventIDs; on Thunderbird it reaches F1=99.95 with perfect recall.

Significance. If the central claims hold, the work offers a practical advance in log anomaly detection by shifting from coarse session/window-level alerts to message-level granularity while keeping labeling costs low and inference on-premise. The offline-LLM-plus-lightweight-MoE design addresses cost and heterogeneity issues that limit prior approaches.

major comments (2)

[Experimental Evaluation] Experimental section: the headline generalization result (86.3% detection of anomalies from unseen EventIDs on BGL) rests on the LLM-proposed failure-domain partition supplying sufficient signal for the router and experts. No ablation is reported that isolates this partition against random grouping or template-ID-based grouping, leaving open whether the reported transfer performance is attributable to the proposed domains or would arise from any reasonable clustering.
[Methodology] Methodology, certification paragraph: the description of the certification step does not specify what properties are checked (internal consistency within domains versus cross-template transfer to held-out EventIDs) or how failures of the partition would be detected and corrected before training.

minor comments (2)

[Abstract] Abstract and results tables: error bars, exact baseline implementations, and the precise procedure for choosing the number of failure domains are not reported, making it difficult to assess robustness of the F1 numbers.
[Introduction] Notation: the distinction between 'template' and 'EventID' should be clarified in the first use, as the unseen-EventID claim depends on this distinction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our experimental claims and methodological details. We address each major comment below and will revise the manuscript accordingly.


Referee: [Experimental Evaluation] Experimental section: the headline generalization result (86.3% detection of anomalies from unseen EventIDs on BGL) rests on the LLM-proposed failure-domain partition supplying sufficient signal for the router and experts. No ablation is reported that isolates this partition against random grouping or template-ID-based grouping, leaving open whether the reported transfer performance is attributable to the proposed domains or would arise from any reasonable clustering.

Authors: We agree that the absence of an ablation isolating the LLM-proposed failure-domain partition is a limitation in the current experimental section. The reported 86.3% detection rate on unseen EventIDs could potentially be influenced by any form of grouping rather than the specific semantic domains. In the revised manuscript we will add an ablation study that compares the LLM-proposed partition against (i) random grouping of templates and (ii) grouping based solely on template IDs. This will quantify the incremental benefit of the failure-domain structure for router and expert performance on held-out EventIDs. revision: yes
Referee: [Methodology] Methodology, certification paragraph: the description of the certification step does not specify what properties are checked (internal consistency within domains versus cross-template transfer to held-out EventIDs) or how failures of the partition would be detected and corrected before training.

Authors: We acknowledge that the certification paragraph is currently underspecified. We will revise the methodology section to explicitly state the properties verified during certification: (a) internal consistency of normal/anomaly labels and representative examples within each proposed domain, and (b) preliminary evidence of cross-template transfer potential to held-out EventIDs via a small validation split. We will also describe the detection and correction process, which consists of an automated consistency check followed by optional human review of domain boundaries; any failing domains trigger re-partitioning by the LLM or manual adjustment before training proceeds. revision: yes
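The random-grouping control requested in the first major comment could be built as below. This is a sketch of one reasonable baseline, not the authors' planned ablation: it keeps the domain names and sizes of a given partition and shuffles which templates land in each, so that only the semantic grouping changes. The function name and the dict-of-lists input format are assumptions.

```python
import random
from typing import Dict, List


def random_partition_like(partition: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Return a partition with the same domain sizes but shuffled membership."""
    rng = random.Random(seed)
    templates = [t for members in partition.values() for t in members]
    rng.shuffle(templates)
    shuffled: Dict[str, List[str]] = {}
    start = 0
    for domain, members in partition.items():
        # Slice off as many templates as the original domain contained.
        shuffled[domain] = templates[start:start + len(members)]
        start += len(members)
    return shuffled


if __name__ == "__main__":
    # Hypothetical domains and template IDs, for illustration only.
    proposed = {"memory": ["E1", "E2", "E3"], "network": ["E4", "E5"], "io": ["E6"]}
    print(random_partition_like(proposed, seed=1))
```

Training the same router and experts on several such shuffles, with different seeds, would show how much of the unseen-EventID detection rate depends on the proposed domains rather than on grouping as such.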

Circularity Check

0 steps flagged

No significant circularity; empirical results on public datasets are independent of fitted parameters


The paper describes an empirical ML pipeline: offline LLM proposes and certifies a template partition into failure domains, at most K lines per template are annotated to obtain binary labels and examples, then a router plus per-domain experts are trained on the resulting data and evaluated on held-out messages including unseen EventIDs. No equations or derivations are presented that reduce a reported metric (F1, recall on unseen EventIDs) to a quantity defined by the fitted parameters themselves. The performance numbers are measured on standard public log datasets (BGL, Thunderbird) after standard train/test splits; the LLM partition is an input to training rather than a post-hoc renaming of the evaluation outcome. Self-citations, if present, are not load-bearing for the central claim because the results remain falsifiable against external benchmarks without relying on prior author work as an unverified uniqueness theorem. This is the normal case for a label-efficient supervised detector reporting concrete F1 scores.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework depends on the LLM producing useful partitions and on K labels per template being enough to represent normal versus anomalous behavior per domain; these are not derived from first principles but introduced to enable the mixture-of-experts training.

free parameters (2)

K = 100
Maximum number of labeled lines annotated per template; set to 100 in reported experiments.
number_of_failure_domains
Size of the partition of templates into failure domains proposed by the LLM.

axioms (2)

domain assumption: An LLM can propose a partition of log templates into meaningful failure domains that supports effective expert specialization.
Invoked when the LLM is used to group templates before the certification step.
domain assumption: A certification step can reliably validate the LLM-proposed partition for training purposes.
Required to ensure the domains are usable before training the router and experts.



Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

Logzip: Extract- ing hidden structures via iterative clustering for log compression,

J. Liu, J. Zhu, S. He, P. He, Z. Zheng, and M. R. Lyu, “Logzip: Extract- ing hidden structures via iterative clustering for log compression,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 863–873

work page 2019
[2]

Log parsing evaluation in the era of modern software systems,

S. Petrescu, F. Den Hengst, A. Uta, and J. S. Rellermeyer, “Log parsing evaluation in the era of modern software systems,” in2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 379–390

work page 2023
[3]

LogBERT: Log anomaly detection via BERT,

H. Guo, S. Yuan, and X. Wu, “LogBERT: Log anomaly detection via BERT,” inProceedings of the International Joint Conference on Neural Networks, 2021

work page 2021
[4]

Deeplog: Anomaly detection and diagnosis from system logs through deep learning,

M. Du, F. Li, G. Zheng, and V . Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1285–1298

work page 2017
[5]

Loganomaly: Unsupervised detection 11 of sequential and quantitative anomalies in unstructured logs,

W. Meng, Y . Liu, Y . Zhu, S. Zhang, D. Pei, Y . Liu, Y . Chen, R. Zhang, S. Tao, P. Sun, and R. Zhou, “Loganomaly: Unsupervised detection 11 of sequential and quantitative anomalies in unstructured logs,” inPro- ceedings of the International Joint Conference on Artificial Intelligence, 2019, pp. 4739–4745

work page 2019
[6]

Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,

X. Li, P. Chen, L. Jing, Z. He, and G. Yu, “Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,”IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 2762–2780, 2022

work page 2022
[7]

Log-based anomaly detection with deep learning: How far are we?

V .-H. Le and H. Zhang, “Log-based anomaly detection with deep learning: How far are we?” inProceedings of the 44th international conference on software engineering, 2022, pp. 1356–1367

work page 2022
[8]

Diagnosing network-wide traffic anomalies,

A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” inProceedings of ACM SIGCOMM, 2004, pp. 219–230

work page 2004
[9]

Unsupervised log message anomaly detection,

A. Farzad and T. A. Gulliver, “Unsupervised log message anomaly detection,”ICT Express, vol. 6, no. 3, pp. 229–237, 2020

work page 2020
[10]

Hitanomaly: Hierarchical transformers for anomaly detection in system log,

S. Huang, Y . Liu, C. Fung, R. He, Y . Zhao, H. Yang, and Z. Luan, “Hitanomaly: Hierarchical transformers for anomaly detection in system log,”IEEE transactions on network and service management, vol. 17, no. 4, pp. 2064–2076, 2020

work page 2064
[11]

Robust log-based anomaly detection on unstable log data,

X. Zhang, Y . Xu, Q. Lin, B. Qiao, H. Zhang, Y . Dang, C. Xie, X. Yang, Q. Cheng, Z. Liet al., “Robust log-based anomaly detection on unstable log data,” inProceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2019, pp. 807–817

work page 2019
[12]

Loggpt: Exploring chatgpt for log-based anomaly detection,

J. Qi, S. Huang, Z. Luan, S. Yang, C. Fung, H. Yang, D. Qian, J. Shang, Z. Xiao, and Z. Wu, “Loggpt: Exploring chatgpt for log-based anomaly detection,” in2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCit...

work page 2023
[13]

Automatic root cause analysis via large language models for cloud incidents,

Y . Chen, H. Xie, M. Ma, Y . Kang, X. Gao, L. Shi, Y . Cao, X. Gao, H. Fan, M. Wen, J. Zhu, A. Sailer, L. Lozano, C. Bansal, S. Rajmohan, and D. Zhang, “Automatic root cause analysis via large language models for cloud incidents,” inProceedings of the Nineteenth European Conference on Computer Systems (EuroSys), 2024, pp. 674–688

work page 2024
[14]

Aetherlog: Log-based root cause analysis by integrating large language models with knowledge graphs,

T. Cui, R. Fu, C. Liu, Y . Ji, W. Gu, S. Zhang, Y . Sun, and D. Pei, “Aetherlog: Log-based root cause analysis by integrating large language models with knowledge graphs,” in2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 49–60

work page 2025
[15]

Adaptive mixtures of local experts,

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,”Neural Computation, vol. 3, no. 1, pp. 79–87, 1991

work page 1991
[16]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”Journal of Machine Learning Research, vol. 23, pp. 1–39, 2022

work page 2022
[17]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inPro- ceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

work page 2019
[18]

What supercomputers say: A study of five system logs,

A. Oliner and J. Stearley, “What supercomputers say: A study of five system logs,” in37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), 2007, pp. 575–584

work page 2007
[19]

Loghub: A large collection of system log datasets for ai-driven log analytics,

J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” in2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 355–366

work page 2023
[20]

Drain: An online log parsing approach with fixed depth tree,

P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” inProceedings of IEEE International Conference on Web Services, 2017, pp. 33–40

work page 2017
[21]

Demix layers: Disentangling domains for modular language modeling,

S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettle- moyer, “Demix layers: Disentangling domains for modular language modeling,” inProceedings of the 2022 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 5557–5576

work page 2022
[22]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

V . Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter,”arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[23]

How to fine-tune BERT for text classification?

C. Sun, X. Qiu, Y . Xu, and X. Huang, “How to fine-tune BERT for text classification?” inChina National Conference on Chinese Computational Linguistics, 2019, pp. 194–206

work page 2019
[24]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), 2019, pp. 3982–3992

work page 2019
[25]

Gpt-5 technical overview,

OpenAI, “Gpt-5 technical overview,” https://openai.com, 2026, accessed: April 2026

work page 2026
[26]

Claude model documentation,

Anthropic, “Claude model documentation,” https://www.anthropic.com, 2026, accessed: April 2026

work page 2026
[27]

Gemini api documentation,

G. DeepMind, “Gemini api documentation,” https://ai.google.dev, 2026, accessed: April 2026

work page 2026
[28]

Pricing — OpenAI Developer Platform,

OpenAI, “Pricing — OpenAI Developer Platform,” https://openai.com/ api/pricing/, accessed: April 2026

work page 2026
[29]

Pricing — Anthropic Developer Documentation,

Anthropic, “Pricing — Anthropic Developer Documentation,” https:// www.anthropic.com/pricing, accessed: April 2026

work page 2026
[30]

Gemini Developer API Pricing,

Google DeepMind, “Gemini Developer API Pricing,” https://ai.google. dev/pricing, accessed: April 2026

work page 2026
[31]

Clustering event logs using iterative partitioning,

A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering event logs using iterative partitioning,” inProceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 1255–1264

work page 2009
[32]

Spell: Streaming parsing of system event logs,

M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016, pp. 859–864

work page 2016
[33]

Logram: Efficient log parsing usingnn-gram dictionaries,

H. Dai, H. Li, C.-S. Chen, W. Shang, and T.-H. Chen, “Logram: Efficient log parsing usingnn-gram dictionaries,”IEEE transactions on software engineering, vol. 48, no. 3, pp. 879–892, 2020

work page 2020
[34]

A survey on automated log analysis for reliability engineering,

S. He, P. He, Z. Chen, T. Yang, Y . Su, and M. R. Lyu, “A survey on automated log analysis for reliability engineering,”ACM computing surveys (CSUR), vol. 54, no. 6, pp. 1–37, 2021

work page 2021
[35]

Allinfolog: Robust diverse anomalies detection based on all log features,

R. Xiao, H. Chen, J. Lu, W. Li, and S. Jin, “Allinfolog: Robust diverse anomalies detection based on all log features,”IEEE Transactions on Network and Service Management, vol. 20, no. 3, pp. 2529–2543, 2022

work page 2022
[36]

Prelog: A pre-trained model for log analytics,

V .-H. Le and H. Zhang, “Prelog: A pre-trained model for log analytics,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–28, 2024

work page 2024
[37]

No more labelled examples? an unsupervised log parser with llms,

J. Huang, Z. Jiang, Z. Chen, and M. Lyu, “No more labelled examples? an unsupervised log parser with llms,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2406–2429, 2025

work page 2025
[38]

Cslparser: A collaborative framework using small and large language models for log parsing,

W. Hong, Y . Wu, L. Zhang, C. Duan, P. Xiao, M. He, X. Yang, and Y . Li, “Cslparser: A collaborative framework using small and large language models for log parsing,” in2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 61–72

work page 2025
[39]

Logkg: Log failure diagnosis through knowledge graph,

Y . Sui, Y . Zhang, J. Sun, T. Xu, S. Zhang, Z. Li, Y . Sun, F. Guo, J. Shen, Y . Zhanget al., “Logkg: Log failure diagnosis through knowledge graph,”IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3493–3507, 2023

work page 2023
[40]

Large language models can provide accurate and interpretable incident triage,

Z. Wang, J. Li, M. Ma, Z. Li, Y . Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Linet al., “Large language models can provide accurate and interpretable incident triage,” in2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, pp. 523–534

work page 2024
[41]

From logs to causal inference: di- agnosing large systems,

M. Markakis, B. Youngmann, T. Gao, Z. Zhang, R. Shahout, P. B. Chen, C. Liu, I. Sabek, and M. Cafarella, “From logs to causal inference: di- agnosing large systems,”Proceedings of the VLDB Endowment, vol. 18, no. 2, pp. 158–172, 2024

work page 2024
[42]

Adaptivelog: An adaptive log analysis framework with the collaboration of large and small language model,

L. Ma, W. Yang, Y . Li, B. Fei, M. Zhou, S. Li, S. Jiang, B. Xu, and Y . Xiao, “Adaptivelog: An adaptive log analysis framework with the collaboration of large and small language model,”ACM Transactions on Software Engineering and Methodology, 2025

work page 2025
[43]

The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,

Y . Han, Q. Du, Y . Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943

work page 2024
[44]

Logmoe: Lightweight expert mixture for cross- system log anomaly detection,

J. Qi, Z. Luan, S. Huang, C. Fung, Y . Wang, A. Wang, H. Zhang, H. Yang, and D. Qian, “Logmoe: Lightweight expert mixture for cross- system log anomaly detection,” in2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2025, pp. 330– 341

work page 2025
[45]

Plelog: Semi-supervised log-based anomaly detection via probabilistic label estimation,

L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, and W. Zhang, “Plelog: Semi-supervised log-based anomaly detection via probabilistic label estimation,” in2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 2021, pp. 230–231. 12

work page 2021

[1] [1]

Logzip: Extract- ing hidden structures via iterative clustering for log compression,

J. Liu, J. Zhu, S. He, P. He, Z. Zheng, and M. R. Lyu, “Logzip: Extract- ing hidden structures via iterative clustering for log compression,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 863–873

work page 2019

[2] [2]

Log parsing evaluation in the era of modern software systems,

S. Petrescu, F. Den Hengst, A. Uta, and J. S. Rellermeyer, “Log parsing evaluation in the era of modern software systems,” in2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 379–390

work page 2023

[3] [3]

LogBERT: Log anomaly detection via BERT,

H. Guo, S. Yuan, and X. Wu, “LogBERT: Log anomaly detection via BERT,” inProceedings of the International Joint Conference on Neural Networks, 2021

work page 2021

[4] [4]

Deeplog: Anomaly detection and diagnosis from system logs through deep learning,

M. Du, F. Li, G. Zheng, and V . Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” inProceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1285–1298

work page 2017

[5] [5]

Loganomaly: Unsupervised detection 11 of sequential and quantitative anomalies in unstructured logs,

W. Meng, Y . Liu, Y . Zhu, S. Zhang, D. Pei, Y . Liu, Y . Chen, R. Zhang, S. Tao, P. Sun, and R. Zhou, “Loganomaly: Unsupervised detection 11 of sequential and quantitative anomalies in unstructured logs,” inPro- ceedings of the International Joint Conference on Artificial Intelligence, 2019, pp. 4739–4745

work page 2019

[6] [6]

Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,

X. Li, P. Chen, L. Jing, Z. He, and G. Yu, “Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,”IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 2762–2780, 2022

work page 2022

[7] [7]

Log-based anomaly detection with deep learning: How far are we?

V .-H. Le and H. Zhang, “Log-based anomaly detection with deep learning: How far are we?” inProceedings of the 44th international conference on software engineering, 2022, pp. 1356–1367

work page 2022

[8] [8]

Diagnosing network-wide traffic anomalies,

A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” inProceedings of ACM SIGCOMM, 2004, pp. 219–230

work page 2004

[9] [9]

Unsupervised log message anomaly detection,

A. Farzad and T. A. Gulliver, “Unsupervised log message anomaly detection,”ICT Express, vol. 6, no. 3, pp. 229–237, 2020

work page 2020

[10] [10]

Hitanomaly: Hierarchical transformers for anomaly detection in system log,

S. Huang, Y . Liu, C. Fung, R. He, Y . Zhao, H. Yang, and Z. Luan, “Hitanomaly: Hierarchical transformers for anomaly detection in system log,”IEEE transactions on network and service management, vol. 17, no. 4, pp. 2064–2076, 2020

work page 2064

[11] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li et al., “Robust log-based anomaly detection on unstable log data,” in Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2019, pp. 807–817.

[12] J. Qi, S. Huang, Z. Luan, S. Yang, C. Fung, H. Yang, D. Qian, J. Shang, Z. Xiao, and Z. Wu, “Loggpt: Exploring chatgpt for log-based anomaly detection,” in 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCit...

[13] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, J. Zhu, A. Sailer, L. Lozano, C. Bansal, S. Rajmohan, and D. Zhang, “Automatic root cause analysis via large language models for cloud incidents,” in Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys), 2024, pp. 674–688.

[14] T. Cui, R. Fu, C. Liu, Y. Ji, W. Gu, S. Zhang, Y. Sun, and D. Pei, “Aetherlog: Log-based root cause analysis by integrating large language models with knowledge graphs,” in 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 49–60.

[15] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.

[16] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, pp. 1–39, 2022.

[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.

[18] A. Oliner and J. Stearley, “What supercomputers say: A study of five system logs,” in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), 2007, pp. 575–584.

[19] J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 355–366.

[20] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in Proceedings of IEEE International Conference on Web Services, 2017, pp. 33–40.

[21] S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettlemoyer, “Demix layers: Disentangling domains for modular language modeling,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 5557–5576.

[22] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.

[23] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?” in China National Conference on Chinese Computational Linguistics, 2019, pp. 194–206.

[24] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.

[25] OpenAI, “Gpt-5 technical overview,” https://openai.com, 2026, accessed: April 2026.

[26] Anthropic, “Claude model documentation,” https://www.anthropic.com, 2026, accessed: April 2026.

[27] Google DeepMind, “Gemini api documentation,” https://ai.google.dev, 2026, accessed: April 2026.

[28] OpenAI, “Pricing — OpenAI Developer Platform,” https://openai.com/api/pricing/, accessed: April 2026.

[29] Anthropic, “Pricing — Anthropic Developer Documentation,” https://www.anthropic.com/pricing, accessed: April 2026.

[30] Google DeepMind, “Gemini Developer API Pricing,” https://ai.google.dev/pricing, accessed: April 2026.

[31] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering event logs using iterative partitioning,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 1255–1264.

[32] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016, pp. 859–864.

[33] H. Dai, H. Li, C.-S. Chen, W. Shang, and T.-H. Chen, “Logram: Efficient log parsing using n-gram dictionaries,” IEEE transactions on software engineering, vol. 48, no. 3, pp. 879–892, 2020.

[34] S. He, P. He, Z. Chen, T. Yang, Y. Su, and M. R. Lyu, “A survey on automated log analysis for reliability engineering,” ACM computing surveys (CSUR), vol. 54, no. 6, pp. 1–37, 2021.

[35] R. Xiao, H. Chen, J. Lu, W. Li, and S. Jin, “Allinfolog: Robust diverse anomalies detection based on all log features,” IEEE Transactions on Network and Service Management, vol. 20, no. 3, pp. 2529–2543, 2022.

[36] V.-H. Le and H. Zhang, “Prelog: A pre-trained model for log analytics,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–28, 2024.

[37] J. Huang, Z. Jiang, Z. Chen, and M. Lyu, “No more labelled examples? an unsupervised log parser with llms,” Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 2406–2429, 2025.

[38] W. Hong, Y. Wu, L. Zhang, C. Duan, P. Xiao, M. He, X. Yang, and Y. Li, “Cslparser: A collaborative framework using small and large language models for log parsing,” in 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 61–72.

[39] Y. Sui, Y. Zhang, J. Sun, T. Xu, S. Zhang, Z. Li, Y. Sun, F. Guo, J. Shen, Y. Zhang et al., “Logkg: Log failure diagnosis through knowledge graph,” IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3493–3507, 2023.

[40] Z. Wang, J. Li, M. Ma, Z. Li, Y. Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Lin et al., “Large language models can provide accurate and interpretable incident triage,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, pp. 523–534.

[41] M. Markakis, B. Youngmann, T. Gao, Z. Zhang, R. Shahout, P. B. Chen, C. Liu, I. Sabek, and M. Cafarella, “From logs to causal inference: diagnosing large systems,” Proceedings of the VLDB Endowment, vol. 18, no. 2, pp. 158–172, 2024.

[42] L. Ma, W. Yang, Y. Li, B. Fei, M. Zhou, S. Li, S. Jiang, B. Xu, and Y. Xiao, “Adaptivelog: An adaptive log analysis framework with the collaboration of large and small language model,” ACM Transactions on Software Engineering and Methodology, 2025.

[43] Y. Han, Q. Du, Y. Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943.

[44] J. Qi, Z. Luan, S. Huang, C. Fung, Y. Wang, A. Wang, H. Zhang, H. Yang, and D. Qian, “Logmoe: Lightweight expert mixture for cross-system log anomaly detection,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2025, pp. 330–341.

[45] L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, and W. Zhang, “Plelog: Semi-supervised log-based anomaly detection via probabilistic label estimation,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 2021, pp. 230–231.