FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection
Pith reviewed 2026-05-22 03:34 UTC · model grok-4.3
The pith
FAME trains a router and domain experts on at most K labels per log template plus one LLM-proposed failure-domain partition to detect anomalies at the individual message level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAME is a label-efficient, message-level mixture-of-experts framework. It annotates at most K lines per template to derive binary normal/anomaly indicators and representative examples, has an LLM propose a partition of templates into failure domains that is then certified, and trains a lightweight router plus domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL it reaches F1 = 98.16 at K = 100, a 76x reduction in annotation effort, while detecting 86.3 percent of anomalies from unseen EventIDs; on Thunderbird it reaches F1 = 99.95 with perfect recall.
What carries the argument
A router that directs each incoming log message to one of several domain-specific expert models, where the domains come from an LLM-proposed and certified partition of log templates into failure categories, trained from binary labels on at most K examples per template.
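The route-then-score flow described here can be illustrated with a toy sketch. It is not the paper's implementation: FAME's router and experts are trained models, whereas the vocabulary-overlap router, the phrase-matching experts, and every name below (`Expert`, `keyword_expert`, `route`, `predict`, `DOMAIN_VOCAB`, `EXPERTS`) are hypothetical stand-ins chosen only to show how one message yields both a failure-domain label and an anomaly decision.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass
class Expert:
    domain: str
    score: Callable[[str], float]  # anomaly score in [0, 1]


def keyword_expert(domain: str, phrases: Tuple[str, ...]) -> Expert:
    """Toy expert (assumption): score grows with the number of alarm phrases found."""
    def score(message: str) -> float:
        text = message.lower()
        hits = sum(1 for phrase in phrases if phrase in text)
        return min(1.0, hits / 2)
    return Expert(domain, score)


def route(message: str, domain_vocab: Dict[str, Set[str]]) -> str:
    """Toy router (assumption): pick the domain whose vocabulary overlaps most."""
    tokens = set(message.lower().split())
    # sorted() makes ties deterministic: the alphabetically first domain wins.
    return max(sorted(domain_vocab), key=lambda d: len(tokens & domain_vocab[d]))


def predict(message: str,
            domain_vocab: Dict[str, Set[str]],
            experts: Dict[str, Expert],
            threshold: float = 0.5) -> Tuple[str, bool]:
    """Return (failure-domain label, anomaly flag) for a single log message."""
    domain = route(message, domain_vocab)
    return domain, experts[domain].score(message) >= threshold


# Hypothetical failure domains and alarm phrases, for illustration only.
DOMAIN_VOCAB = {
    "memory": {"ecc", "cache", "parity"},
    "network": {"link", "packet", "torus"},
}
EXPERTS = {
    "memory": keyword_expert("memory", ("uncorrectable", "parity error")),
    "network": keyword_expert("network", ("link down", "timeout")),
}
```

Only the selected expert scores each message, which is what keeps per-message cost low relative to running every expert or calling an LLM per line.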
If this is right
- Message-level predictions would reduce the number of routine log lines an operator must inspect per alert.
- The model would continue to flag anomalies even when they appear under previously unseen EventIDs.
- Annotation budgets could drop by roughly 76x while still producing F1 scores above 98 on standard benchmarks.
- Failure-domain labels would accompany each detection, giving operators immediate context about the subsystem involved.
Where Pith is reading between the lines
- The same router-plus-experts structure could be tested on other heterogeneous log sources such as network device logs or application traces without changing the core training procedure.
- If the certification step for the LLM partition is replaced by a simple majority vote from a small set of human reviewers, the framework might still retain most of its accuracy gain.
- Running the experts in parallel on a multi-core server would allow real-time scoring of high-volume streams while keeping per-message latency low.
Load-bearing premise
Annotating at most K lines per template plus an LLM-proposed and certified partition of templates into failure domains supplies enough signal to train a router and experts that generalize to message-level detection across heterogeneous subsystems and unseen EventIDs.
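To make the annotation budget concrete, the sketch below caps sampling at K lines per template and then summarizes the resulting labels per template. The function names, the uniform random sampling, and the reading of "binary indicators" as two per-template flags are assumptions for illustration; the source does not specify how lines are selected or how the indicators are encoded.

```python
import random
from collections import defaultdict


def sample_for_annotation(lines, k, seed=0):
    """lines: iterable of (template_id, message).

    Returns {template_id: [messages]} with at most k messages per template.
    Uniform sampling is an assumption, not the paper's stated procedure.
    """
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for template_id, message in lines:
        by_template[template_id].append(message)
    return {
        t: (msgs if len(msgs) <= k else rng.sample(msgs, k))
        for t, msgs in by_template.items()
    }


def annotation_budget(picked):
    """Total number of lines a human would have to label."""
    return sum(len(msgs) for msgs in picked.values())


def template_indicators(labeled):
    """labeled: {template_id: [(message, is_anomaly), ...]}.

    Returns per-template flags recording whether normal and anomalous
    examples were observed among the annotated lines.
    """
    return {
        t: {
            "has_normal": any(not y for _, y in rows),
            "has_anomaly": any(y for _, y in rows),
        }
        for t, rows in labeled.items()
    }
```

A template with both flags set is exactly the case the abstract calls out, where one template covers both normal and anomalous messages and a template-level label would be insufficient.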
What would settle it
A new log dataset in which the trained router and experts miss more than 20 percent of anomalies from previously unseen EventIDs even after using the stated K annotations per template would falsify the generalization claim.
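The falsification test above reduces to a recall computation restricted to anomalies whose EventID never appeared in training. The sketch below shows one way to run that check; the record layout and function names are hypothetical, and the 20 percent threshold is taken from the statement above rather than from the paper.

```python
def unseen_eventid_recall(records, train_event_ids):
    """records: iterable of (event_id, is_anomaly, predicted_anomaly).

    Returns recall over true anomalies whose EventID was absent from
    training, or None when no such anomaly exists.
    """
    hits = total = 0
    for event_id, is_anomaly, predicted in records:
        if is_anomaly and event_id not in train_event_ids:
            total += 1
            hits += int(bool(predicted))
    return hits / total if total else None


def generalization_falsified(records, train_event_ids, max_miss_rate=0.20):
    """True when more than max_miss_rate of unseen-EventID anomalies are missed."""
    recall = unseen_eventid_recall(records, train_event_ids)
    if recall is None:
        return None  # the dataset cannot test the claim
    return (1.0 - recall) > max_miss_rate
```

Returning `None` for datasets without unseen-EventID anomalies keeps an untestable case from being reported as either a pass or a failure.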
Original abstract
Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging. A single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FAME, a label-efficient mixture-of-experts framework for message-level log anomaly detection. An LLM is used once offline to propose a partition of log templates into failure domains after annotating at most K lines per template for binary labels and examples. A router and per-domain experts are then trained to produce anomaly predictions and failure-domain labels at inference time. On the BGL dataset, FAME reports F1=98.16 at K=100 (76x annotation reduction) and detects 86.3% of anomalies from unseen EventIDs; on Thunderbird it reaches F1=99.95 with perfect recall.
Significance. If the central claims hold, the work offers a practical advance in log anomaly detection by shifting from coarse session/window-level alerts to message-level granularity while keeping labeling costs low and inference on-premise. The offline-LLM-plus-lightweight-MoE design addresses cost and heterogeneity issues that limit prior approaches.
major comments (2)
- [Experimental Evaluation] Experimental section: the headline generalization result (86.3% detection of anomalies from unseen EventIDs on BGL) rests on the LLM-proposed failure-domain partition supplying sufficient signal for the router and experts. No ablation is reported that isolates this partition against random grouping or template-ID-based grouping, leaving open whether the reported transfer performance is attributable to the proposed domains or would arise from any reasonable clustering.
- [Methodology] Methodology, certification paragraph: the description of the certification step does not specify what properties are checked (internal consistency within domains versus cross-template transfer to held-out EventIDs) or how failures of the partition would be detected and corrected before training.
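The ablation requested in the first major comment needs control partitions to compare against the LLM proposal. A minimal sketch of two such controls follows, assuming the partition is stored as a mapping from template ID to domain name; both function names are hypothetical, and the choice to preserve domain sizes in the random control is a design assumption rather than anything the paper states.

```python
import random


def random_partition_like(partition, seed=0):
    """partition: {template_id: domain}.

    Returns a random reassignment that keeps the same domain names and
    the same number of templates per domain, so only the semantic
    grouping is destroyed.
    """
    rng = random.Random(seed)
    templates = sorted(partition)
    domains = [partition[t] for t in templates]
    rng.shuffle(domains)
    return dict(zip(templates, domains))


def per_template_partition(partition):
    """Degenerate control: every template forms its own domain."""
    return {t: f"template:{t}" for t in partition}
```

Training the same router and experts on each control and comparing recall on held-out EventIDs would show how much of the transfer result depends on the proposed domains.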
minor comments (2)
- [Abstract] Abstract and results tables: error bars, exact baseline implementations, and the precise procedure for choosing the number of failure domains are not reported, making it difficult to assess robustness of the F1 numbers.
- [Introduction] Notation: the distinction between 'template' and 'EventID' should be clarified at first use, as the unseen-EventID claim depends on this distinction.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our experimental claims and methodological details. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Experimental Evaluation] Experimental section: the headline generalization result (86.3% detection of anomalies from unseen EventIDs on BGL) rests on the LLM-proposed failure-domain partition supplying sufficient signal for the router and experts. No ablation is reported that isolates this partition against random grouping or template-ID-based grouping, leaving open whether the reported transfer performance is attributable to the proposed domains or would arise from any reasonable clustering.
Authors: We agree that the absence of an ablation isolating the LLM-proposed failure-domain partition is a limitation in the current experimental section. The reported 86.3% detection rate on unseen EventIDs could potentially be influenced by any form of grouping rather than the specific semantic domains. In the revised manuscript we will add an ablation study that compares the LLM-proposed partition against (i) random grouping of templates and (ii) grouping based solely on template IDs. This will quantify the incremental benefit of the failure-domain structure for router and expert performance on held-out EventIDs. revision: yes
- Referee: [Methodology] Methodology, certification paragraph: the description of the certification step does not specify what properties are checked (internal consistency within domains versus cross-template transfer to held-out EventIDs) or how failures of the partition would be detected and corrected before training.
Authors: We acknowledge that the certification paragraph is currently underspecified. We will revise the methodology section to explicitly state the properties verified during certification: (a) internal consistency of normal/anomaly labels and representative examples within each proposed domain, and (b) preliminary evidence of cross-template transfer potential to held-out EventIDs via a small validation split. We will also describe the detection and correction process, which consists of an automated consistency check followed by optional human review of domain boundaries; any failing domains trigger re-partitioning by the LLM or manual adjustment before training proceeds. revision: yes
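The automated consistency check mentioned in the response is not specified in the source. As a hedged illustration, the sketch below covers only the structural part that any such check would plausibly include: confirming that the proposal is a true partition of the known templates. The function name, the input layout, and the issue messages are assumptions; label consistency within domains and transfer to held-out EventIDs are not covered.

```python
def certify_partition(template_ids, proposal):
    """template_ids: set of known template IDs.
    proposal: {domain_name: [template_id, ...]} as returned by the LLM.

    Returns a list of human-readable issues; an empty list means the
    proposal assigns every known template to exactly one non-empty domain.
    """
    issues = []
    assigned = {}  # template_id -> first domain that claimed it
    for domain, members in proposal.items():
        if not members:
            issues.append(f"empty domain: {domain}")
        for t in members:
            if t not in template_ids:
                issues.append(f"unknown template: {t}")
            elif t in assigned:
                issues.append(f"template {t} in both {assigned[t]} and {domain}")
            else:
                assigned[t] = domain
    for t in sorted(set(template_ids) - set(assigned)):
        issues.append(f"unassigned template: {t}")
    return issues
```

A non-empty result would be the trigger for the re-partitioning or manual adjustment step the response describes, with training proceeding only once the list is empty.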
Circularity Check
No significant circularity; empirical results on public datasets are independent of fitted parameters
The paper describes an empirical ML pipeline: at most K lines per template are annotated to obtain binary labels and examples, an LLM proposes offline a partition of templates into failure domains that a separate certification step validates, and a router plus per-domain experts are then trained on the resulting data and evaluated on held-out messages, including unseen EventIDs. No equations or derivations are presented that reduce a reported metric (F1, recall on unseen EventIDs) to a quantity defined by the fitted parameters themselves. The performance numbers are measured on standard public log datasets (BGL, Thunderbird) after standard train/test splits; the LLM partition is an input to training rather than a post-hoc renaming of the evaluation outcome. Self-citations, if present, are not load-bearing for the central claim because the results remain falsifiable against external benchmarks without relying on prior author work as an unverified uniqueness theorem. This is the normal case for a label-efficient supervised detector reporting concrete F1 scores.
Axiom & Free-Parameter Ledger
free parameters (2)
- K = 100
- number_of_failure_domains
axioms (2)
- domain assumption: An LLM can propose a partition of log templates into meaningful failure domains that supports effective expert specialization.
- domain assumption: A certification step can reliably validate the LLM-proposed partition for training purposes.