Log-based vs Graph-based Approaches to Fault Diagnosis
Pith reviewed 2026-05-10 12:23 UTC · model grok-4.3
The pith
Integrating representations from log encoders into graph models yields the strongest performance for automated fault diagnosis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper conducts a comparative study of log-based encoder architectures and graph-based models for automated fault diagnosis. It evaluates the models on TraceBench, a trace-oriented log dataset, and on BGL, a traditional system log dataset, covering both anomaly detection and fault type classification. The results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance.
What carries the argument
Hybrid architectures that integrate learned representations from log encoders into graph-based models to capture both sequential event patterns and structural log relationships.
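As a rough illustration of what such an integration might look like, the sketch below assumes each log event already has an embedding from a log encoder and that the graph is given as a directed edge list. The function name `aggregate_neighbors`, the mean-aggregation rule, and the toy vectors are illustrative assumptions, not the paper's architecture.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def aggregate_neighbors(
    embeddings: Dict[str, Vector],
    edges: List[Tuple[str, str]],
) -> Dict[str, Vector]:
    """One round of mean aggregation over a node and its in-neighbors.

    embeddings: node id -> vector (assumed to come from a log encoder).
    edges: directed (source, target) pairs, e.g. parent -> child.
    """
    # Every node starts with itself in its neighborhood (a self-loop).
    neighborhood: Dict[str, List[str]] = {n: [n] for n in embeddings}
    for src, dst in edges:
        neighborhood[dst].append(src)

    updated: Dict[str, Vector] = {}
    for node, members in neighborhood.items():
        dim = len(embeddings[node])
        # Average each coordinate over the node and its in-neighbors.
        updated[node] = [
            sum(embeddings[m][i] for m in members) / len(members)
            for i in range(dim)
        ]
    return updated


# Toy 2-dimensional "encoder" embeddings for three events (placeholders).
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_edges = [("a", "b"), ("a", "c")]  # event a is the parent of b and c
mixed = aggregate_neighbors(toy_embeddings, toy_edges)
```

A trainable model would typically follow this aggregation with a learned linear map and a nonlinearity; both are omitted here so the structural step stays visible.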
Load-bearing premise
The observed performance differences stem primarily from the choice of log-sequence versus graph representations rather than from specific model implementations, hyperparameter choices, or characteristics of the datasets.
What would settle it
Finding that a graph-only model, after equivalent hyperparameter tuning and training, matches or exceeds the hybrid model's accuracy on anomaly detection and fault classification in both TraceBench and BGL would falsify the central claim.
Original abstract
Modern distributed systems generate large volumes of logs that can be analyzed to support essential AIOps tasks such as fault diagnosis, which plays a crucial role in maintaining system reliability. Most existing approaches rely on log-based models that treat logs as linear sequences of events. However, such representations discard the structural context between events that are often present in execution logs, such as parent-child dependencies, fan-out (branching), or temporal features. To better capture these relationships, recent works on Graph Neural Networks (GNNs) suggest that representing logs as graphs offers a promising alternative. Building on these observations, this paper conducts a comparative study of log-based encoder architectures (e.g., BERT) and graph-based models (e.g., GNNs) for automated fault diagnosis. We evaluate our models on TraceBench, a trace-oriented log dataset, and on BGL, a more traditional system log dataset, covering both anomaly detection and fault type classification. Our results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance. These findings highlight conditions under which graph-augmented architectures can outperform traditional log-based approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical comparative study of log-based encoder architectures (e.g., BERT) versus graph-based models (e.g., GNNs) for fault diagnosis in distributed systems. It evaluates both approaches, along with hybrids that integrate encoder representations into graph models, on the TraceBench trace-oriented dataset and the BGL system log dataset for the tasks of anomaly detection and fault type classification. The central claims are that graph-only models fail to outperform encoder baselines, while the hybrid models achieve the strongest overall performance, highlighting conditions under which graph augmentation is beneficial.
Significance. If the results hold after addressing controls for model capacity, the work would offer practical guidance to the AIOps community on representation choices for log-based fault diagnosis. It would clarify that graph structure alone is insufficient but can enhance encoder features, potentially shaping future model designs for system reliability tasks.
major comments (2)
- [Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.
- [Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.
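The capacity concern in the first major comment can be made concrete with a parameter count. The sketch below assumes GCN-style layers of the form H' = act(A_norm H W + b), where the normalized adjacency is fixed by the graph and not trained; under that assumption an MLP with the same layer sizes is a capacity-matched control. The layer sizes are hypothetical, and attention-based or concatenation-based graph layers would add parameters that this count does not cover.

```python
from typing import List


def dense_stack_params(layer_sizes: List[int]) -> int:
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


def gcn_stack_params(layer_sizes: List[int]) -> int:
    """Parameters for a stack of GCN-style layers H' = act(A_norm @ H @ W + b).

    The normalized adjacency A_norm is fixed by the graph and has no
    trainable entries, so the count equals that of the dense stack.
    """
    return dense_stack_params(layer_sizes)


# Hypothetical sizes: 768-d encoder embedding -> 128 hidden -> 2 classes.
sizes = [768, 128, 2]
mlp_params = dense_stack_params(sizes)
gcn_params = gcn_stack_params(sizes)
```

With matched counts, any remaining gap between the two heads would be easier to attribute to the graph aggregation than to model size.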
minor comments (2)
- [Abstract] The abstract states that the findings highlight 'conditions under which graph-augmented architectures can outperform traditional log-based approaches', but the main text should explicitly enumerate these conditions with direct references to the supporting tables or figures.
- [Methodology] Notation for the graph construction process (e.g., how parent-child dependencies and temporal features are encoded as edges) could be clarified with a small illustrative example or pseudocode to improve reproducibility.
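The kind of illustrative example requested in the Methodology comment might resemble the sketch below. It assumes each event carries an id, an optional parent id, and a timestamp, and it derives two edge types: parent-child edges and temporal edges between siblings in timestamp order. The `Event` fields, the `build_edges` name, and the sibling-ordering rule are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    event_id: str
    parent_id: Optional[str]  # None for a root event
    timestamp: float


def build_edges(events: List[Event]) -> Dict[str, List[Tuple[str, str]]]:
    """Return parent-child edges and temporal edges between sibling events."""
    parent_child = [
        (e.parent_id, e.event_id) for e in events if e.parent_id is not None
    ]

    # Group children by parent so siblings can be ordered in time.
    children: Dict[str, List[Event]] = {}
    for e in events:
        if e.parent_id is not None:
            children.setdefault(e.parent_id, []).append(e)

    temporal: List[Tuple[str, str]] = []
    for siblings in children.values():
        ordered = sorted(siblings, key=lambda e: e.timestamp)
        # Link each sibling to the next one in timestamp order.
        temporal.extend(
            (a.event_id, b.event_id) for a, b in zip(ordered, ordered[1:])
        )
    return {"parent_child": parent_child, "temporal": temporal}


# A toy trace: root fans out to x and y; x has one child z.
events = [
    Event("root", None, 0.0),
    Event("x", "root", 1.0),
    Event("y", "root", 2.0),
    Event("z", "x", 1.5),
]
graph = build_edges(events)
```

Fan-out for a node can be read off as the number of parent-child edges leaving it, which is 2 for `root` in the toy trace.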
Simulated Author's Rebuttal
Thank you for the detailed and constructive feedback on our manuscript. We appreciate the referee's insights into strengthening the experimental evaluation and results reporting. We address each major comment below, and we will revise the manuscript to incorporate the suggested improvements.
- Referee: [Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.
Authors: We agree with this observation. To better isolate the contribution of the graph structure, we will add an ablation study in the revised manuscript where the encoder embeddings are fed into a non-graph baseline such as an MLP, with all other factors (e.g., embedding dimensions, training procedures) held fixed. This will provide clearer evidence on whether the performance gains in hybrid models stem from the graph augmentation rather than increased model capacity. We believe this addition will strengthen the attribution of results to the log-sequence versus graph representations.
revision: yes
- Referee: [Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.
Authors: We acknowledge that the current manuscript lacks sufficient details on the experimental setup for reproducibility and statistical rigor. In the revised version, we will include comprehensive information on the hyperparameter search procedures (e.g., grid search ranges and selected values), the number of independent runs performed, and the results of statistical significance tests (such as t-tests) comparing the performance differences. This will support the claim that graph-only models do not outperform encoder baselines and that the observed differences are not due to implementation choices.
revision: yes
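One form the promised significance testing could take is a paired t-test over runs that share seeds or data splits. The sketch below computes only the t statistic using the standard library; the function name and the score lists are placeholders for illustration and are not results from the paper.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(
    scores_a: Sequence[float], scores_b: Sequence[float]
) -> float:
    """t statistic for paired samples (same seeds or splits for both models)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Placeholder scores for illustration only; NOT results from the paper.
model_a_runs = [0.5, 0.75, 1.0]
model_b_runs = [0.25, 0.25, 0.25]
t_value = paired_t_statistic(model_a_runs, model_b_runs)
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` returns the corresponding p-value directly if SciPy is available.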
Circularity Check
No circularity: purely empirical model comparison on external benchmarks
The paper conducts an empirical evaluation of log-sequence encoders versus graph-based GNNs (and hybrids) for fault diagnosis on the TraceBench and BGL datasets. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims rest on direct experimental performance metrics rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The central result (hybrid superiority) is presented as an observed outcome on held-out data, not a logical consequence of the paper's own inputs. This is a standard self-contained empirical study with no load-bearing circular steps.