Log-based vs Graph-based Approaches to Fault Diagnosis

Mathis Nguyen; Mohamed Ali Lajnef

arXiv: 2604.14019 · v1 · submitted 2026-04-15 · 💻 cs.SE

Pith reviewed 2026-05-10 12:23 UTC · model grok-4.3

classification 💻 cs.SE

keywords fault diagnosis · log analysis · graph neural networks · log encoders · anomaly detection · hybrid models · distributed systems · AIOps

The pith

Integrating representations from log encoders into graph models yields the strongest performance for automated fault diagnosis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares log-based encoders that process logs as event sequences against graph-based models that represent logs with structural relationships such as dependencies and branching. Evaluations on TraceBench and BGL datasets show that graph-only models do not beat encoder baselines for anomaly detection or fault type classification. The strongest results occur when learned representations from the log encoders are fed into graph architectures. This matters to readers because it identifies a practical way to combine sequential patterns and structural context when building tools for maintaining reliable distributed systems.

Core claim

This paper conducts a comparative study of log-based encoder architectures and graph-based models for automated fault diagnosis. It evaluates the models on TraceBench, a trace-oriented log dataset, and on BGL, a traditional system log dataset, covering both anomaly detection and fault type classification. The results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance.

What carries the argument

Hybrid architectures that integrate learned representations from log encoders into graph-based models to capture both sequential event patterns and structural log relationships.
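
As a rough illustration of what feeding encoder representations into a graph model can look like, the sketch below uses per-event vectors as node features, applies one round of mean aggregation over trace edges, and pools the result into a trace-level vector. The function names, the toy two-dimensional vectors (stand-ins for encoder outputs such as BERT embeddings), the undirected edges, and the mean-aggregation rule are all assumptions made for illustration; this summary does not specify the paper's actual architecture.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_aggregate(features: Dict[int, Vector],
                   edges: List[Tuple[int, int]]) -> Dict[int, Vector]:
    """One message-passing round: average each node with its neighbours."""
    # Every node starts with a self-loop so its own embedding is retained.
    neighbours: Dict[int, List[int]] = {node: [node] for node in features}
    for parent, child in edges:
        # Edges are treated as undirected here (an assumption).
        neighbours[parent].append(child)
        neighbours[child].append(parent)
    aggregated: Dict[int, Vector] = {}
    for node, members in neighbours.items():
        dim = len(features[node])
        aggregated[node] = [
            sum(features[m][d] for m in members) / len(members)
            for d in range(dim)
        ]
    return aggregated


def mean_readout(features: Dict[int, Vector]) -> Vector:
    """Pool node vectors into a single trace-level vector."""
    vectors = list(features.values())
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical encoder outputs for three events in one trace.
event_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
# Hypothetical parent-child edges: event 0 calls events 1 and 2.
trace_edges = [(0, 1), (0, 2)]

node_states = mean_aggregate(event_embeddings, trace_edges)
trace_vector = mean_readout(node_states)
```

In a full model, a classifier head for anomaly detection or fault type classification would consume `trace_vector`; that head is omitted here.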

Load-bearing premise

The observed performance differences stem primarily from the choice of log-sequence versus graph representations rather than from specific model implementations, hyperparameter choices, or characteristics of the datasets.

What would settle it

Finding that a graph-only model, after equivalent hyperparameter tuning and training, matches or exceeds the hybrid model's accuracy on anomaly detection and fault classification in both TraceBench and BGL would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.14019 by Mathis Nguyen, Mohamed Ali Lajnef.

**Figure 1.** Overview of our end-to-end pipeline. The process consists of 3 main stages: 1) Log/Trace Preprocessing: Log and trace data from the TraceBench and BGL datasets are preprocessed through several steps, including parsing, grouping events into execution traces, and labeling. This stage generates the master tables, the common representation used by all models, which link traces to events, link events to one ano…

**Figure 2.** TraceBench Dataset (Simplified). The dataset is organized into a set of relational tables. Each entry in the Trace table corresponds to a complete execution instance of the system. The Event table records all events belonging to these executions, including both semantic attributes (e.g., operation name, description) and temporal metadata (start and end times). The Edges table specifies the parent–child rel…

**Figure 3.** RQ1 – Comparison of semantic (BERT) and struc…

**Figure 4.** RQ1 – Comparison of semantic (BERT) and struc…

**Figure 5.** RQ2 – Comparison of semantic (BERT) and struc…

**Figure 6.** RQ3 – Hybrid model performance (GNN + BERT)

**Figure 7.** RQ3 – Hybrid model performance (GNN + BERT)

Original abstract

Modern distributed systems generate large volumes of logs that can be analyzed to support essential AIOps tasks such as fault diagnosis, which plays a crucial role in maintaining system reliability. Most existing approaches rely on log-based models that treat logs as linear sequences of events. However, such representations discard the structural context between events that is often present in execution logs, such as parent-child dependencies, fan-out (branching), or temporal features. To better capture these relationships, recent works on Graph Neural Networks (GNNs) suggest that representing logs as graphs offers a promising alternative. Building on these observations, this paper conducts a comparative study of log-based encoder architectures (e.g., BERT) and graph-based models (e.g., GNNs) for automated fault diagnosis. We evaluate our models on TraceBench, a trace-oriented log dataset, and on BGL, a more traditional system log dataset, covering both anomaly detection and fault type classification. Our results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance. These findings highlight conditions under which graph-augmented architectures can outperform traditional log-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pure GNNs lag encoders here but hybrids win, though the graph contribution isn't cleanly separated from extra capacity.

The main thing to know is that this paper runs a head-to-head comparison of log-sequence encoders and graph models for fault diagnosis. It reports that graph-only versions underperform BERT-style baselines, while hybrids that feed encoder representations into GNNs come out on top across TraceBench and BGL for both anomaly detection and fault-type classification. It sticks to existing components and focuses on the representation question rather than new architectures, which keeps the work grounded and directly useful for AIOps practitioners choosing between sequence and graph views of logs. The motivation around missing structural context in linear log models is clear, and testing on two datasets with different characteristics adds some breadth to the comparison. The results are presented without overclaiming, which is a strength for an empirical study.

The soft spot is the hybrid interpretation. The performance edge could easily come from richer input features or higher model capacity when encoder embeddings are added, rather than from the graph edges or aggregation themselves. Without controls that hold the log representations fixed and toggle only the graph component, the claim that graph augmentation is what drives the gain rests on weaker ground. Details on exact baselines, hyperparameter search, metrics, and statistical tests would also help judge how reliable the gaps are.

This paper is aimed at researchers and engineers working on log-based fault diagnosis in distributed systems. A reader looking for concrete data on when to prefer or combine these representations will find usable takeaways, even if the work is more confirmatory than innovative. It deserves peer review because the comparison is relevant to a practical subfield and the datasets are standard, though referees should press on the ablation controls and capacity confounds.
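
The control described in this note (hold the log representations fixed, toggle only the graph component) could be set up roughly as in the sketch below. Both arms share the same input vectors and the same weight matrix, so parameter count is identical; the only difference is whether neighbours are pooled. The layer form (mean pooling, linear map, ReLU), the `use_graph` flag, and the toy values are assumptions for illustration, not the paper's design.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def layer(features: Dict[int, Vector],
          edges: List[Tuple[int, int]],
          weight: Matrix,
          use_graph: bool) -> Dict[int, Vector]:
    """Shared linear map + ReLU, with or without neighbour pooling."""
    # Each node always sees itself; neighbours are added only in the graph arm.
    hood: Dict[int, List[int]] = {node: [node] for node in features}
    if use_graph:
        for u, v in edges:
            hood[u].append(v)
            hood[v].append(u)
    out: Dict[int, Vector] = {}
    for node, members in hood.items():
        dim = len(features[node])
        pooled = [
            sum(features[m][d] for m in members) / len(members)
            for d in range(dim)
        ]
        # Same weights in both arms, so capacity is held fixed.
        out[node] = [
            max(0.0, sum(w * x for w, x in zip(row, pooled)))
            for row in weight
        ]
    return out


# Hypothetical fixed encoder embeddings and a single edge.
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
edges = [(0, 1)]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, to make the effect visible

graph_arm = layer(embeddings, edges, weight, use_graph=True)
no_graph_arm = layer(embeddings, edges, weight, use_graph=False)
```

With `use_graph=False` the layer reduces to a per-event MLP layer over the same embeddings, which is the kind of non-graph baseline that would separate the effect of edges from the effect of richer inputs.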

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparative study of log-based encoder architectures (e.g., BERT) versus graph-based models (e.g., GNNs) for fault diagnosis in distributed systems. It evaluates both approaches, along with hybrids that integrate encoder representations into graph models, on the TraceBench trace-oriented dataset and the BGL system log dataset for the tasks of anomaly detection and fault type classification. The central claims are that graph-only models fail to outperform encoder baselines, while the hybrid models achieve the strongest overall performance, highlighting conditions under which graph augmentation is beneficial.

Significance. If the results hold after addressing controls for model capacity, the work would offer practical guidance to the AIOps community on representation choices for log-based fault diagnosis. It would clarify that graph structure alone is insufficient but can enhance encoder features, potentially shaping future model designs for system reliability tasks.

major comments (2)

[Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.
[Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.

minor comments (2)

[Abstract] The abstract states that the work covers 'conditions under which graph-augmented architectures can outperform' but the main text should explicitly enumerate these conditions with direct references to the supporting tables or figures.
[Methodology] Notation for the graph construction process (e.g., how parent-child dependencies and temporal features are encoded as edges) could be clarified with a small illustrative example or pseudocode to improve reproducibility.
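
The kind of small illustrative example the [Methodology] comment asks for might resemble the sketch below. It assumes relational inputs loosely modelled on the Event and Edges tables described in the Figure 2 caption; the field names, the chaining of events by start time into temporal edges, and the returned dictionary layout are hypothetical choices, not the authors' construction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    event_id: str
    op_name: str   # semantic attribute
    start: float   # temporal metadata
    end: float


def build_trace_graph(events: List[Event],
                      parent_child: List[Tuple[str, str]]) -> Dict[str, object]:
    """Turn one trace's events and parent-child rows into graph inputs."""
    known = {e.event_id for e in events}
    children: Dict[str, List[str]] = {}
    for parent, child in parent_child:
        # Skip rows that point outside this trace.
        if parent in known and child in known:
            children.setdefault(parent, []).append(child)
    # Fan-out (branching) is the number of direct children per event.
    fan_out = {e.event_id: len(children.get(e.event_id, [])) for e in events}
    # Assumed temporal edges: link consecutive events ordered by start time.
    ordered = sorted(events, key=lambda e: e.start)
    temporal = [(a.event_id, b.event_id) for a, b in zip(ordered, ordered[1:])]
    # Duration as a simple temporal node feature.
    duration = {e.event_id: e.end - e.start for e in events}
    return {"children": children, "fan_out": fan_out,
            "temporal": temporal, "duration": duration}


# Hypothetical trace: request "a" spawns two child operations.
events = [
    Event("a", "handle_request", 0.0, 10.0),
    Event("b", "read_cache", 1.0, 4.0),
    Event("c", "write_db", 5.0, 9.0),
]
graph = build_trace_graph(events, [("a", "b"), ("a", "c")])
```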

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive feedback on our manuscript. We appreciate the referee's insights into strengthening the experimental evaluation and results reporting. We address each major comment below, and we will revise the manuscript to incorporate the suggested improvements.

Referee: [Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.

Authors: We agree with this observation. To better isolate the contribution of the graph structure, we will add an ablation study in the revised manuscript where the encoder embeddings are fed into a non-graph baseline such as an MLP, with all other factors (e.g., embedding dimensions, training procedures) held fixed. This will provide clearer evidence on whether the performance gains in hybrid models stem from the graph augmentation rather than increased model capacity. We believe this addition will strengthen the attribution of results to the log-sequence versus graph representations.
revision: yes

Referee: [Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.

Authors: We acknowledge that the current manuscript lacks sufficient details on the experimental setup for reproducibility and statistical rigor. In the revised version, we will include comprehensive information on the hyperparameter search procedures (e.g., grid search ranges and selected values), the number of independent runs performed, and the results of statistical significance tests (such as t-tests) comparing the performance differences. This will support the claim that graph-only models do not outperform encoder baselines and that the observed differences are not due to implementation choices.
revision: yes
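
For the significance testing mentioned in this response, a paired t-statistic over matched runs (same seeds or splits for both models) is one common form. The sketch below computes only the statistic and its degrees of freedom using the Python standard library; the function name is hypothetical and the scores are made-up placeholder numbers, not results from the paper.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(scores_a: List[float],
                       scores_b: List[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired per-run scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for five matched runs (illustrative only).
encoder_runs = [0.90, 0.92, 0.91, 0.93, 0.89]
graph_runs = [0.88, 0.89, 0.90, 0.90, 0.88]

t_stat, dof = paired_t_statistic(encoder_runs, graph_runs)
```

The resulting `t_stat` would then be compared against a t-distribution with `dof` degrees of freedom to obtain a p-value; that step is left out here to keep the sketch dependency-free.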

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on external benchmarks

The paper conducts an empirical evaluation of log-sequence encoders versus graph-based GNNs (and hybrids) for fault diagnosis on the TraceBench and BGL datasets. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims rest on direct experimental performance metrics rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The central result (hybrid superiority) is presented as an observed outcome on held-out data, not a logical consequence of the paper's own inputs. This is a standard self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This empirical comparison relies on standard machine learning techniques and publicly referenced datasets without introducing new parameters, axioms, or entities.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

[1] M. Team, “Tracebench: A benchmark dataset for trace-oriented monitoring,” https://mtracer.github.io/TraceBench/, 2021

[2] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2017, pp. 1285–1298

[3] H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via bert,” 2021

[4] Y. Lee, J. Kim, and P. Kang, “Lanobert: System log anomaly detection based on bert masked language model,” 2021

[5] Z. Li, J. Shi, and M. van Leeuwen, “Graph neural networks based log anomaly detection and explanation,” 2024. [Online]. Available: https://arxiv.org/abs/2307.00527

[6] J. Chen, F. Liu, G. Zhong, J. Jiang, D. Xu, S. Shi, and Z. Tan, “Tracegra: A trace-based anomaly detection for microservice using graph deep learning,” SSRN Electronic Journal, 01 2022

[7] Y. Xie, H. Zhang, and M. A. Babar, “Loggd: detecting anomalies from system logs by graph neural networks,” 2022. [Online]. Available: https://arxiv.org/abs/2209.07869

[8] P. C. et al., “Deeptralog: Unified graph-based representation learning for microservice anomaly detection,” in ICSE, 2022

[9] A. D. Pazho, G. A. Noghre, A. A. Purkayastha, J. Vempati, O. Martin, and H. Tabkhi, “A survey of graph-based deep learning for anomaly detection in distributed systems,” 2023. [Online]. Available: https://arxiv.org/abs/2206.04149

[10] X. Wu, H. Li, and F. Khomh, “On the effectiveness of log representation for log-based anomaly detection,” Empirical Software Engineering,

[11] [Online]. Available: https://arxiv.org/abs/2308.08736
