arxiv: 2604.14019 · v1 · submitted 2026-04-15 · 💻 cs.SE

Log-based vs Graph-based Approaches to Fault Diagnosis

Pith reviewed 2026-05-10 12:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords: fault diagnosis, log analysis, graph neural networks, log encoders, anomaly detection, hybrid models, distributed systems, AIOps

The pith

Integrating representations from log encoders into graph models yields the strongest performance for automated fault diagnosis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares log-based encoders that process logs as event sequences against graph-based models that represent logs with structural relationships such as dependencies and branching. Evaluations on TraceBench and BGL datasets show that graph-only models do not beat encoder baselines for anomaly detection or fault type classification. The strongest results occur when learned representations from the log encoders are fed into graph architectures. This matters to readers because it identifies a practical way to combine sequential patterns and structural context when building tools for maintaining reliable distributed systems.

Core claim

This paper conducts a comparative study of log-based encoder architectures and graph-based models for automated fault diagnosis. It evaluates the models on TraceBench, a trace-oriented log dataset, and on BGL, a traditional system log dataset, covering both anomaly detection and fault type classification. The results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance.

What carries the argument

Hybrid architectures that integrate learned representations from log encoders into graph-based models to capture both sequential event patterns and structural log relationships.
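
A minimal Python sketch of that idea, under stated assumptions: the per-event vectors stand in for encoder outputs (e.g., BERT embeddings), and the aggregation rule (one round of mean aggregation over parent-child edges, followed by mean pooling per trace) is an illustrative choice, not the architecture reported in the paper. All names (`aggregate_once`, `pool_trace`, the toy event ids) are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_once(
    features: Dict[str, Vector], edges: List[Tuple[str, str]]
) -> Dict[str, Vector]:
    """One round of mean aggregation: each node averages its own
    feature vector with those of its direct neighbours."""
    neighbours: Dict[str, List[str]] = {node: [] for node in features}
    for parent, child in edges:
        # Treat parent-child links as undirected for aggregation (an assumption).
        neighbours[parent].append(child)
        neighbours[child].append(parent)
    return {
        node: mean([features[node]] + [features[m] for m in neighbours[node]])
        for node in features
    }


def pool_trace(features: Dict[str, Vector]) -> Vector:
    """Mean-pool node vectors into one trace-level vector for a classifier."""
    return mean(list(features.values()))


# Toy 2-dimensional vectors standing in for encoder embeddings of three events.
event_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
parent_child_edges = [("a", "b"), ("a", "c")]

updated = aggregate_once(event_embeddings, parent_child_edges)
trace_vector = pool_trace(updated)
```

In this sketch the sequential signal lives entirely in the input vectors and the structural signal enters only through `parent_child_edges`, which is the separation the comparison in the paper depends on.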

Load-bearing premise

The observed performance differences stem primarily from the choice of log-sequence versus graph representations rather than from specific model implementations, hyperparameter choices, or characteristics of the datasets.

What would settle it

Finding that a graph-only model, after equivalent hyperparameter tuning and training, matches or exceeds the hybrid model's accuracy on anomaly detection and fault classification in both TraceBench and BGL would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.14019 by Mathis Nguyen, Mohamed Ali Lajnef.

Figure 1: Overview of our end-to-end pipeline. The process consists of 3 main stages: 1) Log/Trace Preprocessing: Log and trace data from the TraceBench and BGL datasets are preprocessed through several steps, including parsing, grouping events into execution traces, and labeling. This stage generates the master tables, the common representation used by all models, which link traces to events, link events to one ano…
Figure 2: TraceBench Dataset (Simplified). The dataset is organized into a set of relational tables. Each entry in the Trace table corresponds to a complete execution instance of the system. The Event table records all events belonging to these executions, including both semantic attributes (e.g., operation name, description) and temporal metadata (start and end times). The Edges table specifies the parent–child rel…
Figure 3: RQ1 – Comparison of semantic (BERT) and struc…
Figure 4: RQ1 – Comparison of semantic (BERT) and struc…
Figure 5: RQ2 – Comparison of semantic (BERT) and struc…
Figure 6: RQ3 – Hybrid model performance (GNN + BERT)
Figure 7: RQ3 – Hybrid model performance (GNN + BERT)

Original abstract

Modern distributed systems generate large volumes of logs that can be analyzed to support essential AIOps tasks such as fault diagnosis, which plays a crucial role in maintaining system reliability. Most existing approaches rely on log-based models that treat logs as linear sequences of events. However, such representations discard the structural context between events that are often present in execution logs, such as parent-child dependencies, fan-out (branching), or temporal features. To better capture these relationships, recent works on Graph Neural Networks (GNNs) suggest that representing logs as graphs offers a promising alternative. Building on these observations, this paper conducts a comparative study of log-based encoder architectures (e.g., BERT) and graph-based models (e.g., GNNs) for automated fault diagnosis. We evaluate our models on TraceBench, a trace-oriented log dataset, and on BGL, a more traditional system log dataset, covering both anomaly detection and fault type classification. Our results show that graph-only models fail to outperform encoder baselines. However, integrating learned representations from log encoders into graph-based models achieves the strongest overall performance. These findings highlight conditions under which graph-augmented architectures can outperform traditional log-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparative study of log-based encoder architectures (e.g., BERT) versus graph-based models (e.g., GNNs) for fault diagnosis in distributed systems. It evaluates both approaches, along with hybrids that integrate encoder representations into graph models, on the TraceBench trace-oriented dataset and the BGL system log dataset for the tasks of anomaly detection and fault type classification. The central claims are that graph-only models fail to outperform encoder baselines, while the hybrid models achieve the strongest overall performance, highlighting conditions under which graph augmentation is beneficial.

Significance. If the results hold after addressing controls for model capacity, the work would offer practical guidance to the AIOps community on representation choices for log-based fault diagnosis. It would clarify that graph structure alone is insufficient but can enhance encoder features, potentially shaping future model designs for system reliability tasks.

major comments (2)
  1. [Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.
  2. [Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.
minor comments (2)
  1. [Abstract] The abstract states that the work covers 'conditions under which graph-augmented architectures can outperform' but the main text should explicitly enumerate these conditions with direct references to the supporting tables or figures.
  2. [Methodology] Notation for the graph construction process (e.g., how parent-child dependencies and temporal features are encoded as edges) could be clarified with a small illustrative example or pseudocode to improve reproducibility.
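
One way the illustrative example requested in the last minor comment could look is sketched below in Python. This is an assumption about a plausible encoding, not the paper's actual graph construction: parent-child dependencies become typed edges, temporal order is encoded as an edge between events that are consecutive by start time, and fan-out is the number of direct children. The `Event` fields and the edge-type labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    event_id: str
    parent_id: Optional[str]  # None for the root event of a trace
    start: float              # start timestamp, arbitrary units


def build_edges(events: List[Event]) -> List[Tuple[str, str, str]]:
    """Return (source, target, edge_type) triples for one trace."""
    edges: List[Tuple[str, str, str]] = []
    # Structural edges: one per parent-child dependency.
    for ev in events:
        if ev.parent_id is not None:
            edges.append((ev.parent_id, ev.event_id, "parent_child"))
    # Temporal edges: link each event to the next one by start time.
    ordered = sorted(events, key=lambda ev: ev.start)
    for earlier, later in zip(ordered, ordered[1:]):
        edges.append((earlier.event_id, later.event_id, "next_in_time"))
    return edges


def fan_out(events: List[Event]) -> Dict[str, int]:
    """Number of direct children per event (branching factor)."""
    counts: Dict[str, int] = {ev.event_id: 0 for ev in events}
    for ev in events:
        if ev.parent_id is not None:
            counts[ev.parent_id] = counts.get(ev.parent_id, 0) + 1
    return counts


# Toy trace: a root event that spawns two children.
events = [
    Event("root", None, 0.0),
    Event("a", "root", 1.0),
    Event("b", "root", 2.0),
]
edges = build_edges(events)
branching = fan_out(events)
```

Keeping the edge type as an explicit label lets a downstream model treat structural and temporal relations differently, or ignore one of them in an ablation.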

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive feedback on our manuscript. We appreciate the referee's insights into strengthening the experimental evaluation and results reporting. We address each major comment below, and we will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Experimental Evaluation] The hybrid models integrate learned representations from log encoders into GNNs and are reported to achieve the strongest performance on TraceBench and BGL. However, the experimental design lacks an ablation that feeds the identical encoder embeddings into a non-graph baseline (such as an MLP or the encoder alone) while holding all other factors fixed. Without this control, it remains possible that observed gains arise from increased model capacity rather than the addition of graph edges or aggregation, directly weakening the attribution of results to log-sequence versus graph representations.

    Authors: We agree with this observation. To better isolate the contribution of the graph structure, we will add an ablation study in the revised manuscript where the encoder embeddings are fed into a non-graph baseline such as an MLP, with all other factors (e.g., embedding dimensions, training procedures) held fixed. This will provide clearer evidence on whether the performance gains in hybrid models stem from the graph augmentation rather than increased model capacity. We believe this addition will strengthen the attribution of results to the log-sequence versus graph representations. revision: yes

  2. Referee: [Results] The claim that graph-only models fail to outperform encoder baselines is load-bearing for the comparative conclusions, yet the results sections provide no details on hyperparameter search procedures, number of runs, or statistical significance tests for the performance differences on either dataset. This leaves open the possibility that differences are driven by implementation choices rather than the choice of representation.

    Authors: We acknowledge that the current manuscript lacks sufficient details on the experimental setup for reproducibility and statistical rigor. In the revised version, we will include comprehensive information on the hyperparameter search procedures (e.g., grid search ranges and selected values), the number of independent runs performed, and the results of statistical significance tests (such as t-tests) comparing the performance differences. This will support the claim that graph-only models do not outperform encoder baselines and that the observed differences are not due to implementation choices. revision: yes
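
The kind of test mentioned in the second response can be sketched as follows. This is a hedged illustration only: it assumes both models are trained on the same seeds and splits, so that a paired t-test applies, and the score lists are made-up placeholders, not results from the paper. The function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics
from typing import List


def paired_t_statistic(scores_a: List[float], scores_b: List[float]) -> float:
    """Paired t statistic over per-run score differences (a minus b).

    Assumes run i of both models used the same seed and data split.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need the same number of runs")
    if len(scores_a) < 2:
        raise ValueError("at least two paired runs are required")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("differences are constant; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder scores for five paired runs (illustrative values only).
placeholder_a = [0.80, 0.82, 0.81, 0.83, 0.79]
placeholder_b = [0.78, 0.79, 0.80, 0.80, 0.78]

t_value = paired_t_statistic(placeholder_a, placeholder_b)
degrees_of_freedom = len(placeholder_a) - 1
```

Turning the statistic into a p-value requires the t distribution with `degrees_of_freedom`; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. With only a handful of runs, reporting the per-run scores alongside the test is also worth considering.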

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on external benchmarks

The paper conducts an empirical evaluation of log-sequence encoders versus graph-based GNNs (and hybrids) for fault diagnosis on the TraceBench and BGL datasets. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims rest on direct experimental performance metrics rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The central result (hybrid superiority) is presented as an observed outcome on held-out data, not a logical consequence of the paper's own inputs. This is a standard self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This empirical comparison relies on standard machine learning techniques and publicly referenced datasets without introducing new parameters, axioms, or entities.
