
arxiv: 2606.13532 · v1 · pith:2GTCRFSBnew · submitted 2026-06-11 · 💻 cs.NI · cs.LG

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3

classification 💻 cs.NI cs.LG
keywords root cause analysis · causal discovery · cloud networks · Granger causality · conditional independence · time series · graph traversal · network incidents

The pith

A causal graph built from binary time series recovers the root cause in 85.7% of cloud network incidents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a graph-based causal discovery method can perform root cause analysis in complex cloud networks. The method groups data spatiotemporally, builds a causal graph using Granger causality and conditional independence tests on binary series, and scores causes with time-lag probabilities. Evaluation on 35 real incidents shows 85.7% recall of the correct root cause and 74.3% exact matches. Readers would care because such an approach has already been deployed for over 800 incidents with positive feedback from engineers, suggesting it scales to live environments.

Core claim

The authors report that interpretable root cause scoring is possible with a three-step pipeline. After spatiotemporal grouping, a causal graph is constructed from binary time series using bivariate Granger causality and conditional independence tests. A probabilistic method then assigns each edge a conditional probability as a function of time lag, and candidate root causes are scored by traversing the graph. On 35 production incidents, the method recalled the correct root cause in 85.7% of cases and matched it exactly in 74.3%.
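The summary does not give the functional form of the lag-dependent edge probabilities. As a minimal sketch, assuming a purely empirical estimate, the function below counts how often the effect fires exactly k steps after the cause fired. The name `lag_probabilities` and the toy series are hypothetical and not taken from the paper.

```python
def lag_probabilities(cause, effect, max_lag):
    """Empirical P(effect[t + k] == 1 | cause[t] == 1) for k = 1..max_lag.

    Assumption: both inputs are equal-length lists of 0/1 values.
    """
    probs = {}
    for lag in range(1, max_lag + 1):
        # Time steps where the cause fired and t + lag is still in range.
        fired = [t for t in range(len(cause) - lag) if cause[t] == 1]
        if not fired:
            probs[lag] = 0.0
            continue
        hits = sum(effect[t + lag] for t in fired)
        probs[lag] = hits / len(fired)
    return probs


# Toy data (hypothetical): the effect tends to follow the cause by two steps.
cause = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
effect = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(lag_probabilities(cause, effect, max_lag=3))
```

A table like this is one way to make an edge "time-aware": the same edge contributes a different probability depending on how long after the cause the effect was observed.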

What carries the argument

The causal graph with time-lag-dependent edge probabilities used for root cause scoring through traversal.
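To make the traversal idea concrete, here is a minimal sketch under stated assumptions: each edge carries a lag-to-probability table, each node has an observed alarm onset time, and a candidate cause is scored by the best product of edge probabilities along any path to the symptom. The scoring rule, the function name `score_root_causes`, and the node names are illustrative assumptions, not the paper's actual procedure.

```python
def score_root_causes(edges, onset, symptom):
    """Score upstream nodes by the best path probability to the symptom.

    edges:  {(cause, effect): {lag: probability}}  (hypothetical format)
    onset:  {node: time of first alarm}
    """
    parents = {}
    for (cause, effect), table in edges.items():
        parents.setdefault(effect, []).append((cause, table))

    scores = {}
    stack = [(symptom, 1.0, {symptom})]
    while stack:
        node, path_prob, seen = stack.pop()
        for cause, table in parents.get(node, []):
            if cause in seen or cause not in onset:
                continue  # avoid cycles and nodes that never alarmed
            lag = onset[node] - onset[cause]
            prob = path_prob * table.get(lag, 0.0)
            if prob <= 0.0:
                continue  # observed lag is not supported by this edge
            scores[cause] = max(scores.get(cause, 0.0), prob)
            stack.append((cause, prob, seen | {cause}))
    return scores


edges = {
    ("link_flap", "bgp_down"): {1: 0.8, 2: 0.1},
    ("bgp_down", "packet_loss"): {1: 0.5, 2: 0.4},
    ("cpu_high", "packet_loss"): {1: 0.2},
}
onset = {"link_flap": 0, "bgp_down": 1, "cpu_high": 2, "packet_loss": 3}
print(score_root_causes(edges, onset, "packet_loss"))
```

Because every score is a product of named edge probabilities, an engineer can read off which path and which lags produced it, which is the kind of interpretability the summary attributes to the method. How the paper turns such scores into a final ranking is not specified here.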

If this is right

  • The approach reduces the dimensionality of the problem using a spatiotemporal grouping strategy and an automation ontology.
  • The probabilistic inference provides time-aware and interpretable scores for potential root causes.
  • The system has been successfully deployed and used in over 800 real-world incidents.
  • Positive qualitative feedback from network engineers supports its practicality in dynamic environments.
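The grouping step in the first bullet can be illustrated with a minimal sketch. It assumes raw events arrive as (timestamp, device, signal) tuples, that devices map to sites through a lookup table, and that time is cut into fixed-width bins. The event format, the `site_of` mapping, and the function name are hypothetical; the paper's actual grouping rules and ontology are not described in this summary.

```python
from collections import defaultdict


def to_binary_series(events, bin_seconds, n_bins, site_of):
    """Collapse raw events into one 0/1 series per (site, signal) pair.

    events:  iterable of (timestamp_seconds, device, signal)
    site_of: {device: site}  (stand-in for the spatial grouping)
    """
    series = defaultdict(lambda: [0] * n_bins)
    for timestamp, device, signal in events:
        index = int(timestamp // bin_seconds)
        if 0 <= index < n_bins:
            # Many devices at one site share a single series.
            series[(site_of[device], signal)][index] = 1
    return dict(series)


events = [(5, "r1", "link_down"), (12, "r2", "link_down"), (65, "r3", "bgp_flap")]
site_of = {"r1": "siteA", "r2": "siteA", "r3": "siteB"}
print(to_binary_series(events, bin_seconds=60, n_bins=3, site_of=site_of))
```

Two routers at the same site collapse into one series here, which is the sense in which grouping reduces the number of variables the causal discovery step has to test.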

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the recovered causal graph reflects the true process, the method could extend to root cause analysis in other large-scale networked systems like telecommunications or transportation networks.
  • The time-lag probabilities might be used to estimate the speed of incident propagation across the network.
  • Further refinements to the binary time series encoding could improve the exact match rate in future evaluations.

Load-bearing premise

That the bivariate Granger causality tests and conditional independence checks on the grouped binary time series data recover edge probabilities that reflect the true causal relationships in the cloud network.
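What a bivariate Granger test on binary series checks can be sketched as follows. This is a minimal lag-1 version under stated assumptions: it tests whether x[t-1] is independent of y[t] given y[t-1], using a likelihood-ratio (G) statistic summed over the two strata of y[t-1]. The paper's exact test, lag order, and thresholds are not given in this summary, and the function name is hypothetical.

```python
import math
import random


def granger_binary_pvalue(x, y):
    """p-value for H0: x[t-1] is independent of y[t] given y[t-1] (lag 1)."""
    # counts[a][b][c] = number of t with y[t-1] == a, x[t-1] == b, y[t] == c
    counts = [[[0, 0], [0, 0]] for _ in range(2)]
    for t in range(1, len(y)):
        counts[y[t - 1]][x[t - 1]][y[t]] += 1

    g_stat, dof = 0.0, 0
    for table in counts:
        rows = [sum(r) for r in table]
        cols = [table[0][c] + table[1][c] for c in range(2)]
        total = sum(rows)
        if min(rows) == 0 or min(cols) == 0:
            continue  # degenerate stratum carries no information
        dof += 1
        for b in range(2):
            for c in range(2):
                obs = table[b][c]
                if obs:
                    g_stat += 2.0 * obs * math.log(obs * total / (rows[b] * cols[c]))
    g_stat = max(g_stat, 0.0)
    if dof == 0:
        return 1.0
    if dof == 1:
        return math.erfc(math.sqrt(g_stat / 2.0))  # chi-square tail, 1 dof
    return math.exp(-g_stat / 2.0)  # chi-square tail, 2 dof


rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(500)]
y = [0] + x[:-1]  # y copies x with a one-step delay
print(granger_binary_pvalue(x, y), granger_binary_pvalue(y, x))
```

A test of this kind only conditions on the two series involved, so a hidden common driver of both can still produce a small p-value. That limitation is exactly why the premise above is load-bearing.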

What would settle it

A new collection of incidents on which the scored root causes agree with independent expert determinations at a rate well below the reported 85.7% would undermine the effectiveness claim.
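How far below 85.7% a replication would need to fall depends on the uncertainty of a rate measured on 35 incidents. The sketch below computes a Wilson score interval for a binomial proportion. The counts 30 of 35 and 26 of 35 are inferred from the reported percentages and are not stated in the summary, so treat them as an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 30/35 rounds to 85.7%, 26/35 rounds to 74.3%.
for label, hits in (("recall", 30), ("exact match", 26)):
    low, high = wilson_interval(hits, 35)
    print(f"{label}: {hits / 35:.1%} (interval {low:.1%} to {high:.1%})")
```

With a sample this small, the interval is wide, so a modestly lower rate on a new set of incidents would not by itself contradict the reported figures.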

Figures

Figures reproduced from arXiv: 2606.13532 by Dominik Janzing, Fabien Chraim, John Evans.

Figure 1. Automation Ontology: Observations (metrics, traces, ...
Figure 2. Subset of the learned causal graph for a particular ...
Figure 3. Example conditional probability functions
Figure 4. A comparison of exact match accuracy by method.
Original abstract

Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a novel graphical causal reasoning method for root cause analysis in cloud networks. It uses spatiotemporal grouping and an automation ontology to reduce dimensionality, builds a causal graph from binary time series using bivariate Granger causality and conditional independence tests, and uses lag-dependent probabilistic edge weights for inference. On a dataset of 35 labeled production incidents, it achieves 85.7% recall of the correct root cause and 74.3% exact match. The system has been deployed in over 800 incidents with positive engineer feedback.

Significance. If the recovered causal graphs accurately reflect the underlying network dependencies, this could represent a significant advance in applying causal discovery to operational RCA in large-scale systems, moving beyond rule-based approaches. The production deployment in over 800 incidents with qualitative feedback from engineers provides real-world evidence of practicality and is a clear strength. The time-aware probabilistic scoring via graph traversal offers interpretability that is valuable in operational settings.

major comments (2)
  1. [Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.
  2. [Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.
minor comments (1)
  1. [Abstract] The abstract states that edge probabilities are 'a function of time lag' but does not specify the functional form or estimation procedure; this should be made explicit with an equation in the main text.
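The multiple-testing concern in major comment 2 arises because one Granger test is run per ordered pair of series, so some small p-values are expected by chance. One standard control, shown here as a minimal sketch, is the Benjamini-Hochberg procedure for the false discovery rate. Nothing in the summary indicates that the paper uses it; the p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected at false discovery rate alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value sits under the rising threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values, one per candidate edge.
pvalues = [0.001, 0.045, 0.025, 0.20, 0.009]
print(benjamini_hochberg(pvalues))  # → [0, 2, 4]
```

An uncorrected 0.05 threshold would also keep the edge with p = 0.045; the correction drops it because four tests at that level are no longer surprising.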

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our evaluation and methodological controls.

  1. Referee: [Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.

    Authors: We agree that the absence of synthetic benchmarks with known causal structure leaves open the possibility that recovered edges capture correlations rather than causation. Our evaluation design prioritizes operational utility, using human expert labels on real production incidents as the relevant ground truth for root-cause identification. The deployment across more than 800 incidents with positive engineer feedback supplies complementary evidence of practical value. In revision we will expand the Evaluation section to explicitly discuss this limitation, justify the chosen metrics, and outline why obtaining interventional or synthetic ground truth is difficult in live cloud networks. revision: partial

  2. Referee: [Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.

    Authors: The spatiotemporal grouping and automation ontology are intended to reduce dimensionality and limit the scope of tested relationships, while the subsequent conditional-independence tests prune edges between variables that are found to be conditionally independent. We acknowledge that these steps do not constitute a full suite of confounding controls or synthetic recovery benchmarks. In the revised manuscript we will add an explicit subsection in the Method section describing the controls that are present, their limitations, and the rationale for not including synthetic recovery experiments at this stage. revision: partial
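The synthetic recovery benchmark the referee asks for can be sketched as follows: binary series are generated from a known graph so that any discovered edge set can be compared with the truth. This is a minimal sketch with hypothetical names and parameters, and it assumes every lag is at least one step; it is not an experiment from the paper.

```python
import random


def simulate(n_steps, roots, edges, seed=0):
    """Generate 0/1 series from a known causal graph.

    roots: {node: spontaneous firing rate per step}
    edges: [(cause, effect, lag, prob)] with lag >= 1 (assumption)
    """
    rng = random.Random(seed)
    nodes = set(roots) | {e[0] for e in edges} | {e[1] for e in edges}
    series = {node: [0] * n_steps for node in nodes}
    for t in range(n_steps):
        for node, rate in roots.items():
            if rng.random() < rate:
                series[node][t] = 1
        for cause, effect, lag, prob in edges:
            # The effect fires `lag` steps after the cause with probability `prob`.
            if t >= lag and series[cause][t - lag] == 1 and rng.random() < prob:
                series[effect][t] = 1
    return series


truth = [("a", "b", 1, 0.9), ("b", "c", 2, 0.8)]
data = simulate(1000, roots={"a": 0.05}, edges=truth, seed=1)
print({node: sum(values) for node, values in data.items()})
```

Running a discovery procedure on `data` and comparing its edges with `truth` gives precision and recall for the graph itself, which is separate from the incident-level accuracy the paper reports.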

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained


The abstract and available description outline a data-driven pipeline: bivariate Granger causality plus conditional independence tests on binary time series to build the graph, followed by a separate probabilistic scoring layer that assigns lag-dependent edge probabilities and traverses the graph for root-cause ranking. Evaluation is performed against an external labeled set of 35 human-annotated incidents, with no quoted equations or procedural steps showing that the reported figures (85.7% recall, 74.3% exact match) are algebraically identical to the fitted parameters or that the probability functions are defined in terms of the evaluation targets themselves. No self-citation chain, uniqueness theorem, or renaming of known results is invoked to close the loop. The central claim therefore retains independent empirical content relative to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the reported performance numbers rest on unstated modeling choices for the probability functions and grouping rules.

pith-pipeline@v0.9.1-grok · 5721 in / 1071 out tokens · 16505 ms · 2026-06-27T05:13:35.348556+00:00 · methodology

