Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

Dominik Janzing; Fabien Chraim; John Evans

arxiv: 2606.13532 · v1 · pith:2GTCRFSBnew · submitted 2026-06-11 · 💻 cs.NI · cs.LG

Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3

keywords: root cause analysis, causal discovery, cloud networks, Granger causality, conditional independence, time series, graph traversal, network incidents

The pith

On 35 labeled production incidents, a causal graph built from binary time series recalls the correct root cause in 85.7% of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a graph-based causal discovery method can perform root cause analysis in complex cloud networks. The method groups data spatiotemporally, builds a causal graph using Granger causality and conditional independence tests on binary series, and scores causes with time-lag probabilities. Evaluation on 35 real incidents shows 85.7% recall of the correct root cause and 74.3% exact matches. Readers would care because such an approach has already been deployed for over 800 incidents with positive feedback from engineers, suggesting it scales to live environments.
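The percentages can be turned back into incident counts, which also makes the small-sample uncertainty visible. The sketch below is not from the paper: it assumes the reported rates are rounded fractions of the 35 labeled incidents and applies a standard Wilson score interval. The helper names are illustrative.

```python
import math


def implied_count(rate_percent: float, n: int) -> int:
    """Nearest whole number of incidents consistent with a reported percentage."""
    return round(rate_percent / 100 * n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n_incidents = 35  # size of the labeled evaluation set, from the abstract
recalled = implied_count(85.7, n_incidents)  # assumption: rate is recalled / 35
exact = implied_count(74.3, n_incidents)     # assumption: rate is exact / 35
recall_ci = wilson_interval(recalled, n_incidents)
exact_ci = wilson_interval(exact, n_incidents)

print(f"recall: {recalled}/{n_incidents}, 95% CI {recall_ci[0]:.2f} to {recall_ci[1]:.2f}")
print(f"exact:  {exact}/{n_incidents}, 95% CI {exact_ci[0]:.2f} to {exact_ci[1]:.2f}")
```

Under that assumption the rates correspond to 30 and 26 of 35 incidents, and the interval for recall spans roughly 0.71 to 0.94, which is wide enough to matter when comparing methods on a set this small.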

Core claim

The authors construct a causal graph from binary time series: after spatiotemporal grouping, they apply bivariate Granger causality and conditional independence tests. A probabilistic method then assigns each edge a conditional probability as a function of time lag, which supports interpretable root cause scoring via graph traversal. On 35 production incidents, the method recalled the correct root cause in 85.7% of cases and produced an exact match in 74.3%.

What carries the argument

The causal graph with time-lag-dependent edge probabilities used for root cause scoring through traversal.
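To make the idea of traversal over lag-dependent edges concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's algorithm: the exponential lag curve, the rule of multiplying edge probabilities along a path, the choice to score only ancestors with no active parent, and all node names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

LagCurve = Callable[[float], float]


def decaying(p0: float, scale_s: float) -> LagCurve:
    """Hypothetical lag curve: probability p0 at lag 0, decaying with lag."""
    def curve(lag_s: float) -> float:
        return p0 * math.exp(-lag_s / scale_s) if lag_s >= 0 else 0.0
    return curve


def score_root_candidates(edges: Dict[Tuple[str, str], LagCurve],
                          onsets: Dict[str, float],
                          symptom: str) -> List[Tuple[str, float]]:
    """Walk upstream from the symptom and score ancestors with no active parent."""
    parents: Dict[str, List[Tuple[str, LagCurve]]] = {}
    for (cause, effect), curve in edges.items():
        parents.setdefault(effect, []).append((cause, curve))
    scores: Dict[str, float] = {}

    def visit(node: str, path_prob: float, seen: frozenset) -> None:
        extended = False
        for cause, curve in parents.get(node, []):
            if cause in seen or cause not in onsets:
                continue  # skip cycles and signals that never fired
            p = path_prob * curve(onsets[node] - onsets[cause])
            if p <= 0.0:
                continue  # cause fired after its supposed effect
            extended = True
            visit(cause, p, seen | {cause})
        if not extended and node != symptom:
            # No active parent: treat this node as a root candidate.
            scores[node] = max(scores.get(node, 0.0), path_prob)

    visit(symptom, 1.0, frozenset({symptom}))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical graph (cause, effect) and observed onset times in seconds.
edges = {
    ("config_push", "link_flap"): decaying(0.5, 300),
    ("link_flap", "packet_loss"): decaying(0.9, 120),
    ("packet_loss", "latency_alarm"): decaying(0.8, 60),
    ("cpu_spike", "latency_alarm"): decaying(0.1, 60),
}
onsets = {"config_push": 0, "link_flap": 30, "packet_loss": 60,
          "cpu_spike": 80, "latency_alarm": 90}
ranking = score_root_candidates(edges, onsets, "latency_alarm")
print(ranking)
```

Multiplying probabilities along a path always penalizes longer chains, so a real system would need some way to keep distant but genuine root causes competitive; the sketch makes no attempt at that.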

If this is right

The approach reduces the dimensionality of the problem using a spatiotemporal grouping strategy and an automation ontology.
The probabilistic inference provides time-aware and interpretable scores for potential root causes.
The system has been successfully deployed and used in over 800 real-world incidents.
Positive qualitative feedback from network engineers supports its practicality in dynamic environments.
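The summary does not say how the grouping or the ontology work, so the following is only a guess at the general shape of such a step: raw events are bucketed by location and signal type into fixed time windows, giving one binary series per group. The event tuple layout, the 60-second bucket, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp in seconds, site, signal type); hypothetical schema


def to_binary_series(events: Iterable[Event], start_s: int, end_s: int,
                     bucket_s: int) -> Dict[Tuple[str, str], List[int]]:
    """Collapse raw events into one 0/1 series per (site, signal) group."""
    n_buckets = (end_s - start_s + bucket_s - 1) // bucket_s
    series: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0] * n_buckets)
    for ts, site, signal in events:
        if start_s <= ts < end_s:  # ignore events outside the window
            series[(site, signal)][(ts - start_s) // bucket_s] = 1
    return dict(series)


events = [
    (5, "siteA", "link_down"),
    (7, "siteA", "link_down"),     # same bucket as the event above
    (65, "siteA", "packet_loss"),
    (130, "siteB", "link_down"),
    (500, "siteA", "link_down"),   # outside the window, dropped
]
grouped = to_binary_series(events, start_s=0, end_s=180, bucket_s=60)
print(grouped)
```

Grouping this way trades resolution for tractability: two devices at the same site become indistinguishable, but the number of series that have to be tested against each other drops sharply.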

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the recovered causal graph reflects the true process, the method could extend to root cause analysis in other large-scale networked systems like telecommunications or transportation networks.
The time-lag probabilities might be used to estimate the speed of incident propagation across the network.
Further refinements to the binary time series encoding could improve the exact match rate in future evaluations.

Load-bearing premise

That the bivariate Granger causality tests and conditional independence checks on the grouped binary time series data recover edge probabilities that reflect the true causal relationships in the cloud network.
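For readers unfamiliar with Granger-style testing on 0/1 data, the sketch below shows one simple variant: a lag-1 likelihood-ratio (G) test of whether X at time t-1 improves the prediction of Y at time t beyond Y's own previous value. This is a swapped-in illustration, not the test used in the paper, whose exact form is not given here. The simulated data and function name are hypothetical, and the closed-form p-value assumes all four (Y, X) contexts occur, so that the statistic has 2 degrees of freedom.

```python
import math
import random
from collections import Counter


def granger_g_test_lag1(cause, effect):
    """Return (G, p) for H0: effect[t] is independent of cause[t-1] given effect[t-1]."""
    full = Counter()
    for t in range(1, len(effect)):
        full[(effect[t - 1], cause[t - 1], effect[t])] += 1

    restricted = Counter()
    for (e_prev, _c_prev, e_now), n in full.items():
        restricted[(e_prev, e_now)] += n

    def loglik(counts, ctx_len):
        # Maximum-likelihood log-likelihood of the last key element given the context.
        ctx = Counter()
        for key, n in counts.items():
            ctx[key[:ctx_len]] += n
        return sum(n * math.log(n / ctx[key[:ctx_len]]) for key, n in counts.items())

    g = max(2.0 * (loglik(full, 2) - loglik(restricted, 1)), 0.0)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-g / 2).
    return g, math.exp(-g / 2.0)


# Simulated binary series in which x drives y with a one-step delay.
rng = random.Random(0)
n = 2000
x = [int(rng.random() < 0.2) for _ in range(n)]
y = [0] + [int(rng.random() < (0.8 if x[t - 1] else 0.05)) for t in range(1, n)]

g_xy, p_xy = granger_g_test_lag1(x, y)  # does x help predict y?
g_yx, p_yx = granger_g_test_lag1(y, x)  # does y help predict x?
print(f"x -> y: G = {g_xy:.1f}, p = {p_xy:.3g}")
print(f"y -> x: G = {g_yx:.1f}, p = {p_yx:.3g}")
```

A test like this only detects predictive precedence between two series. Whether the resulting edge is causal depends on there being no unobserved common driver, which is exactly the premise stated above.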

What would settle it

Finding a collection of new incidents where the scored root causes do not align with independent expert determinations at a rate comparable to the reported 85.7% would falsify the effectiveness claim.

Figures

Figures reproduced from arXiv: 2606.13532 by Dominik Janzing, Fabien Chraim, John Evans.

**Figure 2.** Subset of the learned causal graph for a particular …

**Figure 3.** Example conditional probability functions

**Figure 4.** A comparison of exact match accuracy by method.

Original abstract

Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a deployed causal RCA pipeline for cloud networks using standard time-series tests plus grouping, with solid production use but weak validation that the graph edges are actually causal.

The one thing to know is that this work combines spatiotemporal grouping, an automation ontology, and lag-dependent probabilities on top of bivariate Granger causality plus conditional independence tests to build a causal graph for root cause scoring in cloud networks. It reports 85.7% recall of the correct root cause on 35 labeled production incidents and has run on over 800 real incidents with positive engineer feedback.

The concrete pipeline for this domain is new: prior causal discovery work is cited, but none of it is described as assembling these pieces for network RCA with time-lag scoring. The deployment numbers and qualitative feedback are the strongest part of the evidence.

The soft spots sit in the evaluation. Performance is measured only against human labels for the root cause, not against any independent check that the recovered edges or probabilities match the true generative process. Bivariate tests on binary series in a high-dimensional network are known to surface spurious associations from confounding and multiple testing; the paper does not appear to include synthetic benchmarks or interventional checks to rule that out. Labeling details, baseline comparisons, and sensitivity to grouping choices are also thin in the provided description.
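The confounding concern can be illustrated with a small simulation. This is a toy example with invented parameters, not data or code from the paper: a hidden series Z drives both X and Y at different delays, so X appears to predict Y in a bivariate check even though X has no effect on Y, and the association disappears once Z is conditioned on.

```python
import random


def simulate(n, seed=0):
    """Z drives X (delay 1) and Y (delay 2); X has no effect on Y."""
    rng = random.Random(seed)
    z = [int(rng.random() < 0.2) for _ in range(n)]
    x = [0] * n
    y = [0] * n
    for t in range(2, n):
        x[t] = int(rng.random() < (0.9 if z[t - 1] else 0.02))
        y[t] = int(rng.random() < (0.9 if z[t - 2] else 0.02))
    return z, x, y


def rate_y_given(z, x, y, x_prev, z_prev2=None):
    """Empirical P(Y_t = 1 | X_{t-1} = x_prev [, Z_{t-2} = z_prev2])."""
    hits = total = 0
    for t in range(3, len(y)):
        if x[t - 1] != x_prev:
            continue
        if z_prev2 is not None and z[t - 2] != z_prev2:
            continue
        total += 1
        hits += y[t]
    return hits / total


z, x, y = simulate(20000)
# Bivariate view: X one step earlier looks strongly predictive of Y.
marginal_gap = rate_y_given(z, x, y, 1) - rate_y_given(z, x, y, 0)
# Conditioning on the common driver removes most of the apparent effect.
conditional_gap = rate_y_given(z, x, y, 1, z_prev2=1) - rate_y_given(z, x, y, 0, z_prev2=1)
print(f"marginal gap:    {marginal_gap:.3f}")
print(f"conditional gap: {conditional_gap:.3f}")
```

The conditional check only helps when the common driver is among the measured series. A conditional independence step can prune edges like this one, but it cannot correct for a confounder that was never recorded.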

The math and citation pattern look standard for this style of causal discovery. No load-bearing contradictions show up.

This paper is for operations researchers and network teams who want to try data-driven RCA. A practitioner reader could extract a usable pipeline to test on their own logs. It deserves a serious referee because the production deployment gives it real-world weight even if the causal validation needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a novel graphical causal reasoning method for root cause analysis in cloud networks. It uses spatiotemporal grouping and an automation ontology to reduce dimensionality, builds a causal graph from binary time series using bivariate Granger causality and conditional independence tests, and uses lag-dependent probabilistic edge weights for inference. On a dataset of 35 labeled production incidents, it achieves 85.7% recall of the correct root cause and 74.3% exact match. The system has been deployed in over 800 incidents with positive engineer feedback.

Significance. If the recovered causal graphs accurately reflect the underlying network dependencies, this could represent a significant advance in applying causal discovery to operational RCA in large-scale systems, moving beyond rule-based approaches. The production deployment in over 800 incidents with qualitative feedback from engineers provides real-world evidence of practicality and is a clear strength. The time-aware probabilistic scoring via graph traversal offers interpretability that is valuable in operational settings.

major comments (2)

[Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.
[Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.
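The multiple-testing point is easy to quantify. The sketch below counts the ordered pairs a bivariate procedure would test and applies the Benjamini-Hochberg step-up rule, a standard false discovery rate control. It is offered as one possible control, not as something the paper reports using; the series count and the p-values are made up for illustration.

```python
def ordered_pairs(n_series: int) -> int:
    """Number of directed (cause, effect) pairs among n_series series."""
    return n_series * (n_series - 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes the step-up threshold
    return sorted(order[:cutoff])


n_tests = ordered_pairs(200)              # hypothetical: 200 grouped series
expected_false_hits = int(0.05 * n_tests)  # if every null were true at p < 0.05

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [i for i, p in enumerate(p_values) if p < 0.05]
controlled = benjamini_hochberg(p_values, alpha=0.05)
print(n_tests, expected_false_hits)
print("naive threshold keeps:", naive)
print("Benjamini-Hochberg keeps:", controlled)
```

With 200 series there are 39,800 directed pairs, so an uncorrected 0.05 threshold would admit on the order of two thousand edges by chance alone. On the toy p-values, the correction keeps two of the five edges that the naive threshold accepts.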

minor comments (1)

[Abstract] The abstract states that edge probabilities are 'a function of time lag' but does not specify the functional form or estimation procedure; this should be made explicit with an equation in the main text.
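As an indication of what such an equation could look like, one simple empirical candidate is the fraction of activations of a cause series that are followed by an activation of the effect series exactly τ steps later. This form is an assumption for illustration only; the source does not state the functional form or estimator the authors actually use, and the symbols X_i, X_j, T, and τ are introduced here.

```latex
% Hypothetical estimator, not taken from the paper.
% X_i(t), X_j(t) \in \{0, 1\} are binary series of length T; \tau \ge 0 is the lag.
\hat{p}_{i \to j}(\tau)
  = \frac{\sum_{t=1}^{T-\tau} X_i(t)\, X_j(t+\tau)}
         {\sum_{t=1}^{T-\tau} X_i(t)}
```

Stating the estimator at this level of detail would also settle whether the lag curve is estimated freely at each lag or fitted to a parametric family.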

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our evaluation and methodological controls.

Referee: [Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.

Authors: We agree that the absence of synthetic benchmarks with known causal structure leaves open the possibility that recovered edges capture correlations rather than causation. Our evaluation design prioritizes operational utility, using human expert labels on real production incidents as the relevant ground truth for root-cause identification. The deployment across more than 800 incidents with positive engineer feedback supplies complementary evidence of practical value. In revision we will expand the Evaluation section to explicitly discuss this limitation, justify the chosen metrics, and outline why obtaining interventional or synthetic ground truth is difficult in live cloud networks.

revision: partial

Referee: [Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.

Authors: The spatiotemporal grouping and automation ontology are intended to reduce dimensionality and limit the scope of tested relationships, while subsequent conditional-independence tests prune edges that fail to satisfy independence. We acknowledge that these steps do not constitute a full suite of confounding controls or synthetic recovery benchmarks. In the revised manuscript we will add an explicit subsection in the Method section describing the controls that are present, their limitations, and the rationale for not including synthetic recovery experiments at this stage.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained

The abstract and available description outline a data-driven pipeline: bivariate Granger causality plus conditional independence tests on binary time series to build the graph, followed by a separate probabilistic scoring layer that assigns lag-dependent edge probabilities and traverses the graph for root-cause ranking. Evaluation is performed against an external labeled set of 35 human-annotated incidents, with no quoted equations or procedural steps showing that the reported recall figures (85.7% / 74.3%) are algebraically identical to the fitted parameters or that the probability functions are defined in terms of the evaluation targets themselves. No self-citation chain, uniqueness theorem, or renaming of known results is invoked to close the loop. The central claim therefore retains independent empirical content relative to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters, axioms, or invented entities; the reported performance numbers rest on unstated modeling choices for the probability functions and grouping rules.

pith-pipeline@v0.9.1-grok · 5721 in / 1071 out tokens · 16505 ms · 2026-06-27T05:13:35.348556+00:00 · methodology

