Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks
Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3
The pith
A causal graph built from binary time series recovers the root cause in 85.7% of cloud network incidents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report a three-step method. First, they construct a causal graph from binary time series by applying bivariate Granger causality and conditional independence tests after spatiotemporal grouping. Second, a probabilistic method assigns each edge a conditional probability as a function of time lag. Third, root causes are scored by traversing the graph, which keeps the scores interpretable. On 35 labeled production incidents, the method recalled the correct root cause in 85.7% of cases and matched it exactly in 74.3%.
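One way to picture an edge probability that depends on time lag is as an empirical conditional frequency. The sketch below is a minimal illustration under that assumption: it estimates P(effect at t + lag = 1 | cause at t = 1) from two aligned binary series. The abstract does not give the paper's functional form, so the function name `lagged_conditional_probability` and the toy series are hypothetical.

```python
from typing import Dict, Sequence


def lagged_conditional_probability(
    cause: Sequence[int], effect: Sequence[int], max_lag: int
) -> Dict[int, float]:
    """Estimate P(effect[t + lag] = 1 | cause[t] = 1) for lag = 0..max_lag.

    Assumption: a plain empirical frequency stands in for the paper's
    (unspecified) lag-dependent edge probability.
    """
    if len(cause) != len(effect):
        raise ValueError("series must have equal length")
    probs: Dict[int, float] = {}
    for lag in range(max_lag + 1):
        fired = 0     # steps where the cause is active and t + lag is in range
        followed = 0  # of those, steps where the effect is active at t + lag
        for t in range(len(cause) - lag):
            if cause[t] == 1:
                fired += 1
                followed += effect[t + lag]
        probs[lag] = followed / fired if fired else 0.0
    return probs


# Toy series (hypothetical): the effect tends to follow the cause by one step.
cause = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
effect = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probs = lagged_conditional_probability(cause, effect, max_lag=2)
```

In this toy data the probability mass concentrates at lag 1, which is the kind of time-aware signal a traversal-based scorer could use.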
What carries the argument
The causal graph with time-lag-dependent edge probabilities used for root cause scoring through traversal.
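As a rough illustration of scoring by traversal, the sketch below scores each active alert by the best path probability to the observed symptom, multiplying edge probabilities evaluated at the observed lags. This is an assumed scoring rule, not the paper's: the names `score_root_causes`, `edges`, and `onset`, the rule that skips alerts already explained by an active parent, and all numbers are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# (cause, effect) -> {lag: P(effect fires `lag` steps after cause)}
EdgeProbs = Dict[Tuple[str, str], Dict[int, float]]


def score_root_causes(
    edges: EdgeProbs, onset: Dict[str, int], symptom: str
) -> Dict[str, float]:
    """Score unexplained active alerts by their best path probability to the symptom."""
    children: Dict[str, List[str]] = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)

    def best_path(node: str, visited: FrozenSet[str]) -> float:
        if node == symptom:
            return 1.0
        best = 0.0
        for child in children.get(node, []):
            if child in visited or child not in onset:
                continue  # avoid cycles and skip alerts that never fired
            p = edges[(node, child)].get(onset[child] - onset[node], 0.0)
            if p > 0.0:
                best = max(best, p * best_path(child, visited | {child}))
        return best

    def explained(node: str) -> bool:
        # True if some active parent could have triggered this node at the observed lag.
        return any(
            effect == node and cause in onset
            and by_lag.get(onset[node] - onset[cause], 0.0) > 0.0
            for (cause, effect), by_lag in edges.items()
        )

    return {
        n: best_path(n, frozenset({n}))
        for n in onset
        if n != symptom and not explained(n)
    }


# Hypothetical graph and onset times (in time steps).
edges = {
    ("link_down", "bgp_flap"): {1: 0.8, 2: 0.1},
    ("bgp_flap", "packet_loss"): {1: 0.9},
    ("cpu_high", "packet_loss"): {1: 0.2, 3: 0.05},
}
onset = {"link_down": 0, "bgp_flap": 1, "cpu_high": 0, "packet_loss": 2}
scores = score_root_causes(edges, onset, symptom="packet_loss")
top = max(scores, key=scores.get)
```

Each score decomposes into the edge probabilities along one path, which is one sense in which such scoring stays interpretable.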
If this is right
- The approach reduces the dimensionality of the problem using a spatiotemporal grouping strategy and an automation ontology.
- The probabilistic inference provides time-aware and interpretable scores for potential root causes.
- The system has been successfully deployed and used in over 800 real-world incidents.
- Positive qualitative feedback from network engineers supports its practicality in dynamic environments.
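The dimensionality reduction in the first point can be sketched as collapsing device-level alerts into one binary series per (group, alert type) and time window. This is a minimal illustration only: the `Event` layout, the `device_group` mapping, and the window size are assumptions, and the paper's automation ontology is not modeled here.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp in seconds, device, alert type)


def to_binary_series(
    events: Iterable[Event],
    device_group: Dict[str, str],
    start: int,
    end: int,
    window: int,
) -> Dict[Tuple[str, str], List[int]]:
    """Bucket alerts into fixed windows, one binary series per (group, alert type)."""
    n_bins = (end - start + window - 1) // window
    series: Dict[Tuple[str, str], List[int]] = {}
    for ts, device, alert in events:
        if not (start <= ts < end):
            continue  # ignore events outside the observation range
        key = (device_group[device], alert)
        bins = series.setdefault(key, [0] * n_bins)
        bins[(ts - start) // window] = 1  # 1 = at least one alert in this window
    return series


# Hypothetical events: two switches in one pod, one router in another.
events = [
    (5, "sw1", "link_down"),
    (12, "sw2", "link_down"),
    (61, "sw1", "bgp_flap"),
    (130, "r9", "link_down"),
]
device_group = {"sw1": "pod-a", "sw2": "pod-a", "r9": "pod-b"}
series = to_binary_series(events, device_group, start=0, end=180, window=60)
```

Here four device-level events collapse into three series, since both `pod-a` link alerts fall in the same group and window.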
Where Pith is reading between the lines
- If the recovered causal graph reflects the true process, the method could extend to root cause analysis in other large-scale networked systems like telecommunications or transportation networks.
- The time-lag probabilities might be used to estimate the speed of incident propagation across the network.
- Further refinements to the binary time series encoding could improve the exact match rate in future evaluations.
Load-bearing premise
That the bivariate Granger causality tests and conditional independence checks on the grouped binary time series data recover edge probabilities that reflect the true causal relationships in the cloud network.
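To make the premise concrete, the sketch below computes a single-lag bivariate Granger F statistic in plain Python. It compares a restricted regression of y[t+1] on y[t] with a full regression that also includes x[t]. It assumes ordinary least squares applied directly to 0/1 values (a linear probability model), which may differ from the test the authors use. The helper names and the toy series are hypothetical.

```python
from typing import List, Sequence


def _ols_rss(rows: List[List[float]], y: List[float]) -> float:
    """Residual sum of squares of a least-squares fit, via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def granger_f_lag1(x: Sequence[int], y: Sequence[int]) -> float:
    """F statistic for 'x Granger-causes y' using one lag of each series."""
    target = [float(v) for v in y[1:]]
    restricted = [[1.0, float(y[t])] for t in range(len(y) - 1)]
    full = [[1.0, float(y[t]), float(x[t])] for t in range(len(y) - 1)]
    rss_r = _ols_rss(restricted, target)
    rss_f = _ols_rss(full, target)
    if rss_f <= 1e-12:
        return float("inf")  # the full model fits perfectly
    dof = len(target) - 3  # observations minus parameters in the full model
    return (rss_r - rss_f) / (rss_f / dof)


# Toy series (hypothetical): y mostly copies x with a one-step delay.
x = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
f_xy = granger_f_lag1(x, y)  # does x help predict y?
f_yx = granger_f_lag1(y, x)  # does y help predict x?
```

A larger statistic in one direction only shows predictive precedence between the two series. It does not rule out a shared driver, which is why the premise is load-bearing.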
What would settle it
A new collection of incidents in which the scored root causes agree with independent expert determinations at a rate well below the reported 85.7% would undermine the effectiveness claim.
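How far below 85.7% a new result would need to fall depends on the uncertainty of a 35-incident sample. The sketch below computes a 95% Wilson score interval for both headline rates. The counts 30 and 26 are inferred from the reported percentages (30/35 = 85.7%, 26/35 = 74.3%) and are not stated in the source. The choice of the Wilson interval is also an assumption of this illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_incidents = 35
recalled = 30  # inferred from 85.7% of 35, not stated in the source
exact = 26     # inferred from 74.3% of 35, not stated in the source

recall_pct = round(100 * recalled / n_incidents, 1)
exact_pct = round(100 * exact / n_incidents, 1)
recall_lo, recall_hi = wilson_interval(recalled, n_incidents)
exact_lo, exact_hi = wilson_interval(exact, n_incidents)
```

With only 35 incidents both intervals are wide, so a replication would need a clearly lower agreement rate, or a larger sample, to count against the claim.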
Original abstract
Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a novel graphical causal reasoning method for root cause analysis in cloud networks. It uses spatiotemporal grouping and an automation ontology to reduce dimensionality, builds a causal graph from binary time series using bivariate Granger causality and conditional independence tests, and uses lag-dependent probabilistic edge weights for inference. On a dataset of 35 labeled production incidents, it achieves 85.7% recall of the correct root cause and 74.3% exact match. The system has been deployed in over 800 incidents with positive engineer feedback.
Significance. If the recovered causal graphs accurately reflect the underlying network dependencies, this could represent a significant advance in applying causal discovery to operational RCA in large-scale systems, moving beyond rule-based approaches. The production deployment in over 800 incidents with qualitative feedback from engineers provides real-world evidence of practicality and is a clear strength. The time-aware probabilistic scoring via graph traversal offers interpretability that is valuable in operational settings.
major comments (2)
- [Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.
- [Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.
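For readers unfamiliar with the multiple-testing concern, the sketch below shows one standard control, the Benjamini-Hochberg procedure, applied to a list of per-edge p-values. It illustrates the kind of control the referee has in mind. The manuscript does not report using it, and the p-values here are invented for the example.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return True for each hypothesis rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value is under its step-up threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[i] = True
    return keep


# Hypothetical p-values from five pairwise edge tests.
p_values = [0.001, 0.045, 0.025, 0.2, 0.008]
kept = benjamini_hochberg(p_values, alpha=0.05)
naive = [p <= 0.05 for p in p_values]
```

In this toy case the uncorrected 0.05 threshold keeps four edges while the corrected procedure keeps three. That gap grows with the number of candidate pairs.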
minor comments (1)
- [Abstract] The abstract states that edge probabilities are 'a function of time lag' but does not specify the functional form or estimation procedure; this should be made explicit with an equation in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate planned revisions to strengthen the presentation of our evaluation and methodological controls.
- Referee: [Evaluation] Evaluation section: The headline performance (85.7% recall and 74.3% exact match on 35 incidents) is measured only against human-labeled root causes. No ground-truth validation of the recovered graph structure (e.g., via synthetic benchmarks with known causal skeleton or interventional data) is reported, so it remains possible that the results reflect correlated but non-causal features recovered by the bivariate Granger + CI procedure on binary series.
Authors: We agree that the absence of synthetic benchmarks with known causal structure leaves open the possibility that recovered edges capture correlations rather than causation. Our evaluation design prioritizes operational utility, using human expert labels on real production incidents as the relevant ground truth for root-cause identification. The deployment across more than 800 incidents with positive engineer feedback supplies complementary evidence of practical value. In revision we will expand the Evaluation section to explicitly discuss this limitation, justify the chosen metrics, and outline why obtaining interventional or synthetic ground truth is difficult in live cloud networks.
Revision: partial
- Referee: [Method] Method (causal graph construction): Bivariate Granger causality tests on binary time series after spatiotemporal grouping are used to build the graph. In high-dimensional cloud networks this is vulnerable to confounding and multiple-testing artifacts; the manuscript provides no additional controls or recovery benchmarks to show that the edge set meaningfully reflects the true generative process rather than spurious associations.
Authors: The spatiotemporal grouping and automation ontology are intended to reduce dimensionality and limit the scope of tested relationships, while subsequent conditional-independence tests prune edges that fail to satisfy independence. We acknowledge that these steps do not constitute a full suite of confounding controls or synthetic recovery benchmarks. In the revised manuscript we will add an explicit subsection in the Method section describing the controls that are present, their limitations, and the rationale for not including synthetic recovery experiments at this stage.
Revision: partial
Circularity Check
No significant circularity detected; derivation remains self-contained
The abstract and available description outline a data-driven pipeline: bivariate Granger causality plus conditional independence tests on binary time series to build the graph, followed by a separate probabilistic scoring layer that assigns lag-dependent edge probabilities and traverses the graph for root-cause ranking. Evaluation is performed against an external labeled set of 35 human-annotated incidents, with no quoted equations or procedural steps showing that the reported recall figures (85.7% / 74.3%) are algebraically identical to the fitted parameters or that the probability functions are defined in terms of the evaluation targets themselves. No self-citation chain, uniqueness theorem, or renaming of known results is invoked to close the loop. The central claim therefore retains independent empirical content relative to its inputs.