TORAI: Multi-source Root Cause Analysis for Blind Spots in Microservice Service Call Graph
Pith reviewed 2026-05-10 13:41 UTC · model grok-4.3
The pith
TORAI locates fine-grained root causes in microservice systems with blind spots by clustering services on anomaly severity and ranking causes inside each cluster without using a call graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TORAI is an unsupervised approach that quantifies anomaly severity from multi-source telemetry data, clusters services according to their severity symptom profiles, performs causal analysis within each cluster to produce local rankings, aggregates the rankings across clusters, and uses hypothesis testing to identify the fine-grained root causes. It operates without constructing or relying on a service call graph and without requiring further intrusive instrumentation on blind-spot services.
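To make the order of the stages concrete, here is a minimal toy sketch of the cluster, rank, and aggregate flow. It is not the paper's implementation: the gap-based grouping, the use of anomaly onset time as a stand-in for causal analysis, the service names, and all numbers are assumptions chosen for illustration, and the final hypothesis-testing stage is omitted.

```python
from statistics import mean


def cluster_by_severity(severity, gap=1.0):
    """Group services whose sorted severity scores differ by less than `gap`.

    Hypothetical stand-in for the clustering stage.
    """
    ordered = sorted(severity, key=severity.get, reverse=True)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if severity[prev] - severity[cur] < gap:
            clusters[-1].append(cur)   # close enough: same cluster
        else:
            clusters.append([cur])     # large drop: start a new cluster
    return clusters


def rank_within_cluster(cluster, onset):
    """Hypothetical stand-in for causal analysis: earlier onset ranks higher."""
    return sorted(cluster, key=lambda s: onset[s])


def aggregate(clusters, severity, onset):
    """Concatenate local rankings, most severe cluster first."""
    by_severity = sorted(
        clusters, key=lambda c: mean(severity[s] for s in c), reverse=True
    )
    ranking = []
    for cluster in by_severity:
        ranking.extend(rank_within_cluster(cluster, onset))
    return ranking


# Made-up severity scores and anomaly onset times (seconds) per service.
severity = {"cart": 6.1, "checkout": 5.8, "frontend": 5.5, "ads": 0.4, "email": 0.2}
onset = {"cart": 10, "checkout": 12, "frontend": 13, "ads": 40, "email": 41}

clusters = cluster_by_severity(severity)
ranking = aggregate(clusters, severity, onset)
print(clusters)
print(ranking)
```

No call graph appears anywhere in the sketch, which is the property the core claim emphasizes. Whether the ranking inside a cluster is meaningful depends entirely on what replaces `rank_within_cluster`.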
What carries the argument
Clustering services by shared anomaly severity symptoms, followed by intra-cluster causal analysis, cross-cluster aggregation, and hypothesis testing to produce a final ranking of root causes.
Load-bearing premise
Grouping services by similar anomaly severity patterns and then performing causal analysis inside those groups will surface the true root cause even when the overall service call structure is unknown.
What would settle it
Inject a known root cause into a blind-spot service in one of the benchmark systems, run TORAI, and check whether that service appears in the top-3 recommendations; consistent absence would show the clustering-plus-causal step does not isolate the cause.
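The check described above can be scripted as a top-k hit rate over repeated injections. The sketch below assumes each trial yields a ranked list of service names and the name of the injected service; the rankings shown are made up and do not come from the paper.

```python
def hit_at_k(ranking, true_cause, k=3):
    """True if the injected service appears among the first k recommendations."""
    return true_cause in ranking[:k]


def hit_rate(trials, k=3):
    """Fraction of (ranking, true_cause) trials with a top-k hit."""
    hits = sum(hit_at_k(ranking, cause, k) for ranking, cause in trials)
    return hits / len(trials)


# Hypothetical outputs of four injection runs targeting the same blind-spot service.
trials = [
    (["payment", "cart", "frontend", "ads"], "payment"),
    (["frontend", "cart", "ads", "payment"], "payment"),
    (["cart", "frontend", "payment", "ads"], "payment"),
    (["ads", "frontend", "cart", "payment"], "payment"),
]

rate = hit_rate(trials)
print(rate)
```

A rate that stays near zero across many injections into blind-spot services would be the "consistent absence" that counts against the method.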
Original abstract
Existing multi-source root cause analysis (RCA) methods for microservice systems assume all services have traces to construct a service call graph. However, this assumption is not practical as microservice systems evolve rapidly and may contain blackbox services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. In the presence of blind spots, the performance of existing multi-source RCA methods may be affected, as they only diagnose visible services on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that effectively pinpoints fine-grained root causes without relying on the service call graph. Instead, TORAI first measures anomaly severity using available multi-source telemetry data. It then performs clustering to group services based on their severity symptoms and conducts causal analysis to rank services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. TORAI provides an unsupervised approach that leverages available multi-source telemetry data for RCA without requiring a constructed service call graph or further intrusive actions, thus addressing the limitations of existing methods. Our experiments on three benchmark systems demonstrate that TORAI outperforms state-of-the-art baselines remarkably in the presence of blind spots. Performance on real-world failures further shows that TORAI can accurately pinpoint the root causes in top-3 recommendations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TORAI, an unsupervised multi-source root cause analysis (RCA) method for microservice systems that contain blind spots (services without traces). TORAI measures anomaly severity from available telemetry, clusters services by severity symptom similarity, performs causal analysis to rank services inside each cluster, aggregates the per-cluster rankings, and applies hypothesis testing to output fine-grained root causes. It claims to outperform state-of-the-art baselines on three benchmark systems with simulated blind spots and to achieve accurate top-3 identification on real-world failures, all without constructing or relying on a service call graph.
Significance. If the central technical steps hold, TORAI would address a practical limitation of existing graph-based RCA methods in rapidly evolving microservice deployments where full observability cannot be assumed. The approach of symptom-based clustering plus intra-cluster causal ranking is a plausible way to operate on partial telemetry; successful validation would be a concrete contribution to fault diagnosis under incomplete tracing.
major comments (3)
- [Abstract, §3] Abstract and §3 (method overview): the claim that intra-cluster causal analysis can reliably rank the true root cause above downstream services rests on an unstated assumption that the chosen causal procedure (unspecified in the text) can break symmetry on purely observational severity time series. No propagation model, temporal lag structure, or topological prior is introduced inside a cluster, so any correlation- or independence-based method operates on data that is symmetric between root and effect; this directly undermines the central claim that the pipeline identifies fine-grained root causes without the call graph.
- [§4] §4 (experiments): the reported outperformance on three benchmark systems is presented without ablation of the clustering step versus the causal-ranking step, without error bars or statistical significance tests across runs, and without explicit description of how blind spots were injected or how severity was quantified. These omissions make it impossible to determine whether the claimed gains are attributable to the proposed pipeline or to other factors, weakening the empirical support for the method's robustness.
- [§3.3] §3.3 (aggregation and hypothesis testing): the final aggregation of cluster rankings followed by hypothesis testing is described at a high level but lacks the concrete statistical procedure, multiple-testing correction, or threshold derivation. Because the preceding causal ranking already lacks structural signal, any downstream hypothesis test cannot recover information that was never present in the telemetry.
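The symmetry concern in the first major comment can be illustrated with a small sketch. Contemporaneous Pearson correlation gives the same value in both directions, so it cannot say which series is the cause; a lagged correlation can differ by direction when the effect trails the root. The series below are synthetic, and the lag-based comparison is an illustrative assumption, not the procedure used in the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lagged(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 0."""
    return pearson(x[: len(x) - lag], y[lag:])


# Synthetic severity series: `effect` is `root` delayed by two time steps.
root = [0, 0, 1, 4, 9, 7, 3, 1, 0, 0, 0, 0]
effect = [0, 0] + root[:-2]

same_time_forward = pearson(root, effect)
same_time_backward = pearson(effect, root)
root_leads = lagged(root, effect, 2)    # root now, effect two steps later
effect_leads = lagged(effect, root, 2)  # effect now, root two steps later

print(same_time_forward, same_time_backward)
print(root_leads, effect_leads)
```

Only the lagged comparison distinguishes the two directions here, which is why the report asks what temporal or structural information the intra-cluster step actually uses.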
minor comments (2)
- [§3] Notation for anomaly severity and cluster membership is introduced without a clear mathematical definition or pseudocode; adding an explicit equation or algorithm box would improve reproducibility.
- [§2] The paper cites prior RCA methods but does not compare against recent unsupervised causal-discovery baselines that also operate on multivariate time series; a brief discussion of why those are not applicable would strengthen the related-work section.
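As an example of the kind of explicit definition the first minor comment asks for, the sketch below scores one metric by its largest robust z-score in the anomalous window relative to a normal-period baseline (median and median absolute deviation). This particular formula is an assumption made for illustration; the review does not state how the paper defines severity.

```python
from statistics import median


def severity(normal, anomalous):
    """Largest robust z-score of the anomalous window against the normal baseline.

    Hypothetical definition: |value - median| / MAD, maximized over the window.
    """
    med = median(normal)
    # Median absolute deviation; fall back to a tiny value if the baseline is flat.
    mad = median(abs(v - med) for v in normal) or 1e-9
    return max(abs(v - med) / mad for v in anomalous)


# Made-up latency samples (ms) before and during a failure.
normal_latency = [100, 102, 98, 101, 99]
failure_latency = [100, 250, 400]

score = severity(normal_latency, failure_latency)
print(score)
```

Writing the definition at this level of detail, per telemetry source, is what would make the clustering input reproducible.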
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We provide point-by-point responses to the major comments below and commit to revisions that address the raised issues.
-
Referee: [Abstract, §3] Abstract and §3 (method overview): the claim that intra-cluster causal analysis can reliably rank the true root cause above downstream services rests on an unstated assumption that the chosen causal procedure (unspecified in the text) can break symmetry on purely observational severity time series. No propagation model, temporal lag structure, or topological prior is introduced inside a cluster, so any correlation- or independence-based method operates on data that is symmetric between root and effect; this directly undermines the central claim that the pipeline identifies fine-grained root causes without the call graph.
Authors: We appreciate this observation and agree that additional details are needed. In the revised version, we will elaborate on the causal analysis method in §3, explaining the specific procedure employed and how it utilizes temporal information from the severity time series to infer causal directions and rank the root cause higher than downstream services. This will clarify the mechanism by which symmetry is broken without relying on a service call graph. revision: yes
-
Referee: [§4] §4 (experiments): the reported outperformance on three benchmark systems is presented without ablation of the clustering step versus the causal-ranking step, without error bars or statistical significance tests across runs, and without explicit description of how blind spots were injected or how severity was quantified. These omissions make it impossible to determine whether the claimed gains are attributable to the proposed pipeline or to other factors, weakening the empirical support for the method's robustness.
Authors: We acknowledge the validity of these criticisms regarding the experimental presentation. We will revise §4 to include ablation studies comparing the full pipeline against variants without clustering or without causal ranking, report results with error bars from multiple independent runs along with statistical significance tests, and provide explicit details on the blind spot simulation process and the severity quantification formulas used for each telemetry source. revision: yes
-
Referee: [§3.3] §3.3 (aggregation and hypothesis testing): the final aggregation of cluster rankings followed by hypothesis testing is described at a high level but lacks the concrete statistical procedure, multiple-testing correction, or threshold derivation. Because the preceding causal ranking already lacks structural signal, any downstream hypothesis test cannot recover information that was never present in the telemetry.
Authors: We agree that the description in §3.3 is at a high level and will expand it in the revision to include the precise aggregation algorithm, the hypothesis testing procedure with details on test statistics and p-values, the multiple-testing correction method, and the derivation of any thresholds. We will also explicitly link this to the causal ranking step, noting that the temporal causality analysis within clusters does introduce directional information that the aggregation and testing build upon to identify the fine-grained root causes. revision: yes
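One common choice for the multiple-testing correction mentioned in this exchange is the Benjamini-Hochberg procedure, sketched below. It is shown only as an example of the detail being requested; the review does not say which correction the authors will adopt, and the metric names and p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the names whose null hypotheses are rejected at FDR level alpha."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    cutoff = 0
    for rank, (_, p) in enumerate(ordered, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p <= alpha * rank / m:
            cutoff = rank
    return [name for name, _ in ordered[:cutoff]]


# Hypothetical per-candidate p-values from a test of "this metric is anomalous".
p_values = {
    "cart_cpu": 0.001,
    "cart_latency": 0.012,
    "ads_mem": 0.20,
    "email_logs": 0.04,
}

flagged = benjamini_hochberg(p_values)
print(flagged)
```

Here `email_logs` passes an uncorrected 0.05 threshold but is not flagged after correction, which shows why the report asks for the correction and threshold to be stated explicitly.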
Circularity Check
No circularity: TORAI is a procedural pipeline of standard statistical steps with no derivations or self-referential reductions.
The paper describes TORAI as a sequence of standard operations—measuring anomaly severity from multi-source telemetry, clustering services by symptom similarity, performing causal analysis within clusters, aggregating rankings, and applying hypothesis testing—without any equations, fitted parameters presented as predictions, or first-principles derivations. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core logic; the method is presented as an unsupervised combination of existing techniques. The central claims rest on empirical outperformance rather than any chain that reduces to its own inputs by construction, rendering the approach self-contained.
Received 2026-02-16; accepted 2026-03-24. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE130. Publication date: July 2026.