TORAI: Multi-source Root Cause Analysis for Blind Spots in Microservice Service Call Graph

Hongyu Zhang; Huong Ha; Luan Pham; Xiuzhen Zhang

arxiv: 2604.13522 · v2 · submitted 2026-04-15 · 💻 cs.SE

Pith reviewed 2026-05-10 13:41 UTC · model grok-4.3

classification 💻 cs.SE

keywords root cause analysis · microservices · blind spots · anomaly severity · causal analysis · clustering · multi-source telemetry · unsupervised diagnosis

The pith

TORAI locates fine-grained root causes in microservice systems with blind spots by clustering services on anomaly severity and ranking causes inside each cluster without using a call graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TORAI as a method to find the root causes of failures in microservice architectures when some services lack traces and therefore create blind spots that break conventional call-graph approaches. It measures anomaly severity from available multi-source telemetry, groups services that exhibit similar severity patterns through clustering, conducts causal analysis to rank services inside each group, aggregates those rankings, and applies hypothesis testing to select the most likely causes. This matters because microservice systems evolve quickly and often include black-box or unsupported components that prevent full graph construction, leaving existing tools unable to diagnose the invisible parts. If the method works as described, diagnosis can proceed using only the data already collected, without adding tracing to every service or assuming complete visibility.

Core claim

TORAI is an unsupervised approach that quantifies anomaly severity from multi-source telemetry data, clusters services according to their severity symptom profiles, performs causal analysis within each cluster to produce local rankings, aggregates the rankings across clusters, and uses hypothesis testing to identify the fine-grained root causes. It operates without constructing or relying on a service call graph and without requiring further intrusive instrumentation on blind-spot services.
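The first two stages of the claim (severity scoring, then grouping by severity) can be illustrated with a minimal sketch. This is not the paper's procedure: the robust z-score in `severity`, the ratio-based grouping in `cluster_by_gap`, the `gap` threshold, and the sample telemetry are all assumptions chosen only to make the idea concrete.

```python
from statistics import median

def severity(baseline, incident):
    """Assumed severity: largest incident deviation from the baseline median,
    scaled by the baseline's median absolute deviation (MAD)."""
    med = median(baseline)
    mad = median(abs(x - med) for x in baseline)
    scale = mad if mad > 0 else 1.0  # fall back to 1.0 for a flat baseline
    return max(abs(x - med) for x in incident) / scale

def cluster_by_gap(scores, gap=2.0):
    """Assumed grouping: sort by severity and open a new cluster whenever
    consecutive severities differ by more than a factor of `gap`."""
    if not scores:
        return []
    ordered = sorted(scores, key=scores.get, reverse=True)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if scores[prev] / max(scores[cur], 1e-9) > gap:
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters

# Hypothetical telemetry: (baseline window, incident window) per service.
telemetry = {
    "cart": ([10, 11, 10, 12, 11], [50, 55, 60]),
    "checkout": ([20, 21, 19, 20, 22], [48, 52, 50]),
    "ads": ([5, 6, 5, 7, 6], [6, 7, 6]),
}
scores = {name: severity(base, inc) for name, (base, inc) in telemetry.items()}
clusters = cluster_by_gap(scores)
print(scores)
print(clusters)
```

Note that neither function looks at which service calls which; a blind-spot service takes part as long as it emits some metric or log-derived series.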

What carries the argument

Clustering services by shared anomaly severity symptoms, followed by intra-cluster causal analysis, cross-cluster aggregation, and hypothesis testing to produce a final ranking of root causes.
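One way to picture the cross-cluster aggregation step is to weight each service's within-cluster score by the severity of its cluster and then sort globally. The sketch below assumes exactly that weighting; the source does not state the aggregation rule, so the function `aggregate`, the multiplicative weighting, and the sample numbers are hypothetical.

```python
def aggregate(cluster_rankings):
    """Merge per-cluster rankings into one global ranking.

    cluster_rankings: list of (cluster_severity, [(service, local_score), ...]).
    Assumed rule: global score = cluster_severity * local_score.
    """
    merged = []
    for cluster_severity, ranking in cluster_rankings:
        for service, local_score in ranking:
            merged.append((cluster_severity * local_score, service))
    # Highest combined score first.
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return [service for _, service in merged]

# Hypothetical output of the intra-cluster causal ranking.
cluster_rankings = [
    (40.5, [("cart", 0.9), ("checkout", 0.4)]),  # high-severity cluster
    (1.0, [("ads", 0.8)]),                       # low-severity cluster
]
global_ranking = aggregate(cluster_rankings)
print(global_ranking)
```

Under this assumed rule a service in a mild cluster cannot outrank one in a severe cluster unless its local score is much larger, which is one reason the quality of the clustering matters for the final ranking.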

Load-bearing premise

Grouping services by similar anomaly severity patterns and then performing causal analysis inside those groups will surface the true root cause even when the overall service call structure is unknown.

What would settle it

Inject a known root cause into a blind-spot service in one of the benchmark systems, run TORAI, and check whether that service appears in the top-3 recommendations; consistent absence would show the clustering-plus-causal step does not isolate the cause.
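The check described above amounts to a hit-at-k measurement over repeated fault injections. A minimal sketch follows, assuming each trial yields a ranked list of services plus the name of the injected service; the helper names and the sample trials are illustrative and not taken from the paper.

```python
from statistics import mean, stdev

def hit_at_k(ranking, injected, k=3):
    """True if the injected service is among the top-k recommendations."""
    return injected in ranking[:k]

def summarize(trials, k=3):
    """Return (hit rate, sample standard deviation) over injection trials.

    trials: list of (ranking, injected_service) pairs.
    """
    hits = [1.0 if hit_at_k(ranking, injected, k) else 0.0
            for ranking, injected in trials]
    spread = stdev(hits) if len(hits) > 1 else 0.0
    return mean(hits), spread

# Hypothetical trials: the blind-spot service "payment" is the injected cause.
trials = [
    (["payment", "cart", "ads", "checkout"], "payment"),
    (["cart", "payment", "ads", "checkout"], "payment"),
    (["cart", "ads", "payment", "checkout"], "payment"),
    (["cart", "ads", "checkout", "payment"], "payment"),  # miss at k=3
]
rate, spread = summarize(trials, k=3)
print(f"hit@3 = {rate:.2f} +/- {spread:.2f}")
```

Reporting the spread alongside the rate is what would distinguish "consistent absence" from a single unlucky run.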

Figures

Figures reproduced from arXiv: 2604.13522 by Hongyu Zhang, Huong Ha, Luan Pham, Xiuzhen Zhang.

**Figure 2.** Overview of TORAI. (A) TORAI transforms telemetry data into time series. (B) It computes anomaly …

**Figure 3.** CausalRanker analyses the multi-source time series data of all services within each severity group to construct a causal graph and identify the root causes. (ErrLog: Error Logs, TraceLat: Latency extracted from Traces). Once the optimal number of clusters is determined, SymptomCluster performs clustering again using this value. For each cluster, SymptomCluster measures the cluster severity score as the…

**Figure 5.** Illustration of our setup for the microservice systems and …

**Figure 6.** Sensitivity analysis of TORAI performance under varying …

**Figure 7.** Stack traces as Fine-grained root cause of the …

**Figure 8.** The frequency of normal logs/stack traces of cartservice.

Original abstract

Existing multi-source root cause analysis (RCA) methods for microservice systems assume all services have traces to construct a service call graph. However, this assumption is not practical as microservice systems evolve rapidly and may contain blackbox services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. In the presence of blind spots, the performance of existing multi-source RCA methods may be affected, as they only diagnose visible services on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that effectively pinpoints fine-grained root causes without relying on the service call graph. Instead, TORAI first measures anomaly severity using available multi-source telemetry data. It then performs clustering to group services based on their severity symptoms and conducts causal analysis to rank services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. TORAI provides an unsupervised approach that leverages available multi-source telemetry data for RCA without requiring a constructed service call graph or further intrusive actions, thus addressing the limitations of existing methods. Our experiments on three benchmark systems demonstrate that TORAI outperforms state-of-the-art baselines remarkably in the presence of blind spots. Performance on real-world failures further shows that TORAI can accurately pinpoint the root causes in top-3 recommendations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TORAI tries to bypass call graphs in microservice RCA by clustering services on anomaly severity then ranking causally inside clusters, but the causal step has no clear way to break symmetry between roots and downstream effects.

The main takeaway is that TORAI claims to locate fine-grained root causes in microservices even when blind spots prevent building a service call graph. It does this by measuring severity from multi-source telemetry, clustering services with similar symptoms, running causal analysis within each cluster, aggregating the rankings, and applying hypothesis testing. The combination is new relative to prior graph-dependent methods, and the paper correctly flags a practical limitation in real systems where some services are black-box or untraced. That framing is useful and the unsupervised angle avoids extra instrumentation. The approach is straightforward on paper and could appeal to teams that already collect logs, metrics, and traces but cannot rely on complete topology.

The soft spot is exactly the one the stress-test flags: once services land in the same cluster because they show comparable severity, standard causal procedures on observational data have no topological or temporal signal to rank the true root above services that are merely affected. The abstract gives no propagation model, no specific causal algorithm, and no detail on how hypothesis testing recovers the missing information. Experiments are said to show gains on three benchmarks and real failures, yet without ablations, error bars, or controls for cluster quality the results stay hard to interpret.

This work is aimed at practitioners and researchers who build diagnostic tools for cloud systems with incomplete monitoring. A reader looking for concrete ideas on graph-free RCA could extract the high-level pipeline, but anyone wanting to implement or extend it would need the missing method specifics. The paper deserves peer review because the problem is real and the proposed direction is worth testing, though the causal justification and experimental reporting will need substantial strengthening before the claims can be trusted.

Referee Report

3 major / 2 minor

Summary. The paper proposes TORAI, an unsupervised multi-source root cause analysis (RCA) method for microservice systems that contain blind spots (services without traces). TORAI measures anomaly severity from available telemetry, clusters services by severity symptom similarity, performs causal analysis to rank services inside each cluster, aggregates the per-cluster rankings, and applies hypothesis testing to output fine-grained root causes. It claims to outperform state-of-the-art baselines on three benchmark systems with simulated blind spots and to achieve accurate top-3 identification on real-world failures, all without constructing or relying on a service call graph.

Significance. If the central technical steps hold, TORAI would address a practical limitation of existing graph-based RCA methods in rapidly evolving microservice deployments where full observability cannot be assumed. The approach of symptom-based clustering plus intra-cluster causal ranking is a plausible way to operate on partial telemetry; successful validation would be a concrete contribution to fault diagnosis under incomplete tracing.

major comments (3)

[Abstract, §3] Abstract and §3 (method overview): the claim that intra-cluster causal analysis can reliably rank the true root cause above downstream services rests on an unstated assumption that the chosen causal procedure (unspecified in the text) can break symmetry on purely observational severity time series. No propagation model, temporal lag structure, or topological prior is introduced inside a cluster, so any correlation- or independence-based method operates on data that is symmetric between root and effect; this directly undermines the central claim that the pipeline identifies fine-grained root causes without the call graph.
[§4] §4 (experiments): the reported outperformance on three benchmark systems is presented without ablation of the clustering step versus the causal-ranking step, without error bars or statistical significance tests across runs, and without explicit description of how blind spots were injected or how severity was quantified. These omissions make it impossible to determine whether the claimed gains are attributable to the proposed pipeline or to other factors, weakening the empirical support for the method's robustness.
[§3.3] §3.3 (aggregation and hypothesis testing): the final aggregation of cluster rankings followed by hypothesis testing is described at a high level but lacks the concrete statistical procedure, multiple-testing correction, or threshold derivation. Because the preceding causal ranking already lacks structural signal, any downstream hypothesis test cannot recover information that was never present in the telemetry.
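The symmetry concern in the first major comment can be made concrete with a small sketch. Zero-lag Pearson correlation gives the same value in both directions, so it cannot order a root and its downstream effect; a lagged correlation can, but only when the propagation delay is visible at the sampling interval. The two series and the one-step delay below are invented for illustration and say nothing about which causal procedure the paper actually uses.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation of two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def lagged(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 1."""
    return pearson(x[:-lag], y[lag:])

# Hypothetical severity series: `effect` repeats `root` one step later.
root = [0, 0, 1, 5, 9, 4, 1, 0, 0, 0]
effect = [0] + root[:-1]

same_time_xy = pearson(root, effect)
same_time_yx = pearson(effect, root)  # identical: no direction information
forward = lagged(root, effect, 1)     # root leads effect: near 1
backward = lagged(effect, root, 1)    # effect leads root: much weaker

print(same_time_xy, same_time_yx)
print(forward, backward)
```

If both services degrade within the same sampling interval, `forward` and `backward` become indistinguishable as well, which is the case the comment asks the authors to address.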

minor comments (2)

[§3] Notation for anomaly severity and cluster membership is introduced without a clear mathematical definition or pseudocode; adding an explicit equation or algorithm box would improve reproducibility.
[§2] The paper cites prior RCA methods but does not compare against recent unsupervised causal-discovery baselines that also operate on multivariate time series; a brief discussion of why those are not applicable would strengthen the related-work section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We provide point-by-point responses to the major comments below and commit to revisions that address the raised issues.

Referee: [Abstract, §3] Abstract and §3 (method overview): the claim that intra-cluster causal analysis can reliably rank the true root cause above downstream services rests on an unstated assumption that the chosen causal procedure (unspecified in the text) can break symmetry on purely observational severity time series. No propagation model, temporal lag structure, or topological prior is introduced inside a cluster, so any correlation- or independence-based method operates on data that is symmetric between root and effect; this directly undermines the central claim that the pipeline identifies fine-grained root causes without the call graph.

Authors: We appreciate this observation and agree that additional details are needed. In the revised version, we will elaborate on the causal analysis method in §3, explaining the specific procedure employed and how it utilizes temporal information from the severity time series to infer causal directions and rank the root cause higher than downstream services. This will clarify the mechanism by which symmetry is broken without relying on a service call graph.

revision: yes

Referee: [§4] §4 (experiments): the reported outperformance on three benchmark systems is presented without ablation of the clustering step versus the causal-ranking step, without error bars or statistical significance tests across runs, and without explicit description of how blind spots were injected or how severity was quantified. These omissions make it impossible to determine whether the claimed gains are attributable to the proposed pipeline or to other factors, weakening the empirical support for the method's robustness.

Authors: We acknowledge the validity of these criticisms regarding the experimental presentation. We will revise §4 to include ablation studies comparing the full pipeline against variants without clustering or without causal ranking, report results with error bars from multiple independent runs along with statistical significance tests, and provide explicit details on the blind spot simulation process and the severity quantification formulas used for each telemetry source.

revision: yes

Referee: [§3.3] §3.3 (aggregation and hypothesis testing): the final aggregation of cluster rankings followed by hypothesis testing is described at a high level but lacks the concrete statistical procedure, multiple-testing correction, or threshold derivation. Because the preceding causal ranking already lacks structural signal, any downstream hypothesis test cannot recover information that was never present in the telemetry.

Authors: We agree that the description in §3.3 is at a high level and will expand it in the revision to include the precise aggregation algorithm, the hypothesis testing procedure with details on test statistics and p-values, the multiple-testing correction method, and the derivation of any thresholds. We will also explicitly link this to the causal ranking step, noting that the temporal causality analysis within clusters does introduce directional information that the aggregation and testing build upon to identify the fine-grained root causes.

revision: yes
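For readers unfamiliar with the multiple-testing correction requested in the third exchange, the sketch below shows one standard option, the Benjamini-Hochberg step-up procedure, applied to per-indicator p-values. Neither the paper nor the rebuttal names the correction that will be used, so the choice of procedure, the indicator names, and the p-values are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the names rejected at false-discovery rate `alpha`.

    p_values: dict mapping an indicator name to its p-value.
    """
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    cutoff = 0
    # Find the largest rank whose p-value is under its step-up threshold.
    for rank, (_, p) in enumerate(ordered, start=1):
        if p <= alpha * rank / m:
            cutoff = rank
    return {name for name, _ in ordered[:cutoff]}

# Hypothetical p-values for candidate fine-grained root-cause indicators.
p_values = {
    "cart.cpu": 0.001,
    "cart.errlog": 0.012,
    "checkout.latency": 0.04,
    "ads.memory": 0.6,
}
flagged = benjamini_hochberg(p_values, alpha=0.05)
print(sorted(flagged))
```

Here `checkout.latency` would pass an uncorrected 0.05 threshold but is not flagged after correction, which is the kind of difference the referee's request is meant to expose.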

Circularity Check

0 steps flagged

No circularity: TORAI is a procedural pipeline of standard statistical steps with no derivations or self-referential reductions

The paper describes TORAI as a sequence of standard operations—measuring anomaly severity from multi-source telemetry, clustering services by symptom similarity, performing causal analysis within clusters, aggregating rankings, and applying hypothesis testing—without any equations, fitted parameters presented as predictions, or first-principles derivations. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core logic; the method is presented as an unsupervised combination of existing techniques. The central claims rest on empirical outperformance rather than any chain that reduces to its own inputs by construction, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard unsupervised clustering and causal discovery techniques are sufficient without graph structure.

[70]

Zhouruixing Zhu, Cheryl Lee, Xiaoying Tang, and Pinjia He. 2024. HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources.ACM Transactions on Software Engineering and Methodology(2024). Received 2026-02-16; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE130. Publication date: July 2026

work page 2024

[1] 2025. Container Advisor - an open-source tool to monitor containers. Retrieved Apr 14, 2026 from https://github.com/google/cadvisor

[2] 2025. Elasticsearch Log Monitoring - Centralized Log Management. Retrieved Apr 14, 2026 from https://www.elastic.co/

[3] 2025. Gaussian Mixture Model. Retrieved Apr 14, 2026 from https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

[4] 2025. Google - Site Reliability Engineering. Retrieved Apr 14, 2026 from https://sre.google/sre-book/monitoring-distributed-systems/

[5] 2025. Grafana Loki - A horizontally scalable, highly available, multi-tenant log aggregation system. Retrieved Apr 14, 2026 from https://grafana.com/oss/loki/

[6] 2025. The Istio service mesh. Retrieved Apr 14, 2026 from https://istio.io/

[7] 2025. Jaeger: open source, distributed tracing platform. Retrieved Apr 14, 2026 from https://www.jaegertracing.io/

[8] 2025. A lightweight, ultra-fast tool for building observability pipelines. Retrieved Apr 14, 2026 from https://vector.dev/

[9] 2025. Online Boutique is a cloud-first microservices demo application. Retrieved Apr 14, 2026 from https://github.com/GoogleCloudPlatform/microservices-demo

[10] 2025. An open-source monitoring and alerting toolkit. Retrieved Apr 14, 2026 from https://prometheus.io/

[11] 2025. Sock Shop - A Microservices Demo Application. Retrieved Apr 14, 2026 from https://github.com/microservices-demo/microservices-demo

[12] 2025. Stress test for Computer system. Retrieved Apr 14, 2026 from https://manpages.ubuntu.com/manpages/bionic/man1/stress-ng.1.html

[13] 2025. Traffic Control. Retrieved Apr 14, 2026 from https://man7.org/linux/man-pages/man8/tc.8.html

[14] 2025. Train Ticket Benchmark System. Retrieved Apr 14, 2026 from https://github.com/FudanSELab/train-ticket

[15] 2025. Uber Microservice Systems. Retrieved Apr 14, 2026 from https://www.uber.com/en-AU/blog/up-portable-microservices-ready-for-the-cloud

[16] Sachin Ashok, Vipul Harsh, Brighten Godfrey, Radhika Mittal, Srinivasan Parthasarathy, and Larisa Shwartz. 2024. TraceWeaver: Distributed Request Tracing for Microservices Without Application Modification. In Proceedings of the ACM SIGCOMM 2024 Conference. 828–842

[17] Algirdas Avizienis, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing 1, 1 (2004), 11–33

[18] Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti. 2019. How bad can a bug get? an empirical analysis of software failures in the openstack cloud computing platform. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineerin...

[19] Luca Giamattei, Antonio Guerriero, Roberto Pietrantuono, Stefano Russo, Ivano Malavolta, Tanjina Islam, Madalina Dinga, Anne Koziolek, Snigdha Singh, Martin Armbruster, et al. 2024. Monitoring tools for DevOps and microservices: A systematic grey literature review. Journal of Systems and Software 208 (2024), 111906

[20] Shenghui Gu, Guoping Rong, Tian Ren, He Zhang, Haifeng Shen, Yongda Yu, Xian Li, Jian Ouyang, and Chunan Chen. 2023. TrinityRCL: Multi-Granular and Code-Level Root Cause Localization Using Multiple Types of Telemetry Data in Microservice Systems. IEEE Transactions on Software Engineering 49, 5 (2023), 3071–3088

[22] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE international conference on web services (ICWS). IEEE, 33–40

[23] Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R Lyu. 2021. A survey on automated log analysis for reliability engineering. ACM computing surveys (CSUR) 54, 6 (2021), 1–37

[24] Zilong He, Pengfei Chen, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, and Fangyuan Li. 2022. Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE’22). 1–13

[25] Chuanjia Hou, Tong Jia, Yifan Wu, Ying Li, and Jing Han. 2021. Diagnosing performance issues in microservices with heterogeneous data source. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). ...

[26] Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. 2022. Root Cause Analysis of Failures in Microservices through Causal Discovery. In Advances in Neural Information Processing Systems (NeurIPS’22), Vol. 35. 31158–31170

[27] Amin Jaber, Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. 2020. Causal Discovery from Soft Interventions with Unknown Targets: Characterization and Learning. In Advances in Neural Information Processing Systems (NeurIPS’20), Vol. 33. 9551–9561

[28] Andrea Janes, Xiaozhou Li, and Valentina Lenarduzzi. 2023. Open tracing tools: Overview and critical comparison. Journal of Systems and Software (2023), 111793

[29] Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R Lyu. 2023. Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1750–1762

[30] Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2022. Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’22). 3230–3240

[31] Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, and Dan Pei. 2021. Practical Root Cause Localization for Microservice Systems via Trace Analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS’21). 1–10

[32] Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqing Duan, and Dan Pei. 2022. Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems. In Proceedings of the 30th ACM Joint Meeting on European Software Engineering Confe...

[33] Jinjin Lin, Pengfei Chen, and Zibin Zheng. 2018. Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments. In International Conference on Service-Oriented Computing. Springer, 3–20

[34] Chenghao Liu, Wenzhuo Yang, Himanshu Mittal, Manpreet Singh, Doyen Sahoo, and Steven CH Hoi. 2023. PyRCA: A Library for Metric-based Root Cause Analysis. arXiv preprint arXiv:2306.11417 (2023)

[35] Fengrui Liu, Yang Wang, Zhenyu Li, Rui Ren, Hongtao Guan, Xian Yu, Xiaofan Chen, and Gaogang Xie. 2022. MicroCBR: Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting. In International Conference on Case-Based Reasoning. Springer, 224–239

[36] Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Jian He, and Chengzhong Xu. 2022. An in-depth study of microservice call graph and runtime performance. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 3901–3914

[37] Leonardo Mariani, Cristina Monni, Mauro Pezzé, Oliviero Riganelli, and Rui Xin. 2018. Localizing faults in cloud systems. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 262–273

[38] Yuan Meng, Shenglin Zhang, Yongqian Sun, Ruru Zhang, Zhilong Hu, Yiyin Zhang, Chenyang Jia, Zhaogang Wang, and Dan Pei. 2020. Localizing Failure Root Causes in a Microservice through Causality Inference. In IEEE/ACM 28th International Symposium on Quality of Service (IWQoS’20). 1–10

[39] Odigos. 2025. Solving the Pitfalls of Distributed Tracing in Real-World Microservices. https://odigos.io/blog/solving-pitfalls-of-distributed-tracing-in-real-world-microservices. Accessed on September 2, 2025

[40] William Roy Orchard, Nastaran Okati, Sergio Hernan Garrido Mejia, Patrick Blöbaum, and Dominik Janzing. 2025. Root Cause Analysis of Outliers with Missing Structural Knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=7Nxq4RQApu

[41] Eva Patel and Dharmender Singh Kushwaha. 2020. Clustering cloud workloads: K-means vs gaussian mixture model. Procedia computer science 171 (2020), 158–167

[42] Luan Pham. 2026. Datasets for "TORAI: Multi-Source Root Cause Analysis for Blind Spots in Microservice Service Call Graph". (4 2026). doi:10.6084/m9.figshare.31925976.v1

[43] Luan Pham. 2026. Graph-Free Root Cause Analysis. arXiv preprint arXiv:2601.21359 (2026)

[44] Luan Pham. 2026. Source code of "TORAI: Multi-Source Root Cause Analysis for Blind Spots in Microservice Service Call Graph". (4 2026). doi:10.6084/m9.figshare.31938495.v1

[45] Luan Pham, Huong Ha, and Hongyu Zhang. 2024. BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2214–2237

[46] Luan Pham, Huong Ha, and Hongyu Zhang. 2024. Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?. In The 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

[47] Luan Pham, Victor Nicolet, Joey Dodds, Hui Guan, and Daniel Kroening. 2026. EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems. Proceedings of the ACM on Software Engineering 3, FSE (2026)

[48] Luan Pham, Hongyu Zhang, Huong Ha, Flora Salim, and Xiuzhen Zhang. 2025. RCAEval: a benchmark for root cause analysis of microservice systems with telemetry data. In Companion Proceedings of the ACM on Web Conference 2025. 777–780

[49] Raphael Rouf, Mohammadreza Rasolroveicy, Marin Litoiu, Seema Nagar, Prateeti Mohapatra, Pranjal Gupta, and Ian Watts. 2024. InstantOps: A Joint Approach to System Failure Prediction and Root Cause Identification in Microserivces Cloud-Native Applications. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering

[50] Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. 2019. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances 5, 11 (2019)

[51] Gideon Schwarz. 1978. Estimating the Dimension of a Model. The Annals of Statistics 6, 2 (1978), 461–464

[52] Huasong Shan, Yuan Chen, Haifeng Liu, Yunpeng Zhang, Xiao Xiao, Xiaofeng He, Min Li, and Wei Ding. 2019. 𝜖-Diagnosis: Unsupervised and Real-Time Diagnosis of Small-Window Long-Tail Latency in Large-Scale Microservice Platforms. In The World Wide Web Conference (WWW’19). 3215–3222

[53] Junxian Shen, Han Zhang, Yang Xiang, Xingang Shi, Xinrui Li, Yunxi Shen, Zijian Zhang, Yongxiang Wu, Xia Yin, Jilong Wang, et al. 2023. Network-centric distributed tracing with DeepFlow: Troubleshooting your microservices in zero code. In Proceedings of the ACM SIGCOMM 2023 Conference. 420–437

[54] Jacopo Soldani and Antonio Brogi. 2022. Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey. Comput. Surveys 55, 3 (2022)

[55] Peter Spirtes, Christopher Meek, and Thomas Richardson. 1995. Causal Inference in the Presence of Latent Variables and Selection Bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI’95). 499–506

[56] Lingzhi Wang, Nengwen Zhao, Junjie Chen, Pinnong Li, Wenchi Zhang, and Kaixin Sui. 2020. Root-cause metric location for microservice systems via log anomaly detection. In 2020 IEEE international conference on web services (ICWS). IEEE, 142–150

[57] Qing Wang, Larisa Shwartz, Genady Ya. Grabarnik, Vijay Arya, and Karthikeyan Shanmugam. 2021. Detecting Causal Structure on Cloud Application Microservices Using Granger Causality Models. In IEEE 14th International Conference on Cloud Computing (CLOUD’21). 558–565

[58] Li Wu. 2022. Automatic performance diagnosis and recovery in cloud microservices. TU Berlin (Germany)

[59] Li Wu, Johan Tordsson, Jasmin Bogatinovski, Erik Elmroth, and Odej Kao. 2021. Microdiag: Fine-grained performance diagnosis for microservice systems. In 2021 IEEE/ACM International Workshop on Cloud Intelligence. IEEE, 31–36

[60] Shuaiyu Xie, Jian Wang, Hanbin He, Zhihao Wang, Yuqi Zhao, Neng Zhang, and Bing Li. 2026. TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework for Microservice-based Systems with Multimodal Data. ACM Trans. Softw. Eng. Methodol. 35, 2, Article 40 (2026), 39 pages. doi:10.1145/3734868

[61] Ruyue Xin, Peng Chen, and Zhiming Zhao. 2023. CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications. Journal of Systems and Software 203 (2023), 111724

[62] Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. 2021. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. In Proceedings of the Web Conference (WWW’21). 3087–3098

[63] Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. 2023. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 553–565

[64] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML’19), Vol. 97. 7154–7163

[65] Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, and Jonathan Mace. 2023. The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 321–339

[66] Shenglin Zhang, Pengxiang Jin, Zihan Lin, Yongqian Sun, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, Wa Jin, et al. 2023. Robust failure diagnosis of microservice system through multimodal data. IEEE Transactions on Services Computing (2023)

[67] Yingying Zhang, Zhengxiong Guan, Huajie Qian, Leili Xu, Hengbo Liu, Qingsong Wen, Liang Sun, Junwei Jiang, Lunting Fan, and Min Ke. 2021. CloudRCA: A root cause analysis framework for cloud computing platforms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4373–4382

[68] Lecheng Zheng, Zhengzhang Chen, Jingrui He, and Haifeng Chen. 2024. MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. In Proceedings of the ACM on Web Conference 2024. 4107–4116

[69] Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2018. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study. IEEE Transactions on Software Engineering 47, 2 (2018), 243–260

[70] Zhouruixing Zhu, Cheryl Lee, Xiaoying Tang, and Pinjia He. 2024. HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology (2024).

Received 2026-02-16; accepted 2026-03-24

Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE130. Publication date: July 2026