Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?
Pith reviewed 2026-05-07 13:21 UTC · model grok-4.3
The pith
Entity-level heterogeneity in microservices produces asymmetric fault propagation dominated by service-host cross-layer interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. NexusRCL internalizes these patterns by formalizing services and hosts as distinct node types within a heterogeneous graph, paired with an event-based abstraction mechanism, to capture both data-level and entity-level heterogeneity while lowering labeling costs through active learning.
What carries the argument
A heterogeneous graph that represents services and hosts as separate node types, together with event-based abstraction to encode fault propagation.
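A minimal Python sketch of what such a typed graph could look like. The class names (`Node`, `HeteroGraph`), the relation labels (`calls`, `runs_on`, `hosts`), and the toy topology are assumptions made for illustration; the abstract does not specify NexusRCL's actual schema or edge semantics.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "service" or "host" (hypothetical labels)


@dataclass
class HeteroGraph:
    # Edges are grouped by (source kind, relation, target kind), so each
    # relation type stays separate and direction is never collapsed.
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_edge(self, src: Node, relation: str, dst: Node) -> None:
        self.edges[(src.kind, relation, dst.kind)].append((src.name, dst.name))

    def edge_types(self) -> list:
        return sorted(self.edges)


# Hypothetical topology: two services deployed on one host.
frontend = Node("frontend", "service")
cart = Node("cart", "service")
host_a = Node("host-a", "host")

graph = HeteroGraph()
graph.add_edge(frontend, "calls", cart)      # intra-layer: service -> service
graph.add_edge(frontend, "runs_on", host_a)  # cross-layer: service -> host
graph.add_edge(cart, "runs_on", host_a)
graph.add_edge(host_a, "hosts", frontend)    # reverse direction kept as its own type
graph.add_edge(host_a, "hosts", cart)

print(graph.edge_types())
```

Keeping `runs_on` and `hosts` as separate edge types is one way to let a model treat the two directions of a service-host relation differently, which is the property an asymmetric propagation pattern would call for.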
Load-bearing premise
That the asymmetric cross-layer fault patterns measured on the two chosen benchmarks are representative of industrial microservice deployments, and that separating services from hosts in the graph captures the main diagnostic signals without bias or omitted interactions.
What would settle it
Repeating the propagation analysis on the same benchmarks, or on a third independent microservice system, and checking whether the measured fault propagation remains dominated by asymmetric service-host cross-layer edges. If it turns out largely symmetric or intra-layer, the motivation for the heterogeneous-graph design would be undercut.
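One way such a check could be quantified is sketched below in Python. The input format (one `(source_kind, target_kind)` pair per observed propagation step), the function names, and the event counts are all hypothetical; the counts are made up for illustration and are not results from the paper.

```python
from collections import Counter


def propagation_profile(events):
    """Share of each directed (source_kind, target_kind) pair among all steps."""
    counts = Counter(events)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def cross_layer_share(profile):
    """Fraction of propagation steps that cross between entity kinds."""
    return sum(share for (src, dst), share in profile.items() if src != dst)


def asymmetry(profile, a="host", b="service"):
    """Signed imbalance in [-1, 1] between a->b and b->a; 0 means symmetric."""
    forward = profile.get((a, b), 0.0)
    backward = profile.get((b, a), 0.0)
    if forward + backward == 0:
        return 0.0
    return (forward - backward) / (forward + backward)


# Made-up propagation steps, for illustration only.
events = (
    [("host", "service")] * 6
    + [("service", "host")] * 2
    + [("service", "service")] * 2
)
profile = propagation_profile(events)
print(round(cross_layer_share(profile), 3))
print(round(asymmetry(profile), 3))
```

A cross-layer share near zero, or an asymmetry score near zero on a new system, would be the kind of outcome that weakens the design motivation.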
Original abstract
Microservice root cause localization is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to effectively capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.
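For readers unfamiliar with the reported metrics, the Python sketch below computes top-k accuracy under a common RCL convention: A@k is the fraction of fault cases whose true root cause appears among the k highest-ranked candidates, and the average top-5 score is the mean of A@1 through A@5. That the paper uses exactly these definitions is an assumption, and the rankings and entity names are invented for illustration.

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of cases whose true root cause is within the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def average_top_k(rankings, truths, k=5):
    """Mean of A@1 .. A@k (assumed definition of the averaged metric)."""
    return sum(top_k_accuracy(rankings, truths, j) for j in range(1, k + 1)) / k


# Hypothetical ranked candidates per fault case, with hypothetical ground truth.
rankings = [
    ["cart", "host-a", "frontend"],
    ["host-a", "cart", "frontend"],
    ["frontend", "cart", "host-b"],
    ["checkout", "host-a", "cart"],
]
truths = ["cart", "cart", "host-b", "payment"]

print(top_k_accuracy(rankings, truths, 1))
print(round(average_top_k(rankings, truths, 5), 3))
```

Note that candidates here mix services and hosts in one ranking, which matches a setting where the root cause may sit in either layer.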
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines heterogeneity in microservice root cause localization (RCL), arguing that existing methods address only partial aspects. Empirical analysis on two benchmarks shows entity-level heterogeneity produces asymmetric fault propagation dominated by service-host cross-layer interactions. This motivates NexusRCL, a semi-supervised heterogeneous graph framework treating services and hosts as distinct node types, augmented by event-based abstraction and active learning to capture data- and entity-level heterogeneity while reducing labeling effort. Evaluations on two industrial datasets report gains of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) over state-of-the-art baselines.
Significance. If the empirical patterns and performance gains hold under broader validation, the work offers concrete diagnostic value by identifying which heterogeneity dimensions (entity-level distinctions and cross-layer asymmetries) matter most for RCL. The benchmark-driven discovery of propagation asymmetry and the integration of active learning to limit labeling costs are positive contributions. The heterogeneous-graph design directly internalizes the observed patterns rather than treating heterogeneity as a generic modeling choice.
major comments (2)
- [Empirical analysis of benchmarks (motivating § on heterogeneous graph construction)] The central motivation and modeling decision in NexusRCL rest on the claim that entity-level heterogeneity in the two benchmarks produces representative asymmetric, cross-layer fault propagation. No quantitative comparison of benchmark scale, topology diversity, observability coverage, or failure-mode distribution against the industrial datasets is provided, leaving open whether the observed patterns are benchmark artifacts rather than general signals.
- [Evaluation section (industrial dataset results)] The reported improvements (49.85% A@1, 32.70% A@5) are presented without accompanying statistical significance tests, variance across runs, or detailed baseline re-implementation notes (e.g., hyper-parameter matching, feature sets). This weakens the strength of the superiority claim on the industrial datasets.
minor comments (2)
- [Abstract] The abstract omits any mention of statistical testing, baseline implementation details, or error analysis supporting the percentage improvements.
- [NexusRCL framework description] Notation for node types, edge semantics, and the event-based abstraction mechanism should be introduced with a small illustrative example or diagram early in the method description.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. The feedback on strengthening the empirical motivation and evaluation rigor is valuable. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Empirical analysis of benchmarks (motivating § on heterogeneous graph construction)] The central motivation and modeling decision in NexusRCL rest on the claim that entity-level heterogeneity in the two benchmarks produces representative asymmetric, cross-layer fault propagation. No quantitative comparison of benchmark scale, topology diversity, observability coverage, or failure-mode distribution against the industrial datasets is provided, leaving open whether the observed patterns are benchmark artifacts rather than general signals.
  Authors: We agree that a quantitative comparison would strengthen the argument that the observed asymmetric, cross-layer fault propagation is representative rather than benchmark-specific. In the revised manuscript, we will add a new table in the empirical analysis section comparing the two benchmarks and the industrial datasets along the suggested dimensions: scale (number of services/hosts), topology diversity (e.g., average degree, number of layers, connectivity metrics), observability coverage (availability of metrics, logs, traces), and failure-mode distributions. This will explicitly demonstrate alignment between the motivating patterns and industrial characteristics. The consistent performance gains on the industrial datasets already provide supporting evidence that the heterogeneity modeling is effective beyond the benchmarks, but the added comparison will make this case more rigorous.
  revision: yes
- Referee: [Evaluation section (industrial dataset results)] The reported improvements (49.85% A@1, 32.70% A@5) are presented without accompanying statistical significance tests, variance across runs, or detailed baseline re-implementation notes (e.g., hyper-parameter matching, feature sets). This weakens the strength of the superiority claim on the industrial datasets.
  Authors: We acknowledge that the current presentation of results lacks sufficient statistical rigor and implementation details. In the revised manuscript, we will expand the evaluation section to include: (1) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values) comparing NexusRCL against each baseline; (2) variance measures such as standard deviations across multiple runs (e.g., 5–10 random seeds); and (3) detailed baseline re-implementation notes, including hyperparameter tuning procedures, feature sets, and any adaptations made to ensure fair comparison. These additions will be supported by updated experimental setup descriptions and will allow readers to better evaluate the robustness of the reported gains.
  revision: yes
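To make the promised per-seed reporting concrete, here is a small standard-library Python sketch. It reports mean and standard deviation across seeds and runs an exact paired sign-flip permutation test, which is swapped in here as a dependency-free stand-in for the paired t-test or Wilcoxon signed-rank test the authors mention. The scores are hypothetical and are not numbers from the paper.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_p_value(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical A@1 scores over five random seeds (same seeds for both methods).
proposed = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline = [0.55, 0.58, 0.54, 0.57, 0.56]

print(f"proposed: {mean(proposed):.3f} +/- {stdev(proposed):.3f}")
print(f"baseline: {mean(baseline):.3f} +/- {stdev(baseline):.3f}")
print(f"p-value:  {paired_sign_flip_p_value(proposed, baseline):.4f}")
```

With only five seeds, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, even when the proposed method wins on every seed. That floor is one practical reason to prefer the upper end of the 5–10 seed range.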
Circularity Check
No circularity: empirical motivation followed by separate evaluation
full rationale
The paper's chain is: (1) empirical analysis of two benchmarks observes entity-level heterogeneity and asymmetric fault propagation; (2) this observation motivates modeling services and hosts as distinct node types in a heterogeneous graph; (3) the resulting NexusRCL is evaluated on separate industrial datasets, reporting A@1/A@5 gains. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central performance claims are presented as outcomes of evaluation rather than reductions to the motivating observations by construction. Benchmark representativeness is a generalizability concern, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The two microservice benchmarks used for empirical analysis are representative of real-world industrial systems.
- domain assumption: Modeling services and hosts as distinct node types in a heterogeneous graph, combined with event-based abstraction, captures the essential asymmetric propagation patterns.