TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Junle Wang; Wenjun Wu; Xingchuang Liao

arxiv: 2605.15611 · v1 · pith:4T44FVV7new · submitted 2026-05-15 · 💻 cs.AI


Pith reviewed 2026-05-20 19:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords root cause analysis · microservices · multi-agent framework · topology-aware reasoning · multimodal alignment · vector quantization · self-evolving systems · failure propagation


The pith

TopoEvo distinguishes root causes from downstream symptoms using topology-constrained multi-agent reasoning in microservices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Root cause analysis in microservices must contend with noisy multimodal signals, cascading failure paths, and frequent topology changes from autoscaling. TopoEvo first builds stable node representations by decomposing metrics into orthogonal subspaces and contrastively aligning logs and traces. It then discretizes these states into symptom tokens via vector quantization. On this foundation the system runs a multi-agent Hypothesis-Evidence-Test loop that checks whether candidate explanations are consistent with observed propagation, thereby isolating the initiating anomaly from later amplified symptoms. A self-evolving memory layer refreshes hierarchical incident records and applies conservative adaptation to keep performance stable under drift.
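As a rough illustration of the discretization step, the sketch below assigns each node state to its nearest codebook vector and reads the index off as a symptom token. The codebook values, the `quantize` and `tokenize` helpers, and the lexicon labels are all hypothetical; the paper's learned codebook and lexicon are not reproduced here.

```python
import math
from typing import Dict, List, Sequence

def quantize(state: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `state` (Euclidean)."""
    distances = [math.dist(state, code) for code in codebook]
    return distances.index(min(distances))

# Hypothetical 2-D codebook with three entries and an illustrative lexicon.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
LEXICON: Dict[int, str] = {0: "nominal", 1: "latency-spike", 2: "error-burst"}

def tokenize(states: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Map each service's continuous state to a human-readable symptom token."""
    return {svc: LEXICON[quantize(vec, CODEBOOK)] for svc, vec in states.items()}

tokens = tokenize({"cart": (0.9, 0.1), "checkout": (0.1, 0.8), "auth": (0.05, 0.02)})
```

Because every state collapses to one of a small set of labels, a reviewer can audit which token a service received without inspecting the raw embedding.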

Core claim

TopoEvo couples graph representation learning with structured reasoning by introducing Metric-orthogonal Multimodal Alignment to reduce redundancy, vector quantization to produce auditable symptom tokens, a multi-agent Hypothesis-Evidence-Test workflow to verify propagation-consistent explanations, and a self-evolving mechanism that refreshes incident memory and performs test-time adaptation, thereby separating initiating anomalies from amplified downstream symptoms.
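To make the decomposition idea concrete, the sketch below splits one metric embedding into two components with two linear maps and measures how far the maps are from spanning orthogonal subspaces. The penalty used here (the squared Frobenius norm of `W_u @ W_v^T`) is one common way to encourage orthogonality and is an assumption; the summary does not state which regularizer the paper uses, and the matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def orthogonality_penalty(W_u: Matrix, W_v: Matrix) -> float:
    """Squared Frobenius norm of W_u @ W_v^T; zero when every row of W_u
    is orthogonal to every row of W_v."""
    return sum(
        sum(a * b for a, b in zip(row_u, row_v)) ** 2
        for row_u in W_u
        for row_v in W_v
    )

# Hypothetical 4-D metric embedding split into two 2-D components.
W_u = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_v = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
x_metric = [0.5, -1.0, 2.0, 0.25]

u = matvec(W_u, x_metric)   # component intended for log-consistent alignment
v = matvec(W_v, x_metric)   # complementary component
penalty = orthogonality_penalty(W_u, W_v)
```

In a trained model the two maps would be learned, and a penalty of this kind would be added to the loss rather than being exactly zero.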

What carries the argument

The multi-agent Hypothesis-Evidence-Test (HET) workflow that verifies propagation-consistent explanations on discrete symptom tokens derived from topology-enhanced states.
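One simple reading of "propagation-consistent" can be sketched as a reachability test: a candidate root explains the incident only if every anomalous service transitively depends on it. This is a deliberately reduced stand-in for the HET loop, not the paper's agents; the call graph, the rule that failures travel from callee to caller, and the helper names are assumptions.

```python
from collections import deque
from typing import Dict, List, Set

def dependents_of(root: str, calls: Dict[str, List[str]]) -> Set[str]:
    """Services that transitively call `root` (and so could inherit its failure),
    including `root` itself."""
    callers: Dict[str, List[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for parent in callers.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def consistent_roots(anomalous: Set[str], calls: Dict[str, List[str]]) -> List[str]:
    """Hypothesis: each anomalous service. Test: does it explain all anomalies?"""
    return sorted(c for c in anomalous if anomalous <= dependents_of(c, calls))

# Hypothetical call chain: frontend -> checkout -> payment -> db
CALLS = {"frontend": ["checkout"], "checkout": ["payment"], "payment": ["db"]}
roots = consistent_roots({"frontend", "checkout", "payment"}, CALLS)
```

Here `frontend` and `checkout` look anomalous but are rejected as roots because neither can account for the anomaly at `payment`, which is the downstream-victim pattern the workflow is meant to filter out.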

If this is right

Separates initiating anomalies from amplified downstream symptoms.
Enables reliable retrieval and token-level evidence grounding.
Maintains robustness under non-stationary topology drift through conservative test-time adaptation.
Couples graph representation learning directly with structured, verifiable reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-level grounding could support integration with existing incident dashboards for human review of automated diagnoses.
Propagation-consistency checks might transfer to root-cause tasks in other dynamic graphs such as sensor networks or distributed ledgers.
Hierarchical memory refresh could reduce the need for full retraining when new service versions are deployed.

Load-bearing premise

Metric-orthogonal multimodal alignment together with vector quantization will produce stable node representations that reduce redundancy and sparsity enough to support reliable topology-constrained reasoning under non-stationary drift.

What would settle it

On a microservices benchmark containing documented cascading failures and controlled topology changes, measure whether the framework attributes the root cause to the known initiator rather than a downstream victim in the majority of cases after each drift event.
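The check described above amounts to a per-drift-event accuracy count. A minimal sketch follows, assuming each test case records the drift event it follows, the predicted root, and the known initiator; the case tuples are made up for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

def initiator_accuracy(cases: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """cases: (drift_event, predicted_root, true_initiator).
    Returns the fraction of correct attributions per drift event."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for event, predicted, truth in cases:
        totals[event] = totals.get(event, 0) + 1
        hits[event] = hits.get(event, 0) + (predicted == truth)
    return {event: hits[event] / totals[event] for event in totals}

# Made-up outcomes, for illustration only.
CASES = [
    ("autoscale-1", "db", "db"),
    ("autoscale-1", "checkout", "db"),       # blamed a downstream victim
    ("autoscale-1", "db", "db"),
    ("rolling-update-1", "auth", "auth"),
    ("rolling-update-1", "auth", "auth"),
]
accuracy = initiator_accuracy(CASES)
majority_after_each_drift = all(rate > 0.5 for rate in accuracy.values())
```

Reporting the rate per event, rather than pooled, is what exposes a framework that is accurate on average but degrades after a particular topology change.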

Figures

Figures reproduced from arXiv: 2605.15611 by Junle Wang, Wenjun Wu, Xingchuang Liao.

**Figure 2.** The overview of TopoEvo. … dedicated to log-consistent and trace-consistent alignment, respectively, thereby preserving information while reducing redundancy in downstream graph reasoning. Given metric embeddings $x_v^{\mathrm{metric}}$, two components are produced as $u_v = W_u x_v^{\mathrm{metric}}$, $v_v = W_v x_v^{\mathrm{metric}}$ (3), where $u_v$ is intended to capture metric factors most consistent with log evidence, while $v_v$ captures compleme…

**Figure 3.** Illustration of vector quantization and symptom vocabulary construc…

**Figure 4.** Main prompt of Hypothesis Planner. … provide compact, consistent context for these agents, while topology saliency focuses attention on propagation-relevant regions. 1) Shared context: Candidate-centric subgraph and tokenized evidence. A candidate-centric subgraph is constructed to keep the workflow focused. TopoEvo selects the top-K candidate nodes according to the root-cause score $s(v)$ and extracts an H-h…

**Figure 6.** Parameter sensitivity analysis. K denotes the VQ partition (codebook…

**Figure 5.** Parameter sensitivity analysis. K denotes the VQ partition (codebook…

**Figure 7.** In case study experiments, we injected a…
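The Figure 4 excerpt mentions selecting the top-K candidates by root-cause score s(v) and then extracting a neighborhood around them; the excerpt is cut off at that point. The sketch below assumes the extraction is an H-hop neighborhood over an undirected view of the service graph, which is a guess at the truncated text. The scores, the adjacency lists, and the helper names are hypothetical.

```python
import heapq
from collections import deque
from typing import Dict, List, Set

def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring nodes, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

def h_hop(seeds: List[str], adj: Dict[str, List[str]], h: int) -> Set[str]:
    """Nodes within `h` hops of any seed, found by multi-source BFS."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == h:
            continue  # do not expand beyond the hop budget
        for nxt in adj.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)

# Hypothetical root-cause scores s(v) and undirected adjacency lists.
SCORES = {"db": 0.9, "payment": 0.7, "frontend": 0.2, "auth": 0.1, "search": 0.05}
ADJ = {
    "frontend": ["checkout", "auth", "search"],
    "checkout": ["frontend", "payment"],
    "payment": ["checkout", "db"],
    "db": ["payment"],
    "auth": ["frontend"],
    "search": ["frontend"],
}
candidates = top_k(SCORES, 2)
subgraph_nodes = h_hop(candidates, ADJ, 1)
```

Keeping K and H small bounds the context handed to the agents, at the cost of missing a root that lies farther than H hops from every high-scoring node.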

Original abstract

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
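The last sentence of the abstract describes conservative test-time adaptation driven by high-confidence pseudo-labels. A minimal sketch of that pattern follows, assuming "conservative" means a confidence threshold combined with a small update step; the threshold of 0.9, the rate of 0.1, and both helper names are illustrative choices, not values from the paper.

```python
from typing import List, Sequence, Tuple

def select_pseudo_labels(
    predictions: List[Tuple[str, str, float]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep (incident_id, label) pairs whose confidence meets the threshold."""
    return [(iid, label) for iid, label, conf in predictions if conf >= threshold]

def ema_update(
    old: Sequence[float], new: Sequence[float], rate: float = 0.1
) -> List[float]:
    """Move a stored prototype a small step toward a newly observed embedding."""
    return [(1 - rate) * o + rate * n for o, n in zip(old, new)]

# Made-up predictions: (incident id, predicted root, confidence).
PREDICTIONS = [("inc-1", "db", 0.97), ("inc-2", "auth", 0.55), ("inc-3", "db", 0.92)]
selected = select_pseudo_labels(PREDICTIONS)

# Only confident incidents would be allowed to nudge the stored prototype.
prototype = ema_update([0.0, 1.0], [1.0, 1.0])
```

The low-confidence prediction for `inc-2` is simply discarded, so a wrong guess made during drift cannot pull the model toward itself.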

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TopoEvo adds topology constraints and a hypothesis-evidence-test loop to multi-agent RCA, with ablations tying F1 gains to the new modules under simulated drift.


TopoEvo combines topology-aware graph learning with a multi-agent Hypothesis-Evidence-Test workflow to tackle root cause analysis in microservices while trying to avoid blaming downstream symptoms. The new parts are the Metric-orthogonal Multimodal Alignment that reduces redundancy across metrics, logs, and traces, the vector quantization step that turns states into symptom tokens with a lexicon, and the self-evolving memory for handling drift.

What the paper does well is provide the details on how MOMA uses orthogonal decomposition and contrastive alignment, plus ablation results that connect the F1 improvements specifically to these additions under simulated non-stationary conditions. That makes the contribution more traceable. The main soft spot is the evaluation setup. Everything is tested in simulated drift and topology changes rather than on real incident data from production systems, which often have additional complexities like missing traces or correlated but unrelated events. This could mean the reported gains overstate the practical impact.

This is for people in the AI for systems reliability area who are looking at agent-based approaches to automated diagnosis. It offers a way to make the reasoning more structured and auditable. I recommend putting it through peer review. The framework is coherent, the experiments back the claims without obvious circularity, and the topic is relevant enough to warrant referee input.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes TopoEvo, a topology-aware self-evolving multi-agent framework for root cause analysis (RCA) in microservices. It introduces Metric-orthogonal Multimodal Alignment (MOMA) to decompose and align multimodal observability data into stable node representations, Vector Quantization (VQ) to discretize states into auditable symptom tokens with a lexicon, a multi-agent Hypothesis-Evidence-Test (HET) workflow to verify propagation-consistent explanations and isolate initiating anomalies, and a self-evolving mechanism for hierarchical memory refresh and test-time adaptation under topology drift induced by autoscaling.
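For readers unfamiliar with contrastive alignment, the sketch below computes an InfoNCE-style loss that pulls a metric component toward the log embedding of the same node and away from other nodes' log embeddings. InfoNCE is a standard contrastive objective chosen here for illustration; the summary does not say which loss the paper uses, and the vectors and temperature are hypothetical.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def info_nce(
    anchor: Sequence[float],
    candidates: List[Sequence[float]],
    positive_index: int,
    temperature: float = 0.1,
) -> float:
    """-log softmax of the positive pair's similarity among all candidates."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]

# Hypothetical metric component and three candidate log embeddings.
metric_component = (1.0, 0.0)
log_embeddings = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

aligned_loss = info_nce(metric_component, log_embeddings, positive_index=0)
misaligned_loss = info_nce(metric_component, log_embeddings, positive_index=1)
```

The loss is near zero when the positive log embedding already matches the metric component and grows when the true pair is less similar than the distractors.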

Significance. If the reported ablation gains hold, the work could meaningfully advance LLM-based RCA by reducing symptom-amplification bias through explicit topology constraints and discrete token grounding. The coupling of graph representation learning with structured HET reasoning, together with the self-evolving adaptation, addresses practical challenges in non-stationary microservice environments. The empirical results tying F1 improvements to the proposed components (MOMA, VQ, and topology in HET) under simulated drift constitute a concrete strength.

major comments (1)

[§5.1] Ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.

minor comments (2)

[§4.2] The symptom lexicon construction in the VQ section would benefit from an explicit example of token-to-evidence mapping to clarify how retrieval supports the Test phase of HET.
Figure 3 caption and axis labels could more clearly distinguish the contribution of topology constraints versus the self-evolving memory update.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of TopoEvo's contributions and for the recommendation of minor revision. The single major comment is addressed point-by-point below. We agree that additional granularity in the drift ablation will strengthen the manuscript and will incorporate the requested changes.


Referee: [§5.1] Ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.

Authors: We agree that the current presentation of aggregate F1 scores across all drift scenarios limits the ability to isolate the contribution of MOMA+VQ stability to HET performance under each drift mechanism. In the revised manuscript we will add per-drift-type breakdowns (rolling-update vs. autoscaling) for the key ablation configurations (full TopoEvo, w/o MOMA, w/o VQ, w/o topology in HET). We will also report statistical significance (paired t-tests with Bonferroni correction, or Wilcoxon signed-rank tests where normality assumptions are violated) on the F1 deltas between configurations for each drift type. These additions will be placed in an expanded §5.1 and the corresponding appendix table, directly strengthening the empirical link between the proposed components and the separation of root causes from cascading symptoms.
Revision: yes
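The per-drift-type test the authors promise could look like the sketch below: an exact two-sided Wilcoxon signed-rank test on paired F1 scores, with a Bonferroni-adjusted threshold for the two drift types. The exact test enumerates every sign pattern, so it only suits a handful of paired runs. All scores are made up (F1 × 100, kept as integers so ties compare exactly) and are not results from the paper.

```python
from itertools import product
from typing import Dict, List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending (1-based), giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def wilcoxon_exact(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples.
    Zero differences are dropped; cost is 2**n, so keep n small."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        extreme += abs(w - total / 2) >= observed - 1e-12
    return extreme / 2 ** len(ranks)

# Made-up paired F1 x 100 over eight runs: (full model, ablated model).
RUNS: Dict[str, tuple] = {
    "rolling-update": ([82, 79, 85, 80, 83, 81, 84, 78], [75, 77, 78, 70, 80, 76, 80, 77]),
    "autoscaling": ([80, 78, 82, 79, 81, 80, 83, 77], [79, 80, 81, 80, 78, 82, 82, 78]),
}
alpha = 0.05 / len(RUNS)  # Bonferroni: split 0.05 across the two drift types
p_values = {drift: wilcoxon_exact(full, ablated) for drift, (full, ablated) in RUNS.items()}
significant = {drift: p < alpha for drift, p in p_values.items()}
```

In this toy data the gain is consistent under rolling updates but not under autoscaling, which is exactly the kind of difference an aggregate F1 would hide.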

Circularity Check

0 steps flagged

No significant circularity detected


The manuscript describes a multi-agent RCA framework at the level of architectural components (MOMA for orthogonal alignment, VQ for symptom tokenization, HET workflow, and self-evolving memory) without presenting equations, derivations, or fitted-parameter predictions that could reduce to inputs by construction. Ablation results and empirical F1 gains are tied to explicit component implementations rather than asserted via self-citation chains or self-definitional loops. The central claim remains externally falsifiable through topology-constrained reasoning benchmarks and does not rely on load-bearing self-references or renamed empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on domain assumptions about microservice observability and failure propagation; it introduces new conceptual modules without independent empirical grounding in the abstract.

axioms (1)

Domain assumption: Microservice systems exhibit noisy and heterogeneous multimodal observability, cascading failure propagation, and non-stationary topology drift.
Explicitly listed as the three core challenges in the abstract.

invented entities (1)

symptom tokens (no independent evidence)
Purpose: Discrete auditable units obtained via vector quantization for reliable retrieval and token-level evidence grounding.
Introduced as output of the VQ step on topology-enhanced states.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: MOMA decomposes metric embeddings into orthogonal subspaces and contrastively aligns logs/traces; VQ discretizes topology-enhanced states into symptom tokens; HET workflow verifies propagation-consistent hypotheses.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Topology-aware GAT encoder + self-evolving mechanism under non-stationary drift.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Journal of Systems and Software178, 110964 (2021)

Soldani, J., Brogi, A.: Anomaly Detection and Failure Root Cause Anal- ysis in (Micro)Service-Based Cloud Applications: A Survey. Journal of Systems and Software178, 110964 (2021)

work page 2021

[2] [2]

arXiv preprint arXiv:2408.00803 (2024)

Wang, T., Qi, G.: A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends. arXiv preprint arXiv:2408.00803 (2024)

work page arXiv 2024

[3] [3]

RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data,

L. Pham, H. Zhang, H. Ha, F. Salim, and X. Zhang, “RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data,” inCompanion Proceedings of the ACM Web Confer- ence 2025 (WWW Companion), pp. 777–780, 2025

work page 2025

[4] [4]

Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph,

Z. Yao, C. Pei, X. Nie, et al., “Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph,” inProceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE), 2024

work page 2024

[5] [5]

In: Proceedings of the 45th International Conference on Software Engineering (ICSE), pp

Pan, X., Yu, Y ., Zhang, H.: DyCause: Dynamic Causal Inference for Root Cause Analysis. In: Proceedings of the 45th International Conference on Software Engineering (ICSE), pp. 435–446 (2023)

work page 2023

[6] [6]

In: IEEE International Confer- ence on Services Computing (SCC), pp

Wu, G., Liu, D., Zhang, H.: MMRCA: Multimodal Root Cause Analysis via Fusion of Logs, Metrics and Traces. In: IEEE International Confer- ence on Services Computing (SCC), pp. 110–120 (2021)

work page 2021

[7] [7]

In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp

Li, X., Zhang, H., Pham, L., et al.: MRCA: Metric-Level Root Cause Analysis for Microservices via Multi-Modal Data. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1115–1127 (2024)

work page 2024

[8] [8]

In: Proceedings of the ACM Web Conference (WWW), pp

Lou, J., Yu, H., Zhang, M., et al.: Unimodal is Not Enough: A Bench- mark and Unified Framework for Multimodal Anomaly Detection in Logs, Metrics, and Traces. In: Proceedings of the ACM Web Conference (WWW), pp. 2841–2852 (2023)

work page 2023

[9] [9]

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models,

Z. Wang, Z. Liu, Y . Zhao, et al., “RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models,” inProceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024

work page 2024

[10] [10]

Flow-of-Action: SOP Enhanced LLM- Based Multi-Agent System for Root Cause Analysis,

C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li, G. Xie, and D. Pei, “Flow-of-Action: SOP Enhanced LLM- Based Multi-Agent System for Root Cause Analysis,” inCompanion Proceedings of the ACM Web Conference 2025 (WWW Companion), pp. 422–431, 2025

work page 2025

[11] [11]

mABC: Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture,

W. Zhang, H. Guo, J. Yang, Z. Tian, Y . Zhang, C. Yan, Z. Li, T. Li, X. Shi, L. Zheng, and B. Zhang, “mABC: Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture,” inFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 4017–4033, 2024

[12]

P. Tang, S. Tang, H. Pu, Z. Miao, and Z. Wang, “MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents,” arXiv preprint arXiv:2509.15635, 2025

[13]

L. Zheng, Z. Chen, and H. Chen, “Online Multi-modal Root Cause Identification in Microservice Systems,” in 2025 IEEE International Conference on Big Data (BigData), pp. 1–8, 2025

[14]

Phan, X., Li, M., Zhang, Q.: MicroCERCL: Root Cause Localization in Cloud-Edge Microservice Environments. arXiv preprint arXiv:2406.13604 (2024)

[15]

Li, Z., Chen, J., Jiao, R., et al.: Practical Root Cause Localization for Microservice Systems via Trace Analysis. In: IEEE International Conference on Cloud Computing (CLOUD), pp. 25–34 (2021)

[16]

Jha, S., et al.: Localizing and Explaining Faults in Microservices Using Distributed Traces and Logs. In: IEEE International Conference on Cloud Computing (CLOUD), pp. 53–63 (2022)

[17]

Li, Z., Zhang, H., Wang, C.: MicroIRC: Instance-Level Root Cause Localization for Microservice Systems. Journal of Systems and Software 206, 111520 (2023)

[18]

Chillarege, R., Bhandari, I., Chaar, J., et al.: Orthogonal Defect Classification—A Concept for In-Process Measurements. IEEE Trans. Softw. Eng. 18(11), 943–956 (1992)

[19]

Nguyen, H.A., Tan, J., Gu, X., et al.: PAL: Performance Anomaly Localization for Cloud Systems. In: Proceedings of the 17th ACM SIGKDD, pp. 1230–1238 (2011)

[20]

Jia, Z., He, J., Xu, X., et al.: Fault Localization for Microservice Systems via Graph Neural Network Learning. In: IEEE International Conference on Services Computing (SCC), pp. 218–225 (2020)

[21]

Yang, Y., Zhou, M., Yang, S., et al.: Unsupervised Root Cause Analysis for Microservice Systems via Spatio-Temporal Graph Modeling. In: IEEE/ACM International Conference on Software Engineering (ICSE), pp. 156–166 (2020)

[22]

Guo, Y., Wang, J., Li, Y., et al.: CARR: A Causal-Aware Neural Approach for Root Cause Localization in Cloud Systems. In: Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 3922–3930 (2022)

[23]

Lin, Y., Fang, Y., Zhang, H., et al.: Causal Tracing for Root Cause Localization in Microservices. In: IEEE International Symposium on Software Reliability Engineering (ISSRE), pp. 33–44 (2022)

[24]

Wang, H., Wu, Z., Jiang, H., et al.: Groot: An Event-Graph-Based Approach for Root Cause Analysis in Industrial Settings. arXiv preprint arXiv:2108.00344 (2021)

[25]

A. Ikram, S. Chakraborty, S. Mitra, S. K. Saini, S. Bagchi, and M. Kocaoglu, “Root Cause Analysis of Failures in Microservices through Causal Discovery,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

[26]

Chakraborty, S., et al.: Root Cause Analysis of Failures in Microservices through Causal Structure Discovery. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

[27]

Zhou, Y., Chen, T., Liu, Z., et al.: LatentScope: Unsupervised Root Cause Analysis with Limited Observability. In: ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pp. 1–13 (2024)

[28]

Z. Xie, S. Zhang, Y. Geng, X. Nie, Z. Yao, L. Xu, Y. Sun, W. Li, and D. Pei, “Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024

[29]

L. Pham, H. Ha, and H. Zhang, “BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 2214–2237, 2024

[30]

Neumann, A.: Causality Based Instant Root Cause Analysis for Microservices Failure. In: NCTA Conference Proceedings, pp. 140–152 (2024)

[31]

Pham, L., Ha, H., Zhang, H.: Root Cause Analysis for Microservice Systems Based on Causal Inference: How Far Are We? arXiv preprint arXiv:2408.13729 (2024)

[32]

DoWhy Contributors: Root Cause Analysis of Latencies in a Microservice Architecture. DoWhy Example Notebook (2024), https://www.pywhy.org

[33]

Zhang, H., Pham, L., Liu, Y.: Intelligent Root Cause Localization in MicroService Systems. In: ACM Symposium on Cloud Computing (SoCC) (2025)

[34]

L. Wang, C. Zhang, R. Ding, Y. Xu, Q. Chen, W. Zou, Q. Chen, M. Zhang, X.-C. Gao, H. Fan, S. Rajmohan, Q. Lin, and D. Zhang, “Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 5116–5125, 2023

[35]

L. Zhang, T. Jia, K. Wang, W. Hong, C. Duan, M. He, and Y. Li, “Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought,” arXiv preprint arXiv:2508.20370, 2025

[36]

L. Zhang, T. Jia, Y. Zhai, L. Pan, C. Duan, M. He, M. Jia, and Y. Li, “Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices,” arXiv preprint arXiv:2601.02732, 2026

[37]

G. Yu, P. Chen, Y. Li, et al., “Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2023

[38]

C. Lee, T. Yang, Z. Chen, Y. Su, and M. R. Lyu, “Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data,” in Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE), pp. 1750–1762, 2023

[39]

Y. Han, Q. Du, Y. Huang, P. Li, X. Shi, J. Wu, P. Fang, F. Tian, and C. He, “Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data,” IEEE Transactions on Services Computing, vol. 17, no. 6, pp. 3789–3802, 2024

[40]

Zhang X, Wang Q, Li M, et al. TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems[J]. IEEE Transactions on Services Computing, 2025, 18(6): 4221-4233
