TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices
Pith reviewed 2026-05-20 19:07 UTC · model grok-4.3
The pith
TopoEvo distinguishes root causes from downstream symptoms using topology-constrained multi-agent reasoning in microservices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TopoEvo couples graph representation learning with structured reasoning by introducing Metric-orthogonal Multimodal Alignment to reduce redundancy, vector quantization to produce auditable symptom tokens, a multi-agent Hypothesis-Evidence-Test workflow to verify propagation-consistent explanations, and a self-evolving mechanism that refreshes incident memory and performs test-time adaptation, thereby separating initiating anomalies from amplified downstream symptoms.
What carries the argument
The multi-agent Hypothesis-Evidence-Test (HET) workflow, which verifies propagation-consistent explanations on discrete symptom tokens derived from topology-enhanced states.
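One way to picture a propagation-consistency test is a reachability check on the service call graph. The sketch below is an illustration under stated assumptions, not the paper's HET implementation: it assumes a fault in a callee propagates to its transitive callers, and the topology, service names, and the functions `upstream_impact` and `consistent_candidates` are hypothetical.

```python
from collections import deque

def upstream_impact(calls, root):
    """Return every service that could be affected if `root` fails.

    `calls` maps caller -> list of callees. Assumption: a fault in a
    callee propagates to its (transitive) callers.
    """
    # Invert the call graph so we can walk from a callee to its callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for parent in callers.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def consistent_candidates(calls, anomalous):
    """Keep anomalous services whose impact set covers every anomaly."""
    return sorted(
        s for s in anomalous
        if set(anomalous) <= upstream_impact(calls, s)
    )

# Hypothetical topology: frontend -> cart -> db, frontend -> search
calls = {"frontend": ["cart", "search"], "cart": ["db"], "search": []}
anomalous = {"frontend", "cart", "db"}
candidates = consistent_candidates(calls, anomalous)
print(candidates)
```

In this toy case only `db` can explain all three anomalous services, so the salient downstream victims `frontend` and `cart` are ruled out as initiators.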
If this is right
- Separates initiating anomalies from amplified downstream symptoms.
- Enables reliable retrieval and token-level evidence grounding.
- Maintains robustness under non-stationary topology drift through conservative test-time adaptation.
- Couples graph representation learning directly with structured, verifiable reasoning.
Where Pith is reading between the lines
- The token-level grounding could support integration with existing incident dashboards for human review of automated diagnoses.
- Propagation-consistency checks might transfer to root-cause tasks in other dynamic graphs such as sensor networks or distributed ledgers.
- Hierarchical memory refresh could reduce the need for full retraining when new service versions are deployed.
Load-bearing premise
Metric-orthogonal multimodal alignment together with vector quantization will produce stable node representations that reduce redundancy and sparsity enough to support reliable topology-constrained reasoning under non-stationary drift.
What would settle it
On a microservices benchmark containing documented cascading failures and controlled topology changes, measure whether the framework attributes the root cause to the known initiator rather than a downstream victim in the majority of cases after each drift event.
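The check described above amounts to a top-1 attribution accuracy grouped by drift event. A minimal sketch, assuming each incident record carries a hypothetical `drift_event` label, the known `initiator`, and the framework's `predicted` root cause; the field names and sample records are illustrative and not taken from the paper.

```python
from collections import defaultdict

def accuracy_by_drift(incidents):
    """Fraction of incidents per drift event where the known initiator was named."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for inc in incidents:
        totals[inc["drift_event"]] += 1
        if inc["predicted"] == inc["initiator"]:
            hits[inc["drift_event"]] += 1
    return {event: hits[event] / totals[event] for event in totals}

def majority_correct(acc, threshold=0.5):
    """True only if accuracy exceeds the threshold after every drift event."""
    return all(a > threshold for a in acc.values())

# Invented placeholder records, for illustration only.
incidents = [
    {"drift_event": "autoscale-1", "initiator": "db", "predicted": "db"},
    {"drift_event": "autoscale-1", "initiator": "cart", "predicted": "frontend"},
    {"drift_event": "autoscale-1", "initiator": "db", "predicted": "db"},
    {"drift_event": "rolling-update-1", "initiator": "auth", "predicted": "auth"},
    {"drift_event": "rolling-update-1", "initiator": "auth", "predicted": "gateway"},
]
acc = accuracy_by_drift(incidents)
print(acc, majority_correct(acc))
```

Reporting the accuracy per drift event, rather than pooled, keeps a strong result on one drift type from hiding a failure on another.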
Original abstract
Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TopoEvo, a topology-aware self-evolving multi-agent framework for root cause analysis (RCA) in microservices. It introduces Metric-orthogonal Multimodal Alignment (MOMA) to decompose and align multimodal observability data into stable node representations, Vector Quantization (VQ) to discretize states into auditable symptom tokens with a lexicon, a multi-agent Hypothesis-Evidence-Test (HET) workflow to verify propagation-consistent explanations and isolate initiating anomalies, and a self-evolving mechanism for hierarchical memory refresh and test-time adaptation under topology drift induced by autoscaling.
Significance. If the reported ablation gains hold, the work could meaningfully advance LLM-based RCA by reducing symptom-amplification bias through explicit topology constraints and discrete token grounding. The coupling of graph representation learning with structured HET reasoning, together with the self-evolving adaptation, addresses practical challenges in non-stationary microservice environments. The empirical results tying F1 improvements to the proposed components (MOMA, VQ, and topology in HET) under simulated drift constitute a concrete strength.
major comments (1)
- [§5.1] Ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.
minor comments (2)
- [§4.2] The symptom lexicon construction in the VQ section would benefit from an explicit example of token-to-evidence mapping to clarify how retrieval supports the Test phase of HET.
- Figure 3 caption and axis labels could more clearly distinguish the contribution of topology constraints versus the self-evolving memory update.
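The token-to-evidence mapping requested in the first minor comment could look roughly like the following. This is a sketch under stated assumptions rather than the paper's lexicon: the two-dimensional codebook, the token names `T0` to `T2`, and the symptom descriptions are all hypothetical, and a trained VQ layer would learn its code vectors instead of hard-coding them.

```python
import math

# Hypothetical 2-D codebook; a trained VQ layer would learn these vectors.
CODEBOOK = {
    "T0": (0.0, 0.0),
    "T1": (1.0, 0.0),
    "T2": (0.0, 1.0),
}

# Hypothetical lexicon mapping each token to a human-readable symptom.
LEXICON = {
    "T0": "nominal",
    "T1": "latency spike",
    "T2": "error-rate surge",
}

def quantize(state):
    """Return the token whose code vector is nearest in Euclidean distance."""
    return min(CODEBOOK, key=lambda tok: math.dist(state, CODEBOOK[tok]))

def ground(service_states):
    """Map each service's continuous state to (token, symptom) evidence."""
    evidence = {}
    for service, state in service_states.items():
        token = quantize(state)
        evidence[service] = (token, LEXICON[token])
    return evidence

# Invented node states, for illustration only.
states = {"cart": (0.9, 0.1), "db": (0.2, 0.8), "search": (0.1, 0.05)}
evidence = ground(states)
for service, (token, symptom) in evidence.items():
    print(f"{service}: {token} ({symptom})")
```

Because each service resolves to a discrete token with a named symptom, a Test-phase agent could cite the token as evidence and a human reviewer could audit it against the lexicon.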
Simulated Author's Rebuttal
We thank the referee for the positive assessment of TopoEvo's contributions and for the recommendation of minor revision. The single major comment is addressed point-by-point below. We agree that additional granularity in the drift ablation will strengthen the manuscript and will incorporate the requested changes.
- Referee: [§5.1] Ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.
  Authors: We agree that the current presentation of aggregate F1 scores across all drift scenarios limits the ability to isolate the contribution of MOMA+VQ stability to HET performance under each drift mechanism. In the revised manuscript we will add per-drift-type breakdowns (rolling-update vs. autoscaling) for the key ablation configurations (full TopoEvo, w/o MOMA, w/o VQ, w/o topology in HET). We will also report statistical significance (paired t-tests with Bonferroni correction, or Wilcoxon signed-rank tests where normality assumptions are violated) on the F1 deltas between configurations for each drift type. These additions will be placed in an expanded §5.1 and the corresponding appendix table, directly strengthening the empirical link between the proposed components and the separation of root causes from cascading symptoms.
  Revision: yes
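The significance analysis promised in the response could be sketched as follows. This is a standard-library illustration, not the authors' analysis code: it computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments (practical only for small samples), then applies a Bonferroni correction across drift types, which is an assumption about how the two would be combined. All F1 values are invented placeholders. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product

def average_ranks(values):
    """Rank values ascending, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples.

    Zero differences are dropped. The null distribution is enumerated
    over all 2**n sign assignments, so keep n small (about 15 or fewer).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** len(ranks)

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Invented per-seed F1 scores (full TopoEvo vs. w/o MOMA), illustration only.
f1 = {
    "rolling-update": ([0.81, 0.79, 0.83, 0.80, 0.82, 0.84],
                       [0.75, 0.76, 0.78, 0.74, 0.79, 0.77]),
    "autoscaling": ([0.80, 0.78, 0.82, 0.79, 0.85, 0.81],
                    [0.78, 0.79, 0.77, 0.75, 0.79, 0.78]),
}
raw = {drift: wilcoxon_exact(full, ablated) for drift, (full, ablated) in f1.items()}
corrected = bonferroni(raw)
print(raw, corrected)
```

With only six paired runs the smallest attainable two-sided p-value is 2/64, so after correction neither comparison would clear a 0.05 threshold; a per-drift-type analysis of this kind needs enough seeds to have any power.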
Circularity Check
No significant circularity detected
The manuscript describes a multi-agent RCA framework at the level of architectural components (MOMA for orthogonal alignment, VQ for symptom tokenization, HET workflow, and self-evolving memory) without presenting equations, derivations, or fitted-parameter predictions that could reduce to inputs by construction. Ablation results and empirical F1 gains are tied to explicit component implementations rather than asserted via self-citation chains or self-definitional loops. The central claim remains externally falsifiable through topology-constrained reasoning benchmarks and does not rely on load-bearing self-references or renamed empirical patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Microservice systems exhibit noisy and heterogeneous multimodal observability, cascading failure propagation, and non-stationary topology drift.
invented entities (1)
- symptom tokens: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  MOMA decomposes metric embeddings into orthogonal subspaces and contrastively aligns logs/traces; VQ discretizes topology-enhanced states into symptom tokens; HET workflow verifies propagation-consistent hypotheses.
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Topology-aware GAT encoder + self-evolving mechanism under non-stationary drift.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.