arxiv: 2605.15611 · v1 · pith:4T44FVV7new · submitted 2026-05-15 · 💻 cs.AI

TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Pith reviewed 2026-05-20 19:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords root cause analysis · microservices · multi-agent framework · topology-aware reasoning · multimodal alignment · vector quantization · self-evolving systems · failure propagation

The pith

TopoEvo distinguishes root causes from downstream symptoms using topology-constrained multi-agent reasoning in microservices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Root cause analysis in microservices must contend with noisy multimodal signals, cascading failure paths, and frequent topology changes from autoscaling. TopoEvo first builds stable node representations by decomposing metrics into orthogonal subspaces and contrastively aligning logs and traces. It then discretizes these states into symptom tokens via vector quantization. On this foundation the system runs a multi-agent Hypothesis-Evidence-Test loop that checks whether candidate explanations are consistent with observed propagation, thereby isolating the initiating anomaly from later amplified symptoms. A self-evolving memory layer refreshes hierarchical incident records and applies conservative adaptation to keep performance stable under drift.

Core claim

TopoEvo couples graph representation learning with structured reasoning by introducing Metric-orthogonal Multimodal Alignment to reduce redundancy, vector quantization to produce auditable symptom tokens, a multi-agent Hypothesis-Evidence-Test workflow to verify propagation-consistent explanations, and a self-evolving mechanism that refreshes incident memory and performs test-time adaptation, thereby separating initiating anomalies from amplified downstream symptoms.

What carries the argument

The multi-agent Hypothesis-Evidence-Test (HET) workflow that verifies propagation-consistent explanations on discrete symptom tokens derived from topology-enhanced states.
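As a rough illustration of what a propagation-consistency test could look like, the sketch below checks whether a candidate root cause can account for every anomalous service. It is not the paper's procedure: the function name `is_propagation_consistent`, the callee-to-caller propagation direction, and the onset-time rule are all assumptions made for the example.

```python
from collections import deque

def is_propagation_consistent(calls, onset, candidate):
    """Return True if `candidate` could have initiated every observed anomaly.

    calls:  dict mapping caller -> list of callees (service call graph).
    onset:  dict mapping anomalous service -> anomaly onset time.
    Assumption (not from the paper): failures travel from callee to caller,
    and a symptom cannot start before the anomaly that caused it.
    """
    if candidate not in onset:
        return False
    # Reverse the call graph so edges point in the propagation direction.
    callers_of = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers_of.setdefault(callee, []).append(caller)
    # Breadth-first walk through anomalous services only, respecting onset order.
    reached = {candidate}
    queue = deque([candidate])
    while queue:
        node = queue.popleft()
        for nxt in callers_of.get(node, []):
            if nxt in onset and nxt not in reached and onset[nxt] >= onset[node]:
                reached.add(nxt)
                queue.append(nxt)
    return reached == set(onset)

# Toy incident: the database degrades first, then its callers.
calls = {"frontend": ["cart", "search"], "cart": ["db"], "search": []}
onset = {"db": 10, "cart": 12, "frontend": 15}
```

In this toy incident `db` passes the check while `frontend`, the most visible victim, does not, which is the kind of separation the HET loop is described as enforcing.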

If this is right

  • Separates initiating anomalies from amplified downstream symptoms.
  • Enables reliable retrieval and token-level evidence grounding.
  • Maintains robustness under non-stationary topology drift through conservative test-time adaptation.
  • Couples graph representation learning directly with structured, verifiable reasoning.
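The conservative adaptation mentioned above is described in the abstract as relying on high-confidence pseudo-labels. A minimal sketch of that filtering step follows; the function name `select_pseudo_labels`, the score format, and the 0.9 threshold are hypothetical and not taken from the paper.

```python
def select_pseudo_labels(predictions, min_confidence=0.9):
    """Keep only incidents whose top root-cause score is confidently high.

    predictions: list of (incident_id, {service: probability}) pairs.
    min_confidence: hypothetical threshold; the paper's value is not given here.
    Returns a list of (incident_id, pseudo_label) pairs usable for adaptation.
    """
    selected = []
    for incident_id, scores in predictions:
        label, confidence = max(scores.items(), key=lambda kv: kv[1])
        if confidence >= min_confidence:
            selected.append((incident_id, label))
    return selected

# Toy predictions: one confident diagnosis, one ambiguous.
predictions = [
    ("inc-1", {"db": 0.95, "cart": 0.03, "frontend": 0.02}),
    ("inc-2", {"db": 0.50, "cart": 0.45, "frontend": 0.05}),
]
pseudo = select_pseudo_labels(predictions)
```

Only the confident incident survives the filter, so an ambiguous diagnosis never feeds back into the model. That is one plausible reading of "conservative" here.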

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-level grounding could support integration with existing incident dashboards for human review of automated diagnoses.
  • Propagation-consistency checks might transfer to root-cause tasks in other dynamic graphs such as sensor networks or distributed ledgers.
  • Hierarchical memory refresh could reduce the need for full retraining when new service versions are deployed.

Load-bearing premise

Metric-orthogonal multimodal alignment together with vector quantization will produce stable node representations that reduce redundancy and sparsity enough to support reliable topology-constrained reasoning under non-stationary drift.
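To make the tokenization half of this premise concrete, here is a minimal nearest-neighbour vector quantization sketch. The 2-D codebook, the lexicon labels, and the node states are invented for illustration; a real codebook would be learned, and the paper's lexicon entries are not shown on this page.

```python
def quantize(state, codebook):
    """Map a continuous node state to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(state, codebook[k]))

# Hypothetical codebook and symptom lexicon (assumptions, not from the paper).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
lexicon = {0: "nominal", 1: "latency_spike", 2: "error_burst"}

# Hypothetical topology-enhanced node states.
node_states = {"db": (0.9, 0.1), "cart": (0.1, 0.8), "search": (0.05, 0.02)}
tokens = {svc: lexicon[quantize(s, codebook)] for svc, s in node_states.items()}
```

The premise is then a stability question: if drift moves a node state across a codebook boundary without any real change in behaviour, the token flips and the downstream reasoning sees a different symptom.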

What would settle it

On a microservices benchmark containing documented cascading failures and controlled topology changes, measure whether the framework attributes the root cause to the known initiator rather than a downstream victim in the majority of cases after each drift event.
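The measurement described here amounts to a top-1 initiator accuracy grouped by drift event. A small sketch follows, assuming each benchmark case is a `(drift_event, predicted_root, true_initiator)` triple; the triple layout and the event names are hypothetical, and the cases are toy values, not results.

```python
from collections import defaultdict

def initiator_accuracy_by_drift(cases):
    """Fraction of cases per drift event where the top-1 prediction is the initiator.

    cases: iterable of (drift_event, predicted_root, true_initiator).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for drift_event, predicted, truth in cases:
        totals[drift_event] += 1
        hits[drift_event] += int(predicted == truth)
    return {event: hits[event] / totals[event] for event in totals}

# Toy cases only; these are not numbers from the paper.
cases = [
    ("autoscale-1", "db", "db"),
    ("autoscale-1", "frontend", "db"),
    ("rolling-update-1", "cart", "cart"),
    ("rolling-update-1", "cart", "cart"),
]
accuracy = initiator_accuracy_by_drift(cases)
majority_ok = all(a > 0.5 for a in accuracy.values())
```

Requiring a strict majority after every drift event, rather than on the pooled cases, keeps a strong result on one drift type from masking a failure on another.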

Figures

Figures reproduced from arXiv: 2605.15611 by Junle Wang, Wenjun Wu, Xingchuang Liao.

Figure 1: Illustration of symptom-amplification bias.
Figure 2: The overview of TopoEvo.
  … dedicated to log-consistent and trace-consistent alignment, respectively, thereby preserving information while reducing redundancy in downstream graph reasoning. Given metric embeddings $x_v^{\mathrm{metric}}$, two components are produced as $u_v = W_u x_v^{\mathrm{metric}}$, $v_v = W_v x_v^{\mathrm{metric}}$ (3), where $u_v$ is intended to capture metric factors most consistent with log evidence, while $v_v$ captures compleme…
Figure 3: Illustration of vector quantization and symptom vocabulary construc…
Figure 4: Main prompt of Hypothesis Planner.
  … provide compact, consistent context for these agents, while topology saliency focuses attention on propagation-relevant regions. 1) Shared context: Candidate-centric subgraph and tokenized evidence. A candidate-centric subgraph is constructed to keep the workflow focused. TopoEvo selects the top-$K$ candidate nodes according to the root-cause score $s(v)$ and extracts an H-h…
Figure 5: Parameter sensitivity analysis. K denotes the VQ partition (codebook…
Figure 6: Parameter sensitivity analysis. K denotes the VQ partition (codebook…
Figure 7: In case study experiments, we injected a…
Original abstract

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
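The MOMA decomposition can be pictured with the two linear projections quoted in the Figure 2 excerpt, u_v = W_u x_v and v_v = W_v x_v. The sketch below applies them to one metric embedding and adds an orthogonality penalty between the two projections. The penalty form (squared Frobenius norm of W_u W_v^T), the matrices, and the embedding values are assumptions for illustration; the page does not show how the paper enforces orthogonality.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def orthogonality_penalty(W_u, W_v):
    """Squared Frobenius norm of W_u W_v^T; zero when the row spaces are orthogonal."""
    return sum(
        sum(a * b for a, b in zip(row_u, row_v)) ** 2
        for row_u in W_u
        for row_v in W_v
    )

# Hypothetical projections for a 4-D metric embedding (not learned, not from the paper).
W_u = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_v = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
x_metric = [0.5, -1.0, 2.0, 0.25]

u = matvec(W_u, x_metric)  # component meant to align with log evidence
v = matvec(W_v, x_metric)  # complementary component
penalty = orthogonality_penalty(W_u, W_v)
```

With these hand-picked matrices the penalty is exactly zero. In training, a term like this would typically be added to the contrastive alignment loss so that the two components do not encode the same metric factors.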

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes TopoEvo, a topology-aware self-evolving multi-agent framework for root cause analysis (RCA) in microservices. It introduces Metric-orthogonal Multimodal Alignment (MOMA) to decompose and align multimodal observability data into stable node representations, Vector Quantization (VQ) to discretize states into auditable symptom tokens with a lexicon, a multi-agent Hypothesis-Evidence-Test (HET) workflow to verify propagation-consistent explanations and isolate initiating anomalies, and a self-evolving mechanism for hierarchical memory refresh and test-time adaptation under topology drift induced by autoscaling.

Significance. If the reported ablation gains hold, the work could meaningfully advance LLM-based RCA by reducing symptom-amplification bias through explicit topology constraints and discrete token grounding. The coupling of graph representation learning with structured HET reasoning, together with the self-evolving adaptation, addresses practical challenges in non-stationary microservice environments. The empirical results tying F1 improvements to the proposed components (MOMA, VQ, and topology in HET) under simulated drift constitute a concrete strength.

major comments (1)
  1. [§5.1] §5.1, ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.
minor comments (2)
  1. [§4.2] The symptom lexicon construction in the VQ section would benefit from an explicit example of token-to-evidence mapping to clarify how retrieval supports the Test phase of HET.
  2. Figure 3 caption and axis labels could more clearly distinguish the contribution of topology constraints versus the self-evolving memory update.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of TopoEvo's contributions and for the recommendation of minor revision. The single major comment is addressed point-by-point below. We agree that additional granularity in the drift ablation will strengthen the manuscript and will incorporate the requested changes.

Point-by-point responses
  1. Referee: [§5.1] §5.1, ablation on drift scenarios: the claim that MOMA plus VQ yields stable representations sufficient for reliable HET reasoning under non-stationary drift is load-bearing, yet the manuscript reports only aggregate F1 gains without per-drift-type breakdowns or statistical significance tests across the simulated rolling-update and autoscaling cases; this weakens the link between the weakest assumption and the central separation of root causes from downstream symptoms.

    Authors: We agree that the current presentation of aggregate F1 scores across all drift scenarios limits the ability to isolate the contribution of MOMA+VQ stability to HET performance under each drift mechanism. In the revised manuscript we will add per-drift-type breakdowns (rolling-update vs. autoscaling) for the key ablation configurations (full TopoEvo, w/o MOMA, w/o VQ, w/o topology in HET). We will also report statistical significance (paired t-tests with Bonferroni correction, or Wilcoxon signed-rank tests where normality assumptions are violated) on the F1 deltas between configurations for each drift type. These additions will be placed in an expanded §5.1 and the corresponding appendix table, directly strengthening the empirical link between the proposed components and the separation of root causes from cascading symptoms. revision: yes
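A sketch of the kind of significance analysis the simulated rebuttal promises, using only the standard library: an exact two-sided Wilcoxon signed-rank test on paired F1 deltas, followed by a Bonferroni correction across comparisons. Pairing these two is an illustrative choice (the rebuttal lists Wilcoxon as the fallback when normality fails), and the deltas are made-up values, not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank_exact(deltas):
    """Two-sided exact Wilcoxon signed-rank p-value for paired differences.

    Zero differences are dropped; tied magnitudes receive average ranks.
    Enumerates all 2^n sign patterns, so it is only practical for small n.
    """
    d = [x for x in deltas if x != 0]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Illustrative F1 deltas (full model minus ablation) over six hypothetical runs.
deltas_rolling_update = [0.05, 0.03, 0.04, 0.06, 0.02, 0.07]
p_raw = wilcoxon_signed_rank_exact(deltas_rolling_update)
p_adjusted = bonferroni([p_raw, 0.5])
```

With six runs that all favour the full model, the smallest attainable two-sided p-value is 2/64, so even a clean sweep does not survive a two-way Bonferroni correction at the 0.05 level. Per-drift-type breakdowns therefore need enough repetitions to have any power.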

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript describes a multi-agent RCA framework at the level of architectural components (MOMA for orthogonal alignment, VQ for symptom tokenization, HET workflow, and self-evolving memory) without presenting equations, derivations, or fitted-parameter predictions that could reduce to inputs by construction. Ablation results and empirical F1 gains are tied to explicit component implementations rather than asserted via self-citation chains or self-definitional loops. The central claim remains externally falsifiable through topology-constrained reasoning benchmarks and does not rely on load-bearing self-references or renamed empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on domain assumptions about microservice observability and failure propagation; it introduces new conceptual modules without independent empirical grounding in the abstract.

axioms (1)
  • domain assumption: Microservice systems exhibit noisy and heterogeneous multimodal observability, cascading failure propagation, and non-stationary topology drift.
    Explicitly listed as the three core challenges in the abstract.
invented entities (1)
  • symptom tokens (no independent evidence)
    purpose: Discrete auditable units obtained via vector quantization for reliable retrieval and token-level evidence grounding.
    Introduced as output of the VQ step on topology-enhanced states.
