Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought
Pith reviewed 2026-06-30 20:02 UTC · model grok-4.3
The pith
RCLAgent decomposes trace graphs with dedicated agents running recursion-of-thought in parallel to localize microservice root causes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RCLAgent realizes multi-agent recursion-of-thought with parallel reasoning. It decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph.
What carries the argument
Multi-agent recursion-of-thought, which decomposes diagnostics by assigning dedicated agents to individual trace spans and coordinates them recursively and in parallel to produce a synthesized diagnosis.
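The organization described above can be illustrated with a minimal sketch, assuming a tree-shaped trace. The `Span` type, the `diagnose_span` stub (a stand-in for an LLM call), the toy blame rule, and the edge-list "evidence graph" are all hypothetical names introduced here for illustration; the paper's actual prompts, report format, and synthesis step are not visible in this summary.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical trace span: one node of the call tree."""
    name: str
    error: bool = False
    children: list["Span"] = field(default_factory=list)


async def diagnose_span(span: Span, child_reports: list[dict]) -> dict:
    """Stand-in for a Dedicated Agent's model call (an assumption, not the paper's logic).

    The agent sees only its own span plus the reports of its direct children.
    """
    await asyncio.sleep(0)  # placeholder for model latency
    # Toy rule: defer to the first child that names a suspect;
    # otherwise blame this span if it errored.
    suspects = [r["suspect"] for r in child_reports if r["suspect"]]
    suspect = suspects[0] if suspects else (span.name if span.error else None)
    return {"span": span.name, "suspect": suspect}


async def run_agent(span: Span, evidence: list[tuple[str, str]]) -> dict:
    """Run all child agents concurrently, then diagnose this span."""
    child_reports = await asyncio.gather(
        *(run_agent(child, evidence) for child in span.children)
    )
    for child in span.children:
        evidence.append((span.name, child.name))  # edge in a toy evidence graph
    return await diagnose_span(span, list(child_reports))


def localize(root: Span) -> tuple[dict, list[tuple[str, str]]]:
    """Return the root-level report and the collected evidence edges."""
    evidence: list[tuple[str, str]] = []
    report = asyncio.run(run_agent(root, evidence))
    return report, evidence


trace = Span("gateway", True, [Span("cart", True, [Span("db", True)]), Span("auth")])
report, evidence = localize(trace)
print(report)  # {'span': 'gateway', 'suspect': 'db'}
```

Each agent's input is bounded by its own span and its children's reports, which is the property the summary credits with limiting context growth; whether that holds for real traces is an empirical question.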
If this is right
- Localization accuracy improves because each agent focuses on a limited local context while still contributing to a global evidence graph.
- Inference time decreases through parallel execution of agents rather than serial chain-of-thought steps.
- Interpretability increases because the final report is assembled from per-span diagnoses that follow the actual call graph.
- Transferability across deployments rises because the agent organization is driven by runtime trace topology rather than fixed training data.
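The inference-time point can be made concrete with an idealized latency model, sketched below under strong assumptions: every agent call has a known cost, sibling agents run fully concurrently, and the synthesis step, rate limits, and coordination overhead are ignored. The tree layout and the per-span costs are invented for illustration and are not taken from the paper.

```python
def serial_latency(tree: dict, cost: dict) -> float:
    """Total time if every span is analysed one after another."""
    children = tree.get("children", [])
    return cost[tree["name"]] + sum(serial_latency(c, cost) for c in children)


def parallel_latency(tree: dict, cost: dict) -> float:
    """Time when siblings run concurrently: own cost plus the slowest child subtree."""
    children = tree.get("children", [])
    slowest = max((parallel_latency(c, cost) for c in children), default=0.0)
    return cost[tree["name"]] + slowest


# Hypothetical trace: gateway -> {cart -> db, auth}, 2 seconds per agent call.
tree = {
    "name": "gateway",
    "children": [
        {"name": "cart", "children": [{"name": "db"}]},
        {"name": "auth"},
    ],
}
cost = {"gateway": 2.0, "cart": 2.0, "db": 2.0, "auth": 2.0}

print(serial_latency(tree, cost))    # 8.0
print(parallel_latency(tree, cost))  # 6.0
```

Under this model the parallel time tracks the depth of the trace rather than its size, so wide, shallow traces gain the most and a single long call chain gains nothing.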
Where Pith is reading between the lines
- The same recursive-agent pattern could be tested on other graph-structured diagnostic tasks such as network fault isolation or distributed database debugging.
- If the synthesis step that merges the Root-Level Diagnosis Report and Global Evidence Graph proves sensitive to agent count, then scaling behavior on very large traces would need separate validation.
- The approach implicitly suggests that human-inspired decomposition may reduce the token budget required for LLM-based diagnosis, an efficiency dimension not directly measured in the reported experiments.
Load-bearing premise
Decomposing the diagnostic process along the trace graph with dedicated agents organized recursively and in parallel will prevent context explosion and enable deeper causal exploration.
What would settle it
On the same public benchmarks, if RCLAgent shows no gain in localization accuracy or inference speed relative to the strongest prior methods, the central performance claim would be falsified.
Abstract
As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RCLAgent, a multi-agent recursion-of-thought framework for root cause localization (RCL) in microservice systems. Motivated by an empirical study of human SRE diagnostic practices and the shortcomings of prior LLM-based methods (context explosion and serial reasoning), the approach assigns dedicated agents to individual spans in the trace graph, organizes them recursively and in parallel according to graph topology, and produces a final diagnosis via synthesis of a Root-Level Diagnosis Report and Global Evidence Graph. The central empirical claim is that RCLAgent consistently outperforms state-of-the-art methods on public benchmarks in both localization accuracy and inference efficiency.
Significance. If the reported gains are reproducible and the design choices are shown to be load-bearing, the work would offer a concrete, practice-motivated advance over existing LLM-based RCL techniques. The multi-agent decomposition along the trace graph directly targets two well-recognized failure modes (context dilution and shallow causal chains) and could improve both accuracy and latency in production diagnosis pipelines.
major comments (2)
- [Abstract / §4] Abstract and §4 (Experiments): the claim that RCLAgent 'consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency' is the central result, yet the visible text provides neither the concrete metrics (e.g., F1, precision@K, latency), the exact baselines, the number of benchmarks, nor any ablation or statistical test. Without these data the support for the claim cannot be evaluated.
- [§3] §3 (Method): the assumption that recursive parallel agent organization along the trace graph will mitigate context explosion is presented as following directly from the SRE study, but no quantitative evidence (e.g., context-length measurements before/after decomposition or agent-interaction overhead) is supplied to show that the decomposition actually reduces effective context size or improves causal depth.
minor comments (2)
- [Abstract] The abstract refers to 'multiple public benchmarks' without naming them; listing the datasets (and citing their sources) would allow readers to assess generalizability.
- [§3] Notation for the Global Evidence Graph and Root-Level Diagnosis Report is introduced without a formal definition or pseudocode; a small diagram or algorithm box would clarify the synthesis step.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below.
Referee: [Abstract / §4] Abstract and §4 (Experiments): the claim that RCLAgent 'consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency' is the central result, yet the visible text provides neither the concrete metrics (e.g., F1, precision@K, latency), the exact baselines, the number of benchmarks, nor any ablation or statistical test. Without these data the support for the claim cannot be evaluated.
Authors: The detailed results, including F1, precision@K, latency, exact baselines, benchmark counts, ablations, and statistical tests, are reported in the tables and analysis of §4. To make the central claim immediately evaluable from the abstract and §4 summary, we will revise both to explicitly list the key metrics, baselines, and benchmark details while retaining the full supporting data in the section.
Revision: yes
Referee: [§3] §3 (Method): the assumption that recursive parallel agent organization along the trace graph will mitigate context explosion is presented as following directly from the SRE study, but no quantitative evidence (e.g., context-length measurements before/after decomposition or agent-interaction overhead) is supplied to show that the decomposition actually reduces effective context size or improves causal depth.
Authors: The SRE study supplies the qualitative motivation for targeting context explosion and serial reasoning via graph-aligned decomposition. End-to-end gains in accuracy and efficiency are shown quantitatively in §4. We agree that direct measurements (e.g., token counts pre/post-decomposition and interaction overhead) would strengthen the mechanistic claim and will add this analysis to a revised §3.
Revision: yes
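The kind of pre/post-decomposition measurement discussed in this exchange could look like the sketch below. It is only an illustration: `count_tokens` is a crude whitespace proxy rather than a real tokenizer, the span texts are invented, and the fixed `summary_tokens` budget per child report is an assumption about how agents exchange results.

```python
def count_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def monolithic_context(spans: dict) -> int:
    """Context size if one prompt carries the text of every span."""
    return sum(count_tokens(text) for text in spans.values())


def per_agent_contexts(spans: dict, children: dict, summary_tokens: int) -> dict:
    """Each agent reads its own span text plus a fixed-size summary per child."""
    return {
        name: count_tokens(text) + summary_tokens * len(children.get(name, []))
        for name, text in spans.items()
    }


# Hypothetical span texts and call structure.
spans = {
    "gateway": "GET /checkout 500 after 2100 ms",
    "cart": "call db timeout after 2000 ms",
    "db": "connection pool exhausted",
    "auth": "token ok 12 ms",
}
children = {"gateway": ["cart", "auth"], "cart": ["db"]}

contexts = per_agent_contexts(spans, children, summary_tokens=2)
print(monolithic_context(spans))  # 19
print(max(contexts.values()))     # 10
print(sum(contexts.values()))     # 25
```

In this toy case the largest single-agent context is smaller than the monolithic one, while the total across agents is larger. That gap is the interaction overhead the referee asks to see reported alongside any context-size reduction.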
Circularity Check
No significant circularity detected
The paper motivates RCLAgent from an independent study of SRE practices and limitations of prior LLM-based methods, then presents an empirical evaluation on public benchmarks showing outperformance. No equations, fitted parameters renamed as predictions, self-citation chains, or self-definitional reductions appear in the abstract or described structure. The central claim rests on benchmark results rather than any derivation that reduces to its own inputs by construction. This is the normal case of a self-contained empirical proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the trace graph topology provides an effective structure for organizing agents recursively and in parallel to mitigate context explosion and support deep causal exploration.
Reference graph
Works this paper leans on
-
[1]
Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study,
X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, W. Li, and D. Ding, “Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study,”IEEE Transactions on Software Engineering, vol. 47, no. 2, pp. 243–260, 2018
2018
-
[2]
A survey of aiops in the era of large language models,
L. Zhang, T. Jia, M. Jia, Y . Wu, A. Liu, Y . Yang, Z. Wu, X. Hu, P. Yu, and Y . Li, “A survey of aiops in the era of large language models,” ACM Computing Surveys, 2025
2025
-
[3]
Developing self-adaptive microservice systems: Challenges and directions,
N. C. Mendonc ¸a, P. Jamshidi, D. Garlan, and C. Pahl, “Developing self-adaptive microservice systems: Challenges and directions,”IEEE Software, vol. 38, no. 2, pp. 70–79, 2019
2019
-
[4]
Design, monitoring, and testing of microservices systems: The prac- titioners’ perspective,
M. Waseem, P. Liang, M. Shahin, A. Di Salle, and G. M ´arquez, “Design, monitoring, and testing of microservices systems: The prac- titioners’ perspective,”Journal of Systems and Software, vol. 182, p. 111061, 2021
2021
-
[5]
Towards close-to-zero runtime collection overhead: Raft-based anomaly diagno- sis on system faults for distributed storage system,
L. Zhang, T. Jia, M. Jia, H. Liu, Y . Yang, Z. Wu, and Y . Li, “Towards close-to-zero runtime collection overhead: Raft-based anomaly diagno- sis on system faults for distributed storage system,”IEEE Transactions on Services Computing, 2024
2024
-
[6]
Time-tired compaction: An elastic compaction scheme for lsm-tree based time-series database,
L.-Z. Zhang, X.-D. Huang, Y .-K. Wang, J.-L. Qiao, S.-X. Song, and J.-M. Wang, “Time-tired compaction: An elastic compaction scheme for lsm-tree based time-series database,”Advanced Engineering Infor- matics, vol. 59, p. 102224, 2024
2024
-
[7]
Separation or not: On handing out-of-order time-series data in leveled lsm-tree,
Y . Kang, X. Huang, S. Song, L. Zhang, J. Qiao, C. Wang, J. Wang, and J. Feinauer, “Separation or not: On handing out-of-order time-series data in leveled lsm-tree,” in2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2022, pp. 3340–3352
2022
-
[8]
Multivariate log-based anomaly detection for distributed database,
L. Zhang, T. Jia, M. Jia, Y . Li, Y . Yang, and Z. Wu, “Multivariate log-based anomaly detection for distributed database,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 4256–4267
2024
-
[9]
Reducing events to augment log-based anomaly detection models: An empirical study,
L. Zhang, T. Jia, K. Wang, M. Jia, Y . Yang, and Y . Li, “Reducing events to augment log-based anomaly detection models: An empirical study,” inProceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2024, pp. 538– 548
2024
-
[10]
Inter- dependent causal networks for root cause localization,
D. Wang, Z. Chen, J. Ni, L. Tong, Z. Wang, Y . Fu, and H. Chen, “Inter- dependent causal networks for root cause localization,” inProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 5051–5060. TOW ARDS IN-DEPTH ROOT CAUSE LOCALIZATION FOR MICROSERVICES WITH MULTI-AGENT RECURSION-OF-THOUGHT 16
2023
-
[11]
Failure diagnosis in microservice systems: A comprehensive survey and analysis,
S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y . Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,”ACM Transactions on Software Engineering and Methodology, 2024
2024
-
[12]
A survey on intelligent management of alerts and incidents in it services,
Q. Yu, N. Zhao, M. Li, Z. Li, H. Wang, W. Zhang, K. Sui, and D. Pei, “A survey on intelligent management of alerts and incidents in it services,”Journal of Network and Computer Applications, p. 103842, 2024
2024
-
[13]
Interpretable failure localization for microservice systems based on graph autoencoder,
Y . Sun, Z. Lin, B. Shi, S. Zhang, S. Ma, P. Jin, Z. Zhong, L. Pan, Y . Guo, and D. Pei, “Interpretable failure localization for microservice systems based on graph autoencoder,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025
2025
-
[14]
Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,
Z. Zhu, C. Lee, X. Tang, and P. He, “Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–25, 2024
2024
-
[15]
Microservice root cause analysis with limited observability through intervention recognition in the latent space,
Z. Xie, S. Zhang, Y . Geng, Y . Zhang, M. Ma, X. Nie, Z. Yao, L. Xu, Y . Sun, W. Liet al., “Microservice root cause analysis with limited observability through intervention recognition in the latent space,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6049–6060
2024
-
[16]
Kgroot: A knowledge graph-enhanced method for root cause analysis,
T. Wang, G. Qi, and T. Wu, “Kgroot: A knowledge graph-enhanced method for root cause analysis,”Expert Systems with Applications, vol. 255, p. 124679, 2024
2024
-
[17]
Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data,
L. Pham, H. Zhang, H. Ha, F. Salim, and X. Zhang, “Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 777–780
2025
-
[18]
Lemma-rca: A large multi-modal multi-domain dataset for root cause analysis,
L. Zheng, Z. Chen, D. Wang, C. Deng, R. Matsuoka, and H. Chen, “Lemma-rca: A large multi-modal multi-domain dataset for root cause analysis,”arXiv preprint arXiv:2406.05375, 2024
-
[19]
Microscope: Pinpoint performance issues with causal graphs in micro-service environments,
J. Lin, P. Chen, and Z. Zheng, “Microscope: Pinpoint performance issues with causal graphs in micro-service environments,” inService- Oriented Computing: 16th International Conference, ICSOC 2018, Hangzhou, China, November 12-15, 2018, Proceedings 16. Springer, 2018, pp. 3–20
2018
-
[20]
Causal inference-based root cause analysis for online service systems with intervention recognition,
M. Li, Z. Li, K. Yin, X. Nie, W. Zhang, K. Sui, and D. Pei, “Causal inference-based root cause analysis for online service systems with intervention recognition,” inProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3230–3240
2022
-
[21]
Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments,
G. Yu, P. Chen, H. Chen, Z. Guan, Z. Huang, L. Jing, T. Weng, X. Sun, and X. Li, “Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments,” in Proceedings of the Web Conference 2021, 2021, pp. 3087–3098
2021
-
[22]
Tracerank: Abnormal service local- ization with dis-aggregated end-to-end tracing data in cloud native systems,
G. Yu, Z. Huang, and P. Chen, “Tracerank: Abnormal service local- ization with dis-aggregated end-to-end tracing data in cloud native systems,”Journal of Software: Evolution and Process, vol. 35, no. 10, p. e2413, 2023
2023
-
[23]
{CRISP}: Critical path analysis of{Large-Scale} microservice architectures,
Z. Zhang, M. K. Ramanathan, P. Raj, A. Parwal, T. Sherwood, and M. Chabbi, “{CRISP}: Critical path analysis of{Large-Scale} microservice architectures,” in2022 USENIX Annual Technical Con- ference (USENIX ATC 22), 2022, pp. 655–672
2022
-
[24]
Trace-based multi-dimensional root cause localization of performance issues in microservice systems,
C. Zhang, Z. Dong, X. Peng, B. Zhang, and M. Chen, “Trace-based multi-dimensional root cause localization of performance issues in microservice systems,” inProceedings of the IEEE/ACM 46th Inter- national Conference on Software Engineering, 2024, pp. 1–12
2024
-
[25]
Root cause analysis in microservice using neural granger causal discovery,
C.-M. Lin, C. Chang, W.-Y . Wang, K.-D. Wang, and W.-C. Peng, “Root cause analysis in microservice using neural granger causal discovery,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 206–213
2024
-
[26]
Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,
G. Yu, P. Chen, Y . Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 553–565
2023
-
[27]
Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices,
Y . Gan, Y . Zhang, K. Hu, D. Cheng, Y . He, M. Pancholi, and C. Delimitrou, “Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices,” inProceedings of the twenty-fourth international conference on architectural support for programming languages and operating systems, 2019, pp. 19–33
2019
-
[28]
Micromilts: Fault location for microservices based mutual information and lstm autoencoder,
L. Yang, J. Li, K. Shi, S. Yang, Q. Yang, and J. Sun, “Micromilts: Fault location for microservices based mutual information and lstm autoencoder,” in2022 23rd Asia-Pacific Network Operations and Management Symposium (APNOMS). IEEE, 2022, pp. 1–6
2022
-
[29]
Modelcoder: A fault model based automatic root cause localization framework for microservice systems,
Y . Cai, B. Han, J. Li, N. Zhao, and J. Su, “Modelcoder: A fault model based automatic root cause localization framework for microservice systems,” in2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 2021, pp. 1–6
2021
-
[30]
Tracediag: Adaptive, interpretable, and efficient root cause analysis on large-scale microservice systems,
R. Ding, C. Zhang, L. Wang, Y . Xu, M. Ma, X. Wu, M. Zhang, Q. Chen, X. Gao, X. Gaoet al., “Tracediag: Adaptive, interpretable, and efficient root cause analysis on large-scale microservice systems,” inProceed- ings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1762–1773
2023
-
[31]
Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback,
L. Wang, C. Zhang, R. Ding, Y . Xu, Q. Chen, W. Zou, Q. Chen, M. Zhang, X. Gao, H. Fanet al., “Root cause analysis for microservice systems via hierarchical reinforcement learning from human feedback,” inProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 5116–5125
2023
-
[32]
Trace-based intel- ligent fault diagnosis for microservices with deep learning,
H. Chen, K. Wei, A. Li, T. Wang, and W. Zhang, “Trace-based intel- ligent fault diagnosis for microservices with deep learning,” in2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2021, pp. 884–893
2021
-
[33]
Grace: Interpretable root cause analysis by graph convolutional network for microservices,
R. Ren, Y . Wang, F. Liu, Z. Li, G. Tyson, T. Miao, and G. Xie, “Grace: Interpretable root cause analysis by graph convolutional network for microservices,” in2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS). IEEE, 2023, pp. 1–4
2023
-
[34]
Actionable and interpretable fault localization for recurring failures in online service systems,
Z. Li, N. Zhao, M. Li, X. Lu, L. Wang, D. Chang, X. Nie, L. Cao, W. Zhang, K. Suiet al., “Actionable and interpretable fault localization for recurring failures in online service systems,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 996–1008
2022
-
[35]
Eadro: An end- to-end troubleshooting framework for microservices on multi-source data,
C. Lee, T. Yang, Z. Chen, Y . Su, and M. R. Lyu, “Eadro: An end- to-end troubleshooting framework for microservices on multi-source data,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1750–1762
2023
-
[36]
mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,
W. Zhang, H. Guo, J. Yang, Z. Tian, Y . Zhang, Y . Chaoran, Z. Li, T. Li, X. Shi, L. Zhenget al., “mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 4017–4033
2024
-
[37]
Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,
Z. Wang, Z. Liu, Y . Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, “Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,” inProceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 4966–4974
2024
-
[38]
Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,
C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Liet al., “Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431
2025
-
[39]
Coca: Generative root cause analysis for distributed systems with code knowledge,
Y . Li, Y . Wu, J. Liu, Z. Jiang, Z. Chen, G. Yu, and M. R. Lyu, “Coca: Generative root cause analysis for distributed systems with code knowledge,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 2025, pp. 1346–1358
2025
-
[40]
Q. Wang, X. Zhang, M. Li, Y . Yuan, M. Xiao, F. Zhuang, and D. Yu, “Tamo: Fine-grained root cause analysis via tool-assisted llm agent with multi-modality observation data in cloud-native systems,”arXiv preprint arXiv:2504.20462, 2025
-
[41]
The multi-agent fault localization system based on monte carlo tree search approach,
R. Ren, “The multi-agent fault localization system based on monte carlo tree search approach,”arXiv preprint arXiv:2507.22800, 2025
-
[42]
Exploring llm-based agents for root cause analysis,
D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring llm-based agents for root cause analysis,” arXiv preprint arXiv:2403.04123, 2024
-
[43]
H. Shi, L. Cheng, W. Wu, Y . Wang, X. Liu, S. Nie, W. Wang, X. Min, C. Men, and Y . Lin, “Enhancing cluster resilience: Llm-agent based autonomous intelligent cluster diagnosis system and evaluation framework,”arXiv preprint arXiv:2411.05349, 2024
-
[44]
Z. Xie, Y . Zheng, L. Ottens, K. Zhang, C. Kozyrakis, and J. Mace, “Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight,”arXiv preprint arXiv:2407.08694, 2024
-
[45]
The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,
Y . Han, Q. Du, Y . Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943
2024
-
[46]
L. Zhang, Y . Zhai, T. Jia, C. Duan, S. Yu, J. Gao, B. Ding, Z. Wu, and Y . Li, “Thinkfl: Self-refining failure localization for microservice sys- tems via reinforcement fine-tuning,”arXiv preprint arXiv:2504.18776, 2025. TOW ARDS IN-DEPTH ROOT CAUSE LOCALIZATION FOR MICROSERVICES WITH MULTI-AGENT RECURSION-OF-THOUGHT 17
-
[47]
Scalalog: Scalable log-based failure diagnosis using llm,
L. Zhang, T. Jia, M. Jia, Y . Wu, H. Liu, and Y . Li, “Scalalog: Scalable log-based failure diagnosis using llm,” inICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
2025
-
[48]
Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,
L. Zhang, Y . Zhai, T. Jia, X. Huang, C. Duan, and Y . Li, “Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,”arXiv preprint arXiv:2504.06614, 2025
-
[49]
Automated root causing of cloud incidents using in-context learning with gpt-4,
X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y . Kang, and S. Ra- jmohan, “Automated root causing of cloud incidents using in-context learning with gpt-4,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 266–277
2024
-
[50]
Face it yourselves: An llm-based two-stage strategy to localize configuration errors via logs,
S. Shan, Y . Huo, Y . Su, Y . Li, D. Li, and Z. Zheng, “Face it yourselves: An llm-based two-stage strategy to localize configuration errors via logs,”arXiv preprint arXiv:2404.00640, 2024
-
[51]
Openrca: Can large language models locate the root cause of software failures?
J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang, “Openrca: Can large language models locate the root cause of software failures?” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[52]
Lever- aging large language models for the auto-remediation of microservice applications: An experimental study,
K. Sarda, Z. Namrud, M. Litoiu, L. Shwartz, and I. Watts, “Lever- aging large language models for the auto-remediation of microservice applications: An experimental study,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 358–369
2024
-
[53]
AIOPS 2022 Championship,
“AIOPS 2022 Championship,” https://competition.aiops.cn/, 2022
2022
-
[54]
Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,
J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,”ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–39, 2022
2022
-
[55]
Characterizing job microarchitectural profiles at scale: Dataset and analysis,
K. Wang, Y . Li, C. Wang, T. Jia, K. Chow, Y . Wen, Y . Dou, G. Xu, C. Hou, J. Yaoet al., “Characterizing job microarchitectural profiles at scale: Dataset and analysis,” inProceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11
2022
-
[56]
Capturing request execution path for understanding service behavior and detecting anomalies without code instrumentation,
Y . Yang, L. Wang, J. Gu, and Y . Li, “Capturing request execution path for understanding service behavior and detecting anomalies without code instrumentation,”IEEE Transactions on Services Computing, vol. 16, no. 2, pp. 996–1010, 2022
2022
-
[57]
Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code,
J. Shen, H. Zhang, Y . Xiang, X. Shi, X. Li, Y . Shen, Z. Zhang, Y . Wu, X. Yin, J. Wanget al., “Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code,” in Proceedings of the ACM SIGCOMM 2023 Conference, 2023, pp. 420– 437
2023
-
[58]
Agentic memory enhanced recursive reasoning for root cause localization in microservices,
L. Zhang, T. Jia, Y . Zhai, L. Pan, C. Duan, M. He, M. Jia, and Y . Li, “Agentic memory enhanced recursive reasoning for root cause localization in microservices,”arXiv preprint arXiv:2601.02732, 2026
-
[59]
Simplifying root cause analysis in kubernetes with stategraph and llm,
Y . Xiang, C. P. Chen, L. Zeng, W. Yin, X. Liu, H. Li, and W. Xu, “Simplifying root cause analysis in kubernetes with stategraph and llm,”arXiv preprint arXiv:2506.02490, 2025
-
[60]
Gala: Can graph-augmented large language model agentic workflows elevate root cause analysis?
Y . Tian, Y . Liu, Z. Chong, Z. Huang, and H.-A. Jacobsen, “Gala: Can graph-augmented large language model agentic workflows elevate root cause analysis?”arXiv preprint arXiv:2508.12472, 2025
-
[61]
Tvdiag: A task-oriented and view-invariant failure diagnosis frame- work for microservice-based systems with multimodal data,
S. Xie, J. Wang, H. He, Z. Wang, Y . Zhao, N. Zhang, and B. Li, “Tvdiag: A task-oriented and view-invariant failure diagnosis frame- work for microservice-based systems with multimodal data,”ACM Transactions on Software Engineering and Methodology, 2025
2025
-
[62]
Causalrca: Causal inference based pre- cise fine-grained root cause localization for microservice applications,
R. Xin, P. Chen, and Z. Zhao, “Causalrca: Causal inference based pre- cise fine-grained root cause localization for microservice applications,” Journal of Systems and Software, vol. 203, p. 111724, 2023
2023
-
[63]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022
2022
-
[64]
Evaluating Large Language Models Trained on Code
M. Chen, “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[65]
Enjoy your observability: an industrial survey of microservice tracing and analysis,
B. Li, X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu, “Enjoy your observability: an industrial survey of microservice tracing and analysis,”Empirical Software Engineering, vol. 27, pp. 1–28, 2022
2022
-
[66]
Characterizing microservice dependency and performance: Alibaba trace analysis,
S. Luo, H. Xu, C. Lu, K. Ye, G. Xu, L. Zhang, Y . Ding, J. He, and C. Xu, “Characterizing microservice dependency and performance: Alibaba trace analysis,” inProceedings of the ACM symposium on cloud computing, 2021, pp. 412–426
2021
-
[67]
H. Liu, Y. Ma, X. Huang, L. Zhang, T. Jia, and Y. Li, “Ora: Job runtime prediction for high-performance computing platforms using the online retrieval-augmented language model,” in Proceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 884–894
-
[68]
M. He, T. Jia, C. Duan, P. Xiao, L. Zhang, K. Wang, Y. Wu, Y. Li, and G. Huang, “Walk the talk: Is your log-based software reliability maintenance system really reliable?” arXiv preprint arXiv:2509.24352, 2025
-
[69]
M. He, C. Duan, P. Xiao, T. Jia, S. Yu, L. Zhang, W. Hong, J. Han, Y. Wu, Y. Li et al., “United we stand: Towards end-to-end log-based fault diagnosis via interactive multi-task learning,” arXiv preprint arXiv:2509.24364, 2025
-
[70]
X. Huang, H. Liu, Y. Wu, L. Zhang, T. Jia, Y. Li, and Z. Wu, “Uda-rcl: Unsupervised domain adaptation for microservice root cause localization utilizing multimodal data,” IEEE Transactions on Services Computing, 2025
-
[71]
H. Liu, X. Huang, M. Jia, L. Zhang, T. Jia, Z. Wu, and Y. Li, “Aaad: Asynchronous inter-variable relationship-aware anomaly detection for multivariate time series,” in 2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2025, pp. 1–6
-
[72]
C. Duan, M. He, P. Xiao, T. Jia, X. Zhang, Z. Zhong, X. Luo, Y. Niu, L. Zhang, S. Yu et al., “Logaction: Consistent cross-system anomaly detection through logs via active domain adaptation,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 700–712
-
[73]
P. Xiao, C. Duan, M. He, T. Jia, Y. Wu, J. Xu, G. Gao, L. Zhang, W. Hong, Y. Li et al., “Coorlog: Efficient-generalizable log anomaly detection via adaptive coordinator in software evolution,” in 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2025, pp. 1119–1131
-
[74]
X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He, “Latent error prediction and fault localization for microservice applications by learning from system trace logs,” in Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2019, pp. 683–694
-
[75]
P. Liu, H. Xu, Q. Ouyang, R. Jiao, Z. Chen, S. Zhang, J. Yang, L. Mo, J. Zeng, W. Xue et al., “Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks,” in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2020, pp. 48–58
-
[76]
Y. Gan, M. Liang, S. Dev, D. Lo, and C. Delimitrou, “Sage: practical and scalable ml-driven performance debugging in microservices,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021, pp. 135–151
-
[77]
Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y. Wu, L. Jiang, L. Yan, Z. Wang et al., “Practical root cause localization for microservice systems via trace analysis,” in 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 2021, pp. 1–10
-
[78]
K. Rasul, A. Ashok, A. R. Williams, A. Khorasani, G. Adamopoulos, R. Bhagwatkar, M. Biloš, H. Ghonia, N. Hassen, A. Schneider et al., “Lag-llama: Towards foundation models for time series forecasting,” in R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023
-
[79]
Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: generative pre-trained transformers are large time series models,” in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 32 369–32 399
-
[80]
A. Das, W. Kong, R. Sen, and Y. Zhou, “A decoder-only foundation model for time-series forecasting,” in Forty-first International Conference on Machine Learning, 2024