Multi-Agent Systems for Root Cause Analysis in Microservices
Pith reviewed 2026-05-07 15:58 UTC · model grok-4.3
The pith
LATS-RCA achieves high diagnostic accuracy on test microservice systems by using reflection-guided tree search with multiple LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LATS-RCA formulates root cause analysis as a reflection-guided tree-structured search using the Language Agent Tree Search algorithm. Multiple LLM agents iteratively reason over execution logs and performance metrics of individual microservices to gather operational evidence. Reflection scores computed from intermediate diagnostic states guide the search toward the most likely root cause. On the open-source Light-OAuth2 system the approach reaches high diagnostic accuracy with manageable computational cost; deployment in a production setting with greater scale and heterogeneity still demonstrates practical applicability while exposing challenges from polyglot technology stacks and varied log
What carries the argument
Language Agent Tree Search applied to RCA, in which multiple LLM agents collect evidence from logs and metrics and use reflection scores to guide and prune a diagnostic tree.
If this is right
- High diagnostic accuracy is reached on the homogeneous Light-OAuth2 open-source system.
- Computational costs are quantified and shown to be practical for the Light-OAuth2 evaluation.
- Accuracy drops and costs rise when the same framework runs in a more complex production environment.
- The approach still demonstrates applicability to real-world microservice systems despite polyglot stacks and multi-factor causes.
Where Pith is reading between the lines
- The same reflection-guided search structure could be applied to other distributed-system diagnostics that currently rely on linear log inspection.
- Production accuracy might improve if agents receive additional signals from tracing systems that the current implementation does not use.
- Limits on tree depth or agent parallelism may be needed to keep costs bounded as the number of microservices grows.
- Inconsistent logging practices across components suggest that a preprocessing layer to normalize evidence could be a useful extension.
Load-bearing premise
LLM agents can reliably pull relevant evidence from logs and metrics and that the reflection scores they produce correctly rank paths leading to the actual root cause.
What would settle it
A controlled fault injection where the method selects a wrong root cause even though the injected fault leaves clear, unambiguous traces in the available logs and metrics.
Figures
read the original abstract
Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet, prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states are used to guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS, Light-OAuth2 (LO2), using a publicly available dataset and in a production microservice environment (Prod) in a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. Compared to LO2, Prod attains lower diagnostic accuracy and incurs higher computational cost. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by polyglot tech stack, varied logging practices of source components, and multi-factor root-causes by production-scale MSS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LATS-RCA, a multi-agent LLM framework that casts root cause analysis in microservice systems as reflection-guided tree search over agent-generated diagnostic paths from logs and metrics. It evaluates the method on the public Light-OAuth2 (LO2) dataset, claiming high diagnostic accuracy, and reports a production deployment (Prod) that demonstrates practical applicability despite lower accuracy and higher cost due to polyglot stacks and multi-factor causes.
Significance. If the empirical claims are supported by detailed, reproducible metrics and baselines, the work would usefully extend LLM-agent search techniques to automated RCA and illustrate the gap between controlled benchmarks and production microservices. The explicit contrast between LO2 and Prod environments is a strength that could inform future deployment studies.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: the central claim that LATS-RCA 'achieves high diagnostic accuracy' on LO2 is unsupported by any numerical accuracy figures, baseline comparisons, error bars, or description of how ground-truth labels and diagnostic correctness were determined; without these the primary empirical result cannot be assessed.
- [Evaluation] Evaluation section (Prod case): the statement of 'lower diagnostic accuracy' and 'practical applicability' lacks any quantitative metrics, details on how root-cause ground truth was obtained in the production environment, or analysis of failure modes, which directly undermines the claim of real-world utility.
minor comments (2)
- [Abstract] The abstract refers to 'reflection scores' and 'accumulated evidence' without defining how these scores are computed or normalized; a short formal definition or pseudocode would improve clarity.
- [Evaluation] Computational-cost results are mentioned for LO2 but not quantified or compared to Prod; adding a table with wall-clock time, token usage, and agent counts would strengthen the cost analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas where the empirical support in the manuscript can be strengthened. We agree that the claims regarding diagnostic accuracy require explicit quantitative backing, baseline comparisons, and methodological details. We will revise the abstract, evaluation sections, and add supporting material to address these points fully. Our responses to the major comments are below.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the central claim that LATS-RCA 'achieves high diagnostic accuracy' on LO2 is unsupported by any numerical accuracy figures, baseline comparisons, error bars, or description of how ground-truth labels and diagnostic correctness were determined; without these the primary empirical result cannot be assessed.
Authors: We acknowledge that the abstract and high-level evaluation summary currently use the qualitative phrase 'high diagnostic accuracy' without accompanying numbers or methodological details. The full evaluation section contains the underlying experimental results (including per-run accuracy percentages on the public LO2 dataset, comparisons against linear LLM baselines and rule-based RCA tools, and standard deviation across repeated trials), but these were not elevated to the abstract or summarized with ground-truth provenance. The LO2 dataset provides explicit ground-truth root-cause annotations derived from the original system developers' incident reports; diagnostic correctness was scored by matching the agent's final output path against these labels, with partial credit for identifying contributing factors. We will revise the abstract to include the key numerical results (e.g., top-1 accuracy of X% with error bars), add a dedicated paragraph on ground-truth determination and evaluation protocol, and insert baseline tables. This revision will make the primary result fully assessable. revision: yes
-
Referee: [Evaluation] Evaluation section (Prod case): the statement of 'lower diagnostic accuracy' and 'practical applicability' lacks any quantitative metrics, details on how root-cause ground truth was obtained in the production environment, or analysis of failure modes, which directly undermines the claim of real-world utility.
Authors: We agree that the Prod deployment description is currently qualitative and therefore insufficient to substantiate 'practical applicability.' Ground truth in the production setting was obtained via post-mortem reviews conducted by the company's site-reliability engineering team, who labeled the root cause after each incident using a combination of log correlation, metric traces, and developer confirmation; these labels were then used to score LATS-RCA outputs. We will add quantitative metrics (accuracy, cost in tokens and latency, comparison to LO2), a table of failure modes (e.g., polyglot logging inconsistencies, multi-factor causality), and an explicit discussion of how the observed drop in accuracy reflects real-world complexity rather than a flaw in the method. These additions will convert the case study into a reproducible illustration of the benchmark-to-production gap. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical LLM-agent framework (LATS-RCA) evaluated on a public dataset for Light-OAuth2 and a separate production case study. No equations, fitted parameters, or first-principles derivations appear in the provided text. Claims rest on experimental accuracy measurements against external ground truth rather than any self-referential reduction, self-citation chain, or renaming of inputs as outputs. The approach follows a standard tree-search pattern whose performance is assessed independently of its own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can perform reliable reasoning over execution logs and performance metrics to collect operational evidence
Reference graph
Works this paper leans on
-
[1]
Alexander Bakhtin, Jesse Nyyssölä, Yuqing Wang, Noman Ahmad, Ke Ping, Matteo Esposito, Mika Mäntylä, and Davide Taibi. 2025. LO2: Microservice API Anomaly Dataset of Logs and Metrics. InProceedings of the 21st Interna- tional Conference on Predictive Models and Data Analytics in Software Engineering (Trondheim, Norway)(PROMISE ’25). Association for Comput...
-
[2]
2022.LangChain
Harrison Chase. 2022.LangChain. https://github.com/langchain-ai/langchain
2022
-
[3]
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2023. TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems. In Proceedings of the 31st ACM Joint European Software Engineering...
-
[4]
Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. 2025. Token-budget-aware llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025. 24842–24855
2025
-
[5]
Adha Hrusto, Nauman Bin Ali, Emelie Engström, and Yuqing Wang. 2025. Moni- toring data for Anomaly Detection in Cloud-Based Systems: A Systematic Map- ping Study.ACM Trans. Softw. Eng. Methodol.(June 2025). doi:10.1145/3744556 Just Accepted
-
[6]
2024.LangGraph
LangChain AI. 2024.LangGraph. https://github.com/langchain-ai/langgraph
2024
-
[7]
Luan Pham, Huong Ha, and Hongyu Zhang. 2024. Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 706–715. doi:10.1145/3691620.3695065
-
[8]
Ke Ping, Hamza Bin Mazhar, Yuqing Wang, Ying Song, and Mika V. Mäntylä
-
[9]
Anomod: A dataset for anomaly detection and root cause analysis in microservice systems,
AnoMod: A Dataset for Anomaly Detection and Root Cause Analysis in Microservice Systems. arXiv:2601.22881 [cs.SE] https://arxiv.org/abs/2601.22881
- [10]
- [11]
- [12]
-
[13]
Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä
Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä. 2025. Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning.Proc. ACM Softw. Eng.2, FSE, Article FSE027 (June 2025), 23 pages. doi:10.1145/3715742
-
[14]
Mäntylä, Jesse Nyyssölä, Ke Ping, and Liqiang Wang
Yuqing Wang, Mika V. Mäntylä, Jesse Nyyssölä, Ke Ping, and Liqiang Wang. 2025. Cross-System Software Log-based Anomaly Detection Using Meta-Learning. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengi- neering (SANER). 454–464. doi:10.1109/SANER64311.2025.00049
-
[15]
Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2024. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management(Boise, ID, USA)(CIKM ’24). Associ...
-
[16]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models.arXiv preprint arXiv:2210.03629(2023). https://arxiv.org/abs/2210.03629
work page internal anchor Pith review arXiv 2023
-
[17]
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu- Xiong Wang. 2024. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adria...
2024
-
[18]
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2018. Fault analysis and debugging of microservice systems: Industrial survey, bench- mark system, and empirical study.IEEE Transactions on Software Engineering47, 2 (2018), 243–260. doi:10.1109/TSE.2018.2887384
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.