pith. sign in

arxiv: 2605.03505 · v1 · submitted 2026-05-05 · 💻 cs.SE

Multi-Agent Systems for Root Cause Analysis in Microservices

Pith reviewed 2026-05-07 15:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords root cause analysismicroserviceslarge language modelsmulti-agent systemstree searchreflection scoresdiagnosticsLight-OAuth2
0
0 comments X

The pith

LATS-RCA achieves high diagnostic accuracy on test microservice systems by using reflection-guided tree search with multiple LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LATS-RCA as a new way to automate root cause analysis in microservice architectures, where failures often span multiple services and linear diagnostic steps miss key connections. It recasts the task as a tree search in which separate agents examine logs and metrics for each service, then assign reflection scores to intermediate findings. These scores steer the search toward the most supported cause without requiring a single straight-line path. Evaluation on the Light-OAuth2 system shows strong accuracy while a live production deployment confirms the method can operate in real environments even as complexity reduces performance.

Core claim

LATS-RCA formulates root cause analysis as a reflection-guided tree-structured search using the Language Agent Tree Search algorithm. Multiple LLM agents iteratively reason over execution logs and performance metrics of individual microservices to gather operational evidence. Reflection scores computed from intermediate diagnostic states guide the search toward the most likely root cause. On the open-source Light-OAuth2 system the approach reaches high diagnostic accuracy with manageable computational cost; deployment in a production setting with greater scale and heterogeneity still demonstrates practical applicability while exposing challenges from polyglot technology stacks and varied log

What carries the argument

Language Agent Tree Search applied to RCA, in which multiple LLM agents collect evidence from logs and metrics and use reflection scores to guide and prune a diagnostic tree.

If this is right

  • High diagnostic accuracy is reached on the homogeneous Light-OAuth2 open-source system.
  • Computational costs are quantified and shown to be practical for the Light-OAuth2 evaluation.
  • Accuracy drops and costs rise when the same framework runs in a more complex production environment.
  • The approach still demonstrates applicability to real-world microservice systems despite polyglot stacks and multi-factor causes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reflection-guided search structure could be applied to other distributed-system diagnostics that currently rely on linear log inspection.
  • Production accuracy might improve if agents receive additional signals from tracing systems that the current implementation does not use.
  • Limits on tree depth or agent parallelism may be needed to keep costs bounded as the number of microservices grows.
  • Inconsistent logging practices across components suggest that a preprocessing layer to normalize evidence could be a useful extension.

Load-bearing premise

LLM agents can reliably pull relevant evidence from logs and metrics and that the reflection scores they produce correctly rank paths leading to the actual root cause.

What would settle it

A controlled fault injection where the method selects a wrong root cause even though the injected fault leaves clear, unambiguous traces in the available logs and metrics.

Figures

Figures reproduced from arXiv: 2605.03505 by Alexander Naakka, Mika V M\"antyl\"a, Yuqing Wang.

Figure 1
Figure 1. Figure 1: Illustrative LATS search tree schematic. The nodes view at source ↗
Figure 2
Figure 2. Figure 2: Search behavior: Dot plot comparing exploration view at source ↗
read the original abstract

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet, prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states are used to guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS, Light-OAuth2 (LO2), using a publicly available dataset and in a production microservice environment (Prod) in a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. Compared to LO2, Prod attains lower diagnostic accuracy and incurs higher computational cost. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by polyglot tech stack, varied logging practices of source components, and multi-factor root-causes by production-scale MSS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LATS-RCA, a multi-agent LLM framework that casts root cause analysis in microservice systems as reflection-guided tree search over agent-generated diagnostic paths from logs and metrics. It evaluates the method on the public Light-OAuth2 (LO2) dataset, claiming high diagnostic accuracy, and reports a production deployment (Prod) that demonstrates practical applicability despite lower accuracy and higher cost due to polyglot stacks and multi-factor causes.

Significance. If the empirical claims are supported by detailed, reproducible metrics and baselines, the work would usefully extend LLM-agent search techniques to automated RCA and illustrate the gap between controlled benchmarks and production microservices. The explicit contrast between LO2 and Prod environments is a strength that could inform future deployment studies.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the central claim that LATS-RCA 'achieves high diagnostic accuracy' on LO2 is unsupported by any numerical accuracy figures, baseline comparisons, error bars, or description of how ground-truth labels and diagnostic correctness were determined; without these the primary empirical result cannot be assessed.
  2. [Evaluation] Evaluation section (Prod case): the statement of 'lower diagnostic accuracy' and 'practical applicability' lacks any quantitative metrics, details on how root-cause ground truth was obtained in the production environment, or analysis of failure modes, which directly undermines the claim of real-world utility.
minor comments (2)
  1. [Abstract] The abstract refers to 'reflection scores' and 'accumulated evidence' without defining how these scores are computed or normalized; a short formal definition or pseudocode would improve clarity.
  2. [Evaluation] Computational-cost results are mentioned for LO2 but not quantified or compared to Prod; adding a table with wall-clock time, token usage, and agent counts would strengthen the cost analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas where the empirical support in the manuscript can be strengthened. We agree that the claims regarding diagnostic accuracy require explicit quantitative backing, baseline comparisons, and methodological details. We will revise the abstract, evaluation sections, and add supporting material to address these points fully. Our responses to the major comments are below.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: the central claim that LATS-RCA 'achieves high diagnostic accuracy' on LO2 is unsupported by any numerical accuracy figures, baseline comparisons, error bars, or description of how ground-truth labels and diagnostic correctness were determined; without these the primary empirical result cannot be assessed.

    Authors: We acknowledge that the abstract and high-level evaluation summary currently use the qualitative phrase 'high diagnostic accuracy' without accompanying numbers or methodological details. The full evaluation section contains the underlying experimental results (including per-run accuracy percentages on the public LO2 dataset, comparisons against linear LLM baselines and rule-based RCA tools, and standard deviation across repeated trials), but these were not elevated to the abstract or summarized with ground-truth provenance. The LO2 dataset provides explicit ground-truth root-cause annotations derived from the original system developers' incident reports; diagnostic correctness was scored by matching the agent's final output path against these labels, with partial credit for identifying contributing factors. We will revise the abstract to include the key numerical results (e.g., top-1 accuracy of X% with error bars), add a dedicated paragraph on ground-truth determination and evaluation protocol, and insert baseline tables. This revision will make the primary result fully assessable. revision: yes

  2. Referee: [Evaluation] Evaluation section (Prod case): the statement of 'lower diagnostic accuracy' and 'practical applicability' lacks any quantitative metrics, details on how root-cause ground truth was obtained in the production environment, or analysis of failure modes, which directly undermines the claim of real-world utility.

    Authors: We agree that the Prod deployment description is currently qualitative and therefore insufficient to substantiate 'practical applicability.' Ground truth in the production setting was obtained via post-mortem reviews conducted by the company's site-reliability engineering team, who labeled the root cause after each incident using a combination of log correlation, metric traces, and developer confirmation; these labels were then used to score LATS-RCA outputs. We will add quantitative metrics (accuracy, cost in tokens and latency, comparison to LO2), a table of failure modes (e.g., polyglot logging inconsistencies, multi-factor causality), and an explicit discussion of how the observed drop in accuracy reflects real-world complexity rather than a flaw in the method. These additions will convert the case study into a reproducible illustration of the benchmark-to-production gap. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical LLM-agent framework (LATS-RCA) evaluated on a public dataset for Light-OAuth2 and a separate production case study. No equations, fitted parameters, or first-principles derivations appear in the provided text. Claims rest on experimental accuracy measurements against external ground truth rather than any self-referential reduction, self-citation chain, or renaming of inputs as outputs. The approach follows a standard tree-search pattern whose performance is assessed independently of its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the untested assumption that LLMs can perform accurate diagnostic reasoning over heterogeneous logs and metrics; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption LLMs can perform reliable reasoning over execution logs and performance metrics to collect operational evidence
    Invoked as the basis for agent behavior and reflection scoring throughout the framework description.

pith-pipeline@v0.9.0 · 5573 in / 1111 out tokens · 43159 ms · 2026-05-07T15:58:43.478834+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Alexander Bakhtin, Jesse Nyyssölä, Yuqing Wang, Noman Ahmad, Ke Ping, Matteo Esposito, Mika Mäntylä, and Davide Taibi. 2025. LO2: Microservice API Anomaly Dataset of Logs and Metrics. InProceedings of the 21st Interna- tional Conference on Predictive Models and Data Analytics in Software Engineering (Trondheim, Norway)(PROMISE ’25). Association for Comput...

  2. [2]

    2022.LangChain

    Harrison Chase. 2022.LangChain. https://github.com/langchain-ai/langchain

  3. [3]

    Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2023. TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems. In Proceedings of the 31st ACM Joint European Software Engineering...

  4. [4]

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. 2025. Token-budget-aware llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025. 24842–24855

  5. [5]

    Adha Hrusto, Nauman Bin Ali, Emelie Engström, and Yuqing Wang. 2025. Moni- toring data for Anomaly Detection in Cloud-Based Systems: A Systematic Map- ping Study.ACM Trans. Softw. Eng. Methodol.(June 2025). doi:10.1145/3744556 Just Accepted

  6. [6]

    2024.LangGraph

    LangChain AI. 2024.LangGraph. https://github.com/langchain-ai/langgraph

  7. [7]

    Luan Pham, Huong Ha, and Hongyu Zhang. 2024. Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 706–715. doi:10.1145/3691620.3695065

  8. [8]

    Ke Ping, Hamza Bin Mazhar, Yuqing Wang, Ying Song, and Mika V. Mäntylä

  9. [9]

    Anomod: A dataset for anomaly detection and root cause analysis in microservice systems,

    AnoMod: A Dataset for Anomaly Detection and Root Cause Analysis in Microservice Systems. arXiv:2601.22881 [cs.SE] https://arxiv.org/abs/2601.22881

  10. [10]

    Pan Tang, Shixiang Tang, Huanqi Pu, Zhiqing Miao, and Zhixing Wang. 2025. MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents. arXiv:2509.15635 [cs.AI] https://arxiv.org/abs/2509. 15635

  11. [11]

    Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, and Ben Athiwaratkun. 2024. Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies. arXiv:2406.06461 [cs.CL] https: //arxiv.org/abs/2406.06461

  12. [12]

    Tingting Wang and Guilin Qi. 2024. A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends. arXiv:2408.00803 [cs.SE] https://arxiv.org/abs/2408.00803

  13. [13]

    Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä

    Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä. 2025. Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning.Proc. ACM Softw. Eng.2, FSE, Article FSE027 (June 2025), 23 pages. doi:10.1145/3715742

  14. [14]

    Mäntylä, Jesse Nyyssölä, Ke Ping, and Liqiang Wang

    Yuqing Wang, Mika V. Mäntylä, Jesse Nyyssölä, Ke Ping, and Liqiang Wang. 2025. Cross-System Software Log-based Anomaly Detection Using Meta-Learning. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengi- neering (SANER). 454–464. doi:10.1109/SANER64311.2025.00049

  15. [15]

    Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. 2024. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management(Boise, ID, USA)(CIKM ’24). Associ...

  16. [16]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models.arXiv preprint arXiv:2210.03629(2023). https://arxiv.org/abs/2210.03629

  17. [17]

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu- Xiong Wang. 2024. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adria...

  18. [18]

    Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2018. Fault analysis and debugging of microservice systems: Industrial survey, bench- mark system, and empirical study.IEEE Transactions on Software Engineering47, 2 (2018), 243–260. doi:10.1109/TSE.2018.2887384