arxiv: 2605.15581 · v1 · pith:G5AXAPQTnew · submitted 2026-05-15 · 💻 cs.AI

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Pith reviewed 2026-05-20 19:18 UTC · model grok-4.3

classification 💻 cs.AI
keywords: root cause analysis, LLM agents, microservices, AIOps, fault localization, self-repair, stage decomposition, incident diagnosis

The pith

The STAR framework improves RCA agent performance by localizing each error to one of four workflow stages and repairing it through targeted replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STAR as a way to make LLM-based root cause analysis agents more reliable in microservice environments by breaking the diagnosis process into four distinct stages rather than treating failures as a single end-to-end problem. These stages are Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report. The framework audits each stage, applies budget-aware Fast/Slow Routing, identifies the decisive faulty stage through counterfactual candidate evaluation, and repairs it with stage-specific patch-and-replay. If this holds, agents can correct most mistakes within one or two replay rounds and deliver better root cause localization and fault type classification on both a public benchmark and real production data. A reader would care because it turns fragile agent reasoning into something more debuggable and self-correcting for incident response.

Core claim

STAR decomposes an RCA workflow into four structured stages—Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report—and treats agent failure as a stage-localizable reasoning bug. It performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair, which leads to consistent gains in root cause localization and fault type classification while repairing most incorrect traces in one or two rounds.

What carries the argument

The four-stage decomposition of RCA workflows into Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report, which turns monolithic errors into stage-localizable bugs that can be audited and repaired separately.
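A minimal sketch of how the four-stage decomposition might be represented, assuming the stages run in the order the paper lists them and that patching one stage invalidates everything after it. The names `Stage`, `STAGE_ORDER`, `RCATrace`, and `downstream_of` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    EP = "Evidence Package"
    HS = "Hypothesis Set"
    AS = "Analysis Structure"
    DR = "Decision Report"


# Assumption: stages execute in the order the paper lists them.
STAGE_ORDER = [Stage.EP, Stage.HS, Stage.AS, Stage.DR]


@dataclass
class RCATrace:
    """Hypothetical container holding one structured output per stage."""
    outputs: dict = field(default_factory=dict)  # Stage -> stage output

    def downstream_of(self, stage):
        """Stages that would need a replay if `stage` were patched."""
        idx = STAGE_ORDER.index(stage)
        return STAGE_ORDER[idx + 1:]


trace = RCATrace()
to_replay = trace.downstream_of(Stage.HS)
```

Under this assumption, a patch to the Hypothesis Set forces a replay of the Analysis Structure and the Decision Report, while the Evidence Package is left untouched.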

If this is right

  • Root cause localization and fault type classification both improve over strong baselines on large-scale benchmarks and production datasets.
  • The decisive faulty stage is identified with high accuracy across different agent workflows and foundation models.
  • Most initially incorrect reasoning traces get repaired within one or two replay rounds.
  • Fast/Slow Routing and counterfactual stage evaluation each contribute measurable gains to the repair process.
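The counterfactual stage evaluation behind these points can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it swaps one stage output for an alternative candidate, replays the downstream stages, and picks the stage whose replacement improves the final diagnosis most. The function names, the toy copy-forward pipeline, and the service names are all hypothetical.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def localize_decisive_stage(trace, candidates, replay, score):
    """Return (stage, gain) for the stage whose replacement helps most.

    trace:      dict stage -> output of the original (faulty) run
    candidates: dict stage -> alternative output to try for that stage
    replay:     callable(trace, stage) -> trace with downstream stages re-run
    score:      callable(trace) -> quality of the final diagnosis
    """
    baseline = score(trace)
    best_stage, best_gain = None, 0.0
    for stage in STAGES:
        if stage not in candidates:
            continue
        patched = dict(trace)
        patched[stage] = candidates[stage]
        gain = score(replay(patched, stage)) - baseline
        if gain > best_gain:  # strict: ties go to the earliest stage
            best_stage, best_gain = stage, gain
    return best_stage, best_gain


def toy_replay(trace, patched_stage):
    # Toy pipeline (assumption): each stage copies its predecessor's output.
    out = dict(trace)
    for i in range(STAGES.index(patched_stage) + 1, len(STAGES)):
        out[STAGES[i]] = out[STAGES[i - 1]]
    return out


def toy_score(trace):
    # Toy ground truth (assumption): the real root cause is "db-service".
    return 1.0 if trace["DR"] == "db-service" else 0.0


faulty = {s: "cart-service" for s in STAGES}
candidates = {"EP": "db-service", "HS": "cart-service", "AS": "cart-service"}
result = localize_decisive_stage(faulty, candidates, toy_replay, toy_score)
```

In this toy run only the Evidence Package candidate changes the final report, so `EP` is returned as the decisive stage. A real system would need a scoring function that works without ground truth, which the abstract does not describe.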

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stage-wise auditing idea could be tested on other multi-step agent tasks such as automated planning or code repair to check if modularity helps reliability more broadly.
  • Production monitoring systems might adopt this pattern to shorten the time needed to resolve incidents by making agent decisions easier to inspect and fix.
  • If the four-stage split proves robust, future work could explore whether adding more granular sub-stages further reduces repair rounds without increasing overhead.

Load-bearing premise

That breaking any RCA workflow into exactly these four stages captures the main sources of error without missing cross-stage interactions or creating new problems during repair.

What would settle it

A test run on new RCA traces where the four-stage model either fails to identify the actual faulty stage with high accuracy or where the repairs do not reduce the overall error rate compared to the original agent.
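A sketch of how such a test run could be scored, assuming each trace has been labeled with the true faulty stage and with diagnosis correctness before and after repair. The record fields and the function name are hypothetical, and the four records are made-up illustration data, not results from the paper.

```python
def settle_it(records):
    """Summarize stage-localization accuracy and error rates.

    Each record is a dict with the (hypothetical) keys:
    true_stage, predicted_stage, correct_before, correct_after.
    """
    n = len(records)
    hits = sum(r["predicted_stage"] == r["true_stage"] for r in records)
    err_before = sum(not r["correct_before"] for r in records)
    err_after = sum(not r["correct_after"] for r in records)
    return {
        "stage_localization_accuracy": hits / n,
        "error_rate_before": err_before / n,
        "error_rate_after": err_after / n,
    }


# Made-up records for illustration only.
records = [
    {"true_stage": "EP", "predicted_stage": "EP", "correct_before": False, "correct_after": True},
    {"true_stage": "HS", "predicted_stage": "HS", "correct_before": False, "correct_after": True},
    {"true_stage": "AS", "predicted_stage": "HS", "correct_before": False, "correct_after": False},
    {"true_stage": "DR", "predicted_stage": "DR", "correct_before": False, "correct_after": True},
]
summary = settle_it(records)
```

The claim would be in trouble if `stage_localization_accuracy` were low on fresh traces, or if `error_rate_after` were not clearly below `error_rate_before`.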

Figures

Figures reproduced from arXiv: 2605.15581 by Junle Wang, Wenjun Wu, Xingchuang Liao.

Figure 1: Overview of our proposed framework STAR.
Figure 2: Main prompt of stage critic C_s, which STAR invokes for each stage s.
Figure 3: Comparison Across Microservice Fault Stages and ...
Figure 4: Counterfactual repair iteration distribution when STAR ...
Abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes STAR, a Stage-attributed Triage and Repair framework for LLM-based root cause analysis (RCA) agents in microservices. It decomposes any RCA workflow into four stages—Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR)—and treats failures as stage-localizable bugs. STAR performs stage-wise auditing on top of LangGraph, applies budget-aware Fast/Slow Routing, uses counterfactual candidate evaluation to localize the decisive faulty stage, and executes stage-specific patch-and-replay repair. Experiments on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models, report consistent gains in root cause localization and fault type classification over strong baselines, high accuracy in stage localization, and that most incorrect traces are repaired in one or two replay rounds.

Significance. If the localization accuracy and repair-efficiency results hold after addressing inter-stage dependencies, the work offers a practical path toward more debuggable and self-repairing agentic systems in AIOps. The explicit modeling of failure location via counterfactual evaluation and the separation of fast/slow routing are concrete engineering contributions that could transfer to other multi-step LLM agent pipelines.

major comments (2)
  1. Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.
  2. §3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.
minor comments (3)
  1. Abstract: the phrases 'strong baselines' and 'high accuracy' should be accompanied by the specific baseline names and quantitative accuracy figures (e.g., 'X% stage-localization accuracy') for immediate readability.
  2. Evaluation section: details on statistical significance testing, number of runs, and potential confounds (e.g., prompt sensitivity, model temperature) are missing; these should be added to support the reported improvements.
  3. Notation: the distinction between 'decisive faulty stage' and 'initially incorrect traces' should be defined more precisely in the method section to avoid ambiguity when discussing repair rounds.
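One way the significance testing requested in minor comment 2 could be done is a paired test on per-incident correctness. The sketch below uses an exact two-sided McNemar test, which is this review's illustration rather than anything the paper reports; the function name and the sample outcomes are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, star_correct):
    """Exact two-sided McNemar p-value for paired binary outcomes."""
    pairs = list(zip(baseline_correct, star_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, STAR wrong
    c = sum(1 for x, y in pairs if y and not x)  # baseline wrong, STAR right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up outcomes on ten incidents, for illustration only.
baseline = [False] * 8 + [True, True]
star = [True] * 8 + [True, False]
p_value = mcnemar_exact(baseline, star)
```

With eight incidents fixed and one regressed, the test sees nine discordant pairs and yields a p-value of about 0.039. Only the discordant pairs carry information, which is why paired reporting matters more than aggregate accuracy alone.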

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, providing clarifications on our design choices while acknowledging areas where additional experiments will strengthen the manuscript.

  1. Referee: Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.

    Authors: We agree that the current evaluation does not report residual error rates after single-stage repair or include a direct comparison to a joint multi-stage correction baseline. The counterfactual evaluation procedure replaces stage outputs with alternative candidates and measures the resulting change in final diagnosis quality; this isolates the decisive stage even when shared LangGraph state exists. Nevertheless, to address potential upstream bias and symptom masking more explicitly, we will add a new experiment in the revised evaluation section that (i) measures residual errors after single-stage patch-and-replay and (ii) compares against a joint multi-stage repair baseline that corrects all stages simultaneously. These additions will quantify any remaining inter-stage dependencies.

    revision: yes

  2. Referee: §3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.

    Authors: The four-stage decomposition follows the structure of typical RCA agent traces observed across the two workflows we evaluate. Stage-wise auditing and counterfactual localization are intended to detect propagation by identifying which single stage correction yields the largest improvement. We acknowledge, however, that the manuscript lacks a controlled injection study. We will add such an experiment to §4 (or a new subsection), in which we synthetically corrupt individual stages (e.g., noisy evidence in EP or malformed hypotheses in HS) while keeping others correct, then measure both downstream propagation and the effectiveness of stage-specific repair. The same setup will be used to check for failure modes introduced by the replay loop itself.

    revision: yes
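The controlled injection study proposed in the second response could look roughly like the sketch below. It is an assumption-laden toy, not the authors' planned experiment: the stage functions, the corrupted values, and the oracle repair that restores the clean stage output are all hypothetical stand-ins.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def replay_downstream(trace, stage, stage_fns):
    """Re-run every stage after `stage`, each from its predecessor's output."""
    out = dict(trace)
    for i in range(STAGES.index(stage) + 1, len(STAGES)):
        out[STAGES[i]] = stage_fns[STAGES[i]](out[STAGES[i - 1]])
    return out


def injection_study(clean, corruptions, stage_fns, repair, is_correct):
    """Corrupt one stage at a time; record propagation and repair success."""
    results = {}
    for stage, bad_output in corruptions.items():
        corrupted = dict(clean)
        corrupted[stage] = bad_output
        corrupted = replay_downstream(corrupted, stage, stage_fns)
        patched = dict(corrupted)
        patched[stage] = repair(stage, corrupted)
        patched = replay_downstream(patched, stage, stage_fns)
        results[stage] = {
            "propagated": not is_correct(corrupted),
            "repaired": is_correct(patched),
        }
    return results


# Toy stage functions (assumption): HS reads the evidence, AS and DR pass through.
stage_fns = {
    "HS": lambda ep: "db-service" if "db" in ep else "cart-service",
    "AS": lambda hs: hs,
    "DR": lambda analysis: analysis,
}
clean = {"EP": "db latency spike", "HS": "db-service",
         "AS": "db-service", "DR": "db-service"}
corruptions = {"EP": "noisy logs", "HS": "cart-service",
               "AS": "cart-service", "DR": "cart-service"}
results = injection_study(
    clean, corruptions, stage_fns,
    repair=lambda stage, trace: clean[stage],  # oracle repair (assumption)
    is_correct=lambda trace: trace["DR"] == "db-service",
)
```

In this toy every injected error reaches the final report and every oracle repair fixes it. The informative cases in a real study would be the ones where `propagated` is true but `repaired` is false, since those indicate cross-stage effects that single-stage repair misses.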

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external data

The paper introduces STAR as an engineering framework that decomposes RCA workflows into four stages (Evidence Package, Hypothesis Set, Analysis Structure, Decision Report) and implements stage-wise auditing, Fast/Slow Routing, counterfactual evaluation, and patch-and-replay repair on top of LangGraph. All reported improvements in root cause localization, fault classification, decisive-stage accuracy, and repair rounds are obtained from direct experiments on a public large-scale benchmark and a real-world production dataset using two RCA agent workflows and three foundation models. No equations, fitted parameters, or self-citation chains are used to derive the performance claims; the four-stage decomposition is presented as a design choice whose value is assessed by external evaluation rather than by construction from the framework's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The four-stage decomposition is treated as a modeling choice rather than derived.
