STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
Pith reviewed 2026-05-20 19:18 UTC · model grok-4.3
The pith
The STAR framework improves RCA agent performance by localizing each error to one of four workflow stages and repairing it through targeted replay.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STAR decomposes an RCA workflow into four structured stages—Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report—and treats agent failure as a stage-localizable reasoning bug. It performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair, which leads to consistent gains in root cause localization and fault type classification while repairing most incorrect traces in one or two rounds.
What carries the argument
The four-stage decomposition of RCA workflows into Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report, which turns monolithic errors into stage-localizable bugs that can be audited and repaired separately.
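One way to picture the decomposition is as a trace with one output slot per stage. The sketch below is illustrative only: the stage names come from the paper, while `RcaTrace`, `STAGE_ORDER`, and `downstream_of` are hypothetical names, and the strictly linear stage order is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class Stage(Enum):
    """The four stages named in the paper, in workflow order."""
    EP = "Evidence Package"
    HS = "Hypothesis Set"
    AS = "Analysis Structure"
    DR = "Decision Report"


# Assumption: stages run strictly in this order.
STAGE_ORDER: List[Stage] = [Stage.EP, Stage.HS, Stage.AS, Stage.DR]


@dataclass
class RcaTrace:
    """Hypothetical container holding one output per stage."""
    outputs: Dict[Stage, Any] = field(default_factory=dict)

    def downstream_of(self, stage: Stage) -> List[Stage]:
        """Stages that would need a replay if `stage` were patched."""
        idx = STAGE_ORDER.index(stage)
        return STAGE_ORDER[idx + 1:]


trace = RcaTrace()
trace.outputs[Stage.EP] = {"metrics": ["cpu_spike@cart"]}  # invented toy evidence
print([s.name for s in trace.downstream_of(Stage.HS)])  # ['AS', 'DR']
```

Keeping each stage output in its own slot is what makes a stage-local audit or patch possible without rerunning the whole workflow.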
If this is right
- Root cause localization and fault type classification both improve over strong baselines on large-scale benchmarks and production datasets.
- The decisive faulty stage is identified with high accuracy across different agent workflows and foundation models.
- Most initially incorrect reasoning traces get repaired within one or two replay rounds.
- Fast/Slow Routing and counterfactual stage evaluation each contribute measurable gains to the repair process.
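The source does not spell out the routing rule, so the following is only a guess at what a budget-aware Fast/Slow decision could look like. The per-stage suspicion scores, the `margin` threshold, the `slow_cost` estimate, and the `route` function itself are all assumptions made for illustration.

```python
def route(suspicion, budget_left, slow_cost=8, margin=0.3):
    """Return 'fast' or 'slow'.

    Assumed rule: take the fast path when one stage is clearly the most
    suspicious, or when the remaining budget cannot pay for the slow path.
    """
    ranked = sorted(suspicion.values(), reverse=True)
    clear_winner = len(ranked) < 2 or ranked[0] - ranked[1] >= margin
    if clear_winner or budget_left < slow_cost:
        return "fast"
    return "slow"


# Invented audit scores in [0, 1]; higher means more suspicious.
print(route({"EP": 0.9, "HS": 0.2, "AS": 0.1, "DR": 0.1}, budget_left=20))   # fast
print(route({"EP": 0.5, "HS": 0.45, "AS": 0.1, "DR": 0.1}, budget_left=20))  # slow
print(route({"EP": 0.5, "HS": 0.45, "AS": 0.1, "DR": 0.1}, budget_left=3))   # fast
```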
Where Pith is reading between the lines
- The same stage-wise auditing idea could be tested on other multi-step agent tasks such as automated planning or code repair to check if modularity helps reliability more broadly.
- Production monitoring systems might adopt this pattern to shorten the time needed to resolve incidents by making agent decisions easier to inspect and fix.
- If the four-stage split proves robust, future work could explore whether adding more granular sub-stages further reduces repair rounds without increasing overhead.
Load-bearing premise
That breaking any RCA workflow into exactly these four stages captures the main sources of error without missing cross-stage interactions or creating new problems during repair.
What would settle it
A test run on new RCA traces where the four-stage model either fails to identify the actual faulty stage with high accuracy or where the repairs do not reduce the overall error rate compared to the original agent.
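The two quantities this test depends on are simple to compute once traces carry labels. A minimal sketch, assuming each trace has a labeled faulty stage and a correct/incorrect flag for its final diagnosis; the function names and the toy data are invented for illustration.

```python
def stage_localization_accuracy(predicted, actual):
    """Fraction of traces whose predicted faulty stage equals the labeled one."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def error_rate(correct_flags):
    """Fraction of traces whose final diagnosis is wrong."""
    if not correct_flags:
        raise ValueError("need at least one trace")
    return 1 - sum(correct_flags) / len(correct_flags)


# Invented toy data for four traces.
predicted_stage = ["EP", "HS", "AS", "EP"]
labeled_stage = ["EP", "HS", "DR", "EP"]
correct_before = [False, False, True, False]  # original agent
correct_after = [True, True, True, False]     # after repair

print(stage_localization_accuracy(predicted_stage, labeled_stage))  # 0.75
print(error_rate(correct_before), error_rate(correct_after))        # 0.75 0.25
```

The claim would be in trouble if the first number were low on fresh traces, or if the second pair showed no drop after repair.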
Original abstract
LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STAR, a Stage-attributed Triage and Repair framework for LLM-based root cause analysis (RCA) agents in microservices. It decomposes any RCA workflow into four stages: Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR). Failures are treated as stage-localizable bugs. STAR performs stage-wise auditing on top of LangGraph, applies budget-aware Fast/Slow Routing, uses counterfactual candidate evaluation to localize the decisive faulty stage, and executes stage-specific patch-and-replay repair. Experiments on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models, report consistent gains in root cause localization and fault type classification over strong baselines, high stage-localization accuracy, and repair of most incorrect traces within one or two replay rounds.
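The patch-and-replay idea can be sketched as a bounded loop: audit the trace, patch the flagged stage, recompute only the stages after it, and stop after a small number of rounds. This is not the paper's implementation (which is built on LangGraph); the toy pipeline, the `audit` and `patch` stand-ins, and every function name below are assumptions.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def replay_after(outputs, patched_stage, stage_fns):
    """Recompute every stage that comes after `patched_stage`."""
    out = dict(outputs)
    for name in STAGES[STAGES.index(patched_stage) + 1:]:
        out[name] = stage_fns[name](out)
    return out


def patch_and_replay(outputs, stage_fns, audit, patch, max_rounds=2):
    """Repair loop: audit, patch the flagged stage, replay downstream."""
    rounds = 0
    while rounds < max_rounds:
        faulty = audit(outputs)  # stage name, or None if nothing is flagged
        if faulty is None:
            break
        patched = dict(outputs)
        patched[faulty] = patch(faulty, outputs[faulty])
        outputs = replay_after(patched, faulty, stage_fns)
        rounds += 1
    return outputs, rounds


# Toy pipeline (invented): EP maps service -> anomaly score.
stage_fns = {
    "HS": lambda o: [s for s, v in o["EP"].items() if v > 0.5],
    "AS": lambda o: sorted(o["HS"], key=lambda s: -o["EP"][s]),
    "DR": lambda o: o["AS"][0] if o["AS"] else None,
}
broken = {"EP": {"cart": 0.7, "db": 0.0}, "HS": ["cart"], "AS": ["cart"], "DR": "cart"}
audit = lambda o: "EP" if o["EP"].get("db") == 0.0 else None  # stand-in auditor
patch = lambda stage, value: {**value, "db": 0.9}             # stand-in patch

fixed, rounds = patch_and_replay(broken, stage_fns, audit, patch)
print(fixed["DR"], rounds)  # db 1
```

Capping `max_rounds` at 2 mirrors the reported result that most traces are repaired within one or two replay rounds; the cap itself is a choice made for this sketch.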
Significance. If the localization accuracy and repair-efficiency results hold after addressing inter-stage dependencies, the work offers a practical path toward more debuggable and self-repairing agentic systems in AIOps. The explicit modeling of failure location via counterfactual evaluation and the separation of fast/slow routing are concrete engineering contributions that could transfer to other multi-step LLM agent pipelines.
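Counterfactual localization, as described on this page, swaps one stage's output for an alternative candidate, replays the later stages, and checks how much the final diagnosis improves. A minimal sketch under strong assumptions: the toy pipeline, the candidate outputs, and a `score` function that knows the true root cause are all invented here, and the paper may score candidates differently.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def run_from(outputs, start_idx, stage_fns):
    """Recompute stages from position `start_idx` onward."""
    out = dict(outputs)
    for name in STAGES[start_idx:]:
        out[name] = stage_fns[name](out)
    return out


def localize(outputs, candidates, stage_fns, score):
    """Return the stage whose counterfactual swap improves the score most."""
    if not candidates:
        raise ValueError("need at least one candidate stage output")
    base = score(outputs["DR"])
    gains = {}
    for i, name in enumerate(STAGES):
        if name not in candidates:
            continue
        trial = dict(outputs)
        trial[name] = candidates[name]               # counterfactual swap
        trial = run_from(trial, i + 1, stage_fns)    # replay downstream only
        gains[name] = score(trial["DR"]) - base
    return max(gains, key=gains.get), gains


stage_fns = {
    "HS": lambda o: [s for s, v in o["EP"].items() if v > 0.5],
    "AS": lambda o: sorted(o["HS"], key=lambda s: -o["EP"][s]),
    "DR": lambda o: o["AS"][0] if o["AS"] else None,
}
outputs = {"EP": {"cart": 0.7, "db": 0.0}, "HS": ["cart"], "AS": ["cart"], "DR": "cart"}
candidates = {
    "EP": {"cart": 0.7, "db": 0.9},
    "HS": ["cart", "db"],
    "AS": ["cart"],
}
score = lambda dr: 1.0 if dr == "db" else 0.0  # assumes a labeled root cause

decisive, gains = localize(outputs, candidates, stage_fns, score)
print(decisive, gains)  # EP {'EP': 1.0, 'HS': 0.0, 'AS': 0.0}
```

In this toy case, fixing the hypothesis set alone does not help because the ranking still reads the faulty evidence, which is the kind of upstream bias the referee report asks about.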
major comments (2)
- Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.
- §3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.
minor comments (3)
- Abstract: the phrases 'strong baselines' and 'high accuracy' should be accompanied by the specific baseline names and quantitative accuracy figures (e.g., 'X% stage-localization accuracy') for immediate readability.
- Evaluation section: details on statistical significance testing, number of runs, and potential confounds (e.g., prompt sensitivity, model temperature) are missing; these should be added to support the reported improvements.
- Notation: the distinction between 'decisive faulty stage' and 'initially incorrect traces' should be defined more precisely in the method section to avoid ambiguity when discussing repair rounds.
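The report does not say which significance test would be appropriate. For paired per-trace correctness (baseline vs. STAR on the same traces), one common option is an exact McNemar test, sketched below with the standard library only. The function name and the toy outcomes are invented; this is one possible choice, not the authors' protocol.

```python
from math import comb


def mcnemar_exact(baseline_correct, star_correct):
    """Two-sided exact McNemar p-value on paired correct/incorrect outcomes."""
    if len(baseline_correct) != len(star_correct):
        raise ValueError("outcome lists must be paired")
    pairs = list(zip(baseline_correct, star_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, STAR wrong
    c = sum(1 for x, y in pairs if y and not x)  # STAR right, baseline wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


# Invented outcomes for eight traces.
baseline = [True, False, False, False, False, False, True, False]
star = [True, True, True, True, True, True, False, False]
print(mcnemar_exact(baseline, star))  # 0.21875
```

With only six discordant pairs the difference is not significant at the usual 0.05 level, which illustrates why the number of runs and traces matters when reading the reported gains.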
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below, providing clarifications on our design choices while acknowledging areas where additional experiments will strengthen the manuscript.
- Referee: Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.
  Authors: We agree that the current evaluation does not report residual error rates after single-stage repair or include a direct comparison to a joint multi-stage correction baseline. The counterfactual evaluation procedure replaces stage outputs with alternative candidates and measures the resulting change in final diagnosis quality; this isolates the decisive stage even when shared LangGraph state exists. Nevertheless, to address potential upstream bias and symptom masking more explicitly, we will add a new experiment in the revised evaluation section that (i) measures residual errors after single-stage patch-and-replay and (ii) compares against a joint multi-stage repair baseline that corrects all stages simultaneously. These additions will quantify any remaining inter-stage dependencies.
  Revision: yes
- Referee: §3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.
  Authors: The four-stage decomposition follows the structure of typical RCA agent traces observed across the two workflows we evaluate. Stage-wise auditing and counterfactual localization are intended to detect propagation by identifying which single stage correction yields the largest improvement. We acknowledge, however, that the manuscript lacks a controlled injection study. We will add such an experiment to §4 (or a new subsection), in which we synthetically corrupt individual stages (e.g., noisy evidence in EP or malformed hypotheses in HS) while keeping others correct, then measure both downstream propagation and the effectiveness of stage-specific repair. The same setup will be used to check for failure modes introduced by the replay loop itself.
  Revision: yes
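The injection study promised in the rebuttal could start from something like the harness below: corrupt exactly one stage of an otherwise clean run and list which later stages change. Everything here is hypothetical (the toy pipeline, the corruption functions, the names `run` and `propagation`), and the sketch covers only the propagation half of the proposed experiment, not the repair half.

```python
STAGES = ["EP", "HS", "AS", "DR"]


def run(evidence, stage_fns, inject=None):
    """Run all stages; `inject` is (stage, corrupt_fn) applied to that stage's output."""
    out = {}
    for name in STAGES:
        value = evidence if name == "EP" else stage_fns[name](out)
        if inject is not None and inject[0] == name:
            value = inject[1](value)
        out[name] = value
    return out


def propagation(evidence, stage_fns, stage, corrupt):
    """Downstream stages whose output differs from the clean run."""
    clean = run(evidence, stage_fns)
    dirty = run(evidence, stage_fns, inject=(stage, corrupt))
    later = STAGES[STAGES.index(stage) + 1:]
    return [s for s in later if clean[s] != dirty[s]]


stage_fns = {
    "HS": lambda o: [s for s, v in o["EP"].items() if v > 0.5],
    "AS": lambda o: sorted(o["HS"], key=lambda s: -o["EP"][s]),
    "DR": lambda o: o["AS"][0] if o["AS"] else None,
}
evidence = {"cart": 0.7, "db": 0.9}  # invented clean evidence

drop_db = lambda ep: {k: v for k, v in ep.items() if k != "db"}  # missing evidence
truncate = lambda hs: hs[:1]                                     # malformed hypotheses
reverse = lambda ranked: ranked[::-1]                            # wrong ranking

print(propagation(evidence, stage_fns, "EP", drop_db))   # ['HS', 'AS', 'DR']
print(propagation(evidence, stage_fns, "HS", truncate))  # ['AS', 'DR']
print(propagation(evidence, stage_fns, "AS", reverse))   # ['DR']
```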
Circularity Check
No circularity: empirical framework evaluated on external data
The paper introduces STAR as an engineering framework that decomposes RCA workflows into four stages (Evidence Package, Hypothesis Set, Analysis Structure, Decision Report) and implements stage-wise auditing, Fast/Slow Routing, counterfactual evaluation, and patch-and-replay repair on top of LangGraph. All reported improvements in root cause localization, fault classification, decisive-stage accuracy, and repair rounds are obtained from direct experiments on a public large-scale benchmark and a real-world production dataset using two RCA agent workflows and three foundation models. No equations, fitted parameters, or self-citation chains are used to derive the performance claims; the four-stage decomposition is presented as a design choice whose value is assessed by external evaluation rather than by construction from the framework's own outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR)
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: decisive stage localization via counterfactual candidate evaluation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.