STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Junle Wang; Wenjun Wu; Xingchuang Liao

arxiv: 2605.15581 · v1 · pith:G5AXAPQTnew · submitted 2026-05-15 · 💻 cs.AI

Pith reviewed 2026-05-20 19:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords root cause analysis · LLM agents · microservices · AIOps · fault localization · self-repair · stage decomposition · incident diagnosis

The pith

The STAR framework improves RCA agent performance by localizing each error to one of four workflow stages and repairing it through targeted replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STAR as a way to make LLM-based root cause analysis agents more reliable in microservice environments by breaking the diagnosis process into four distinct stages rather than treating failures as a single end-to-end problem. These stages are Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report. The framework audits each stage, routes tasks efficiently, identifies the key faulty stage using counterfactual checks, and applies patch-and-replay fixes. If this holds, agents can correct most mistakes quickly and deliver better root cause identification and fault classification on both benchmarks and real production data. A reader would care because it turns fragile agent reasoning into something more debuggable and self-correcting for incident response.

Core claim

STAR decomposes an RCA workflow into four structured stages—Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report—and treats agent failure as a stage-localizable reasoning bug. It performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair, which leads to consistent gains in root cause localization and fault type classification while repairing most incorrect traces in one or two rounds.

What carries the argument

The four-stage decomposition of RCA workflows into Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report, which turns monolithic errors into stage-localizable bugs that can be audited and repaired separately.
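A minimal sketch of how a four-stage trace and its stage-wise audit might be represented. Only the stage names come from the paper; the `Trace` class, the payload fields (`metrics`, `hypotheses`, `ranked`, `root_cause`), and the rule-based critics are hypothetical stand-ins for the LLM-prompted stage critics the paper describes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    EP = "Evidence Package"
    HS = "Hypothesis Set"
    AS = "Analysis Structure"
    DR = "Decision Report"


STAGE_ORDER: List[Stage] = [Stage.EP, Stage.HS, Stage.AS, Stage.DR]


@dataclass
class Trace:
    """One RCA run: the output of each stage (the payload fields are assumed)."""
    outputs: Dict[Stage, dict] = field(default_factory=dict)


# A critic returns a list of issues; an empty list means the stage passed.
Critic = Callable[[Trace], List[str]]


def audit(trace: Trace, critics: Dict[Stage, Critic]) -> Dict[Stage, List[str]]:
    """Run each stage critic in workflow order and collect its findings."""
    return {stage: critics[stage](trace) for stage in STAGE_ORDER}


# Hypothetical rule-based critics standing in for LLM-prompted ones.
def ep_critic(trace: Trace) -> List[str]:
    return [] if trace.outputs[Stage.EP].get("metrics") else ["no metrics collected"]


def hs_critic(trace: Trace) -> List[str]:
    return [] if trace.outputs[Stage.HS].get("hypotheses") else ["no hypotheses proposed"]


def as_critic(trace: Trace) -> List[str]:
    known = set(trace.outputs[Stage.HS].get("hypotheses", []))
    ranked = trace.outputs[Stage.AS].get("ranked", [])
    return [] if set(ranked) <= known else ["analysis ranks an unknown hypothesis"]


def dr_critic(trace: Trace) -> List[str]:
    ranked = trace.outputs[Stage.AS].get("ranked", [])
    root_cause = trace.outputs[Stage.DR].get("root_cause")
    return [] if ranked and root_cause == ranked[0] else ["report disagrees with analysis"]


CRITICS: Dict[Stage, Critic] = {
    Stage.EP: ep_critic, Stage.HS: hs_critic, Stage.AS: as_critic, Stage.DR: dr_critic,
}

trace = Trace({
    Stage.EP: {"metrics": []},                  # evidence collection came back empty
    Stage.HS: {"hypotheses": ["db", "cache"]},
    Stage.AS: {"ranked": ["cache", "db"]},
    Stage.DR: {"root_cause": "cache"},
})
findings = audit(trace, CRITICS)
flagged = [stage.name for stage, issues in findings.items() if issues]
print(flagged)  # ['EP']
```

Note that the later stages pass their own checks even though the evidence is empty, which is the kind of internally consistent but wrongly grounded trace that stage-level auditing is meant to expose.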

If this is right

Root cause localization and fault type classification both improve over strong baselines on large-scale benchmarks and production datasets.
The decisive faulty stage is identified with high accuracy across different agent workflows and foundation models.
Most initially incorrect reasoning traces get repaired within one or two replay rounds.
Fast/Slow Routing and counterfactual stage evaluation each contribute measurable gains to the repair process.
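The page does not spell out how Fast/Slow Routing decides between paths, so the following is only one plausible reading, written as a sketch. It assumes that the audit yields a list of issues per stage, that the budget is counted in replay calls, and that the slow path costs a fixed number of calls per flagged stage. The `route` function, `RouteDecision`, and `slow_cost_per_stage` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

STAGES = ("EP", "HS", "AS", "DR")


@dataclass
class RouteDecision:
    path: str           # "none", "fast", or "slow"
    stages: List[str]   # stages to examine, in workflow order
    cost: int           # replay calls this decision is allowed to spend


def route(findings: Dict[str, List[str]], budget: int,
          slow_cost_per_stage: int = 3) -> RouteDecision:
    """Pick a repair path from audit findings and the remaining replay budget."""
    flagged = [s for s in STAGES if findings.get(s)]
    if not flagged or budget < 1:
        return RouteDecision("none", [], 0)   # nothing to repair, or no budget left
    slow_cost = slow_cost_per_stage * len(flagged)
    if len(flagged) == 1 or slow_cost > budget:
        # Fast path: patch the earliest flagged stage with a single replay.
        return RouteDecision("fast", flagged[:1], 1)
    # Slow path: evaluate counterfactual candidates for every flagged stage.
    return RouteDecision("slow", flagged, slow_cost)


print(route({"HS": ["weak hypothesis"]}, budget=10).path)                  # fast
print(route({"EP": ["gap"], "AS": ["bad ranking"]}, budget=10).path)       # slow
print(route({"EP": ["gap"], "AS": ["bad ranking"]}, budget=4).stages)      # ['EP']
```

The design choice illustrated here is that ambiguity (more than one flagged stage) is what justifies the expensive path, and the budget acts as a hard cap that forces a fallback to the cheap path.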

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stage-wise auditing idea could be tested on other multi-step agent tasks such as automated planning or code repair to check if modularity helps reliability more broadly.
Production monitoring systems might adopt this pattern to shorten the time needed to resolve incidents by making agent decisions easier to inspect and fix.
If the four-stage split proves robust, future work could explore whether adding more granular sub-stages further reduces repair rounds without increasing overhead.

Load-bearing premise

That breaking any RCA workflow into exactly these four stages captures the main sources of error without missing cross-stage interactions or creating new problems during repair.

What would settle it

A test run on new RCA traces where the four-stage model either fails to identify the actual faulty stage with high accuracy or where the repairs do not reduce the overall error rate compared to the original agent.

Figures

Figures reproduced from arXiv: 2605.15581 by Junle Wang, Wenjun Wu, Xingchuang Liao.

**Figure 2.** Main prompt of stage critic C_s. For each stage s, STAR invokes a stage critic C_s (main prompt is shown in …

**Figure 4.** Counterfactual repair iteration distribution when STAR …

**Figure 3.** Comparison Across Microservice Fault Stages and …

Original abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
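To make "decisive stage localization via counterfactual candidate evaluation" and "patch-and-replay" concrete, here is a toy sketch in plain Python rather than LangGraph. It assumes each stage is a function of the previous stage's output only, that candidate replacements are supplied from outside, and that a `score` function can rate the final report (here a toy oracle; in practice this would presumably be a critic or verifier). The `replay` and `localize` helpers and the three-service example are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

STAGES = ["EP", "HS", "AS", "DR"]
StageFn = Callable[[Any], Any]   # maps the previous stage's output to this stage's output


def replay(stage_fns: Dict[str, StageFn], outputs: Dict[str, Any],
           start: str, patched: Any) -> Dict[str, Any]:
    """Swap in `patched` as the output of `start`, then rerun every later stage."""
    i = STAGES.index(start)
    new = dict(outputs)
    new[start] = patched
    for prev, cur in zip(STAGES[i:], STAGES[i + 1:]):
        new[cur] = stage_fns[cur](new[prev])
    return new


def localize(stage_fns: Dict[str, StageFn], outputs: Dict[str, Any],
             candidates: Dict[str, List[Any]],
             score: Callable[[Any], float]) -> Tuple[Optional[str], Dict[str, Any], float]:
    """Return (stage, repaired outputs, score) for the best single-stage counterfactual."""
    best: Tuple[Optional[str], Dict[str, Any], float] = (None, outputs, score(outputs["DR"]))
    for stage in STAGES:
        for cand in candidates.get(stage, []):
            trial = replay(stage_fns, outputs, stage, cand)
            s = score(trial["DR"])
            if s > best[2]:            # strict '>' keeps the earliest stage on ties
                best = (stage, trial, s)
    return best


# Toy pipeline: keep anomalous services, rank them by anomaly score, report the top one.
stage_fns: Dict[str, StageFn] = {
    "HS": lambda ep: [(svc, v) for svc, v in ep.items() if v > 0.5],
    "AS": lambda hs: sorted(hs, key=lambda p: p[1], reverse=True),
    "DR": lambda ranked: ranked[0][0] if ranked else None,
}
# The original trace never collected evidence for "db", so it blames "cache".
outputs = {
    "EP": {"cache": 0.7, "web": 0.6},
    "HS": [("cache", 0.7), ("web", 0.6)],
    "AS": [("cache", 0.7), ("web", 0.6)],
    "DR": "cache",
}
candidates = {
    "EP": [{"cache": 0.7, "web": 0.6, "db": 0.9}],   # re-collected evidence
    "AS": [[("web", 0.6), ("cache", 0.7)]],          # alternative ranking
}
stage, repaired, s = localize(stage_fns, outputs, candidates,
                              score=lambda dr: 1.0 if dr == "db" else 0.0)
print(stage, repaired["DR"], s)  # EP db 1.0
```

In this toy run, only the Evidence Package patch changes the final report for the better, so EP is reported as the decisive stage. A real workflow with shared state would need `replay` to restore that state as well, which this sketch does not model.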

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

STAR shows a workable four-stage repair loop for LLM RCA agents that lifts localization and classification on the tested sets, but the gains rest on assuming errors stay mostly local to one stage.

STAR splits an RCA agent trace into four stages (Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report), then audits them, routes some calls through a slower path when the budget allows, picks the decisive faulty stage with counterfactual checks, and replays a targeted patch. The main practical result is that most bad traces get corrected in one or two rounds, and both root-cause localization and fault-type accuracy improve over the baselines they tried. They run the tests on a public benchmark and a production dataset, using two different agent workflows and three foundation models, and they credit the routing and counterfactual pieces for much of the lift.

The LangGraph implementation is concrete enough that someone could re-use the pattern without starting from scratch. That is the useful engineering contribution here. The four-stage split is a reasonable way to make the reasoning trace more inspectable, and treating failures as stage-local bugs rather than one big end-to-end mistake is a clear shift from most prior agent-debugging work. The evaluation covers both a public benchmark and real production data, which is better than many agent papers that stay only on benchmarks.

The soft spot is the assumption that errors are cleanly attributable to one stage. In a LangGraph setup the stages share state, so a weak Evidence Package can systematically skew the Hypothesis Set and Analysis Structure that come after it. Single-stage patch-and-replay may then only mask the symptom while the upstream bias remains. The abstract does not report a joint multi-stage correction baseline or residual error rates after one-stage repair, so it is not yet clear how much the reported localization accuracy depends on the stages being fairly independent. Minor gaps include the lack of visible statistical significance numbers or variance across runs in the summary.

This paper is for people building or studying LLM agents for microservice incident response. Anyone who needs more debuggable agent traces will find the stage decomposition and replay loop worth trying. It has enough new machinery and real-data results to justify sending it to referees rather than desk-rejecting it. I would recommend peer review, with the main request being a check on cross-stage propagation.

Referee Report

2 major / 3 minor

Summary. The paper proposes STAR, a Stage-attributed Triage and Repair framework for LLM-based root cause analysis (RCA) agents in microservices. It decomposes any RCA workflow into four stages—Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR)—and treats failures as stage-localizable bugs. STAR performs stage-wise auditing on top of LangGraph, applies budget-aware Fast/Slow Routing, uses counterfactual candidate evaluation to localize the decisive faulty stage, and executes stage-specific patch-and-replay repair. Experiments on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models, report consistent gains in root cause localization and fault type classification over strong baselines, high accuracy in stage localization, and that most incorrect traces are repaired in one or two replay rounds.

Significance. If the localization accuracy and repair-efficiency results hold after addressing inter-stage dependencies, the work offers a practical path toward more debuggable and self-repairing agentic systems in AIOps. The explicit modeling of failure location via counterfactual evaluation and the separation of fast/slow routing are concrete engineering contributions that could transfer to other multi-step LLM agent pipelines.

major comments (2)

Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.
§3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.

minor comments (3)

Abstract: the phrases 'strong baselines' and 'high accuracy' should be accompanied by the specific baseline names and quantitative accuracy figures (e.g., 'X% stage-localization accuracy') for immediate readability.
Evaluation section: details on statistical significance testing, number of runs, and potential confounds (e.g., prompt sensitivity, model temperature) are missing; these should be added to support the reported improvements.
Notation: the distinction between 'decisive faulty stage' and 'initially incorrect traces' should be defined more precisely in the method section to avoid ambiguity when discussing repair rounds.
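One way the requested significance check could look, offered as a sketch rather than as the authors' protocol: because the baseline and the repaired agent are evaluated on the same incidents, an exact McNemar test on the paired correct/incorrect outcomes is a natural fit. The function name and the outcome lists below are hypothetical; the lists are placeholder values for illustration, not results from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline_correct: Sequence[bool],
                  repaired_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(baseline_correct) != len(repaired_correct):
        raise ValueError("outcome lists must be paired, one entry per incident")
    pairs = list(zip(baseline_correct, repaired_correct))
    gained = sum(1 for b, r in pairs if not b and r)   # wrong before, right after
    lost = sum(1 for b, r in pairs if b and not r)     # right before, wrong after
    n = gained + lost                                  # only discordant pairs matter
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder outcomes for eight incidents (illustrative only).
baseline = [True, False, False, True, False, False, False, True]
repaired = [True, True, True, True, True, True, False, True]
p_value = mcnemar_exact(baseline, repaired)
print(p_value)  # 0.125
```

With four incidents gained and none lost, the improvement is still not significant at the usual 0.05 level, which illustrates why the number of evaluated incidents and runs needs to be reported alongside the accuracy figures.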

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, providing clarifications on our design choices while acknowledging areas where additional experiments will strengthen the manuscript.

Referee: Abstract and Evaluation section: the central claims that STAR 'identifies the decisive faulty stage with high accuracy' and 'repairs most initially incorrect traces within one or two replay rounds' rest on the assumption that RCA errors are cleanly attributable to a single stage. No measurement of residual error rates after single-stage patch-and-replay or comparison against a joint multi-stage correction baseline is reported, leaving open whether shared LangGraph state allows upstream errors (e.g., in EP) to systematically bias downstream stages (HS, AS) and whether stage-specific repair merely masks symptoms.

Authors: We agree that the current evaluation does not report residual error rates after single-stage repair or include a direct comparison to a joint multi-stage correction baseline. The counterfactual evaluation procedure replaces stage outputs with alternative candidates and measures the resulting change in final diagnosis quality; this isolates the decisive stage even when shared LangGraph state exists. Nevertheless, to address potential upstream bias and symptom masking more explicitly, we will add a new experiment in the revised evaluation section that (i) measures residual errors after single-stage patch-and-replay and (ii) compares against a joint multi-stage repair baseline that corrects all stages simultaneously. These additions will quantify any remaining inter-stage dependencies. revision: yes
Referee: §3 (Framework description): the four-stage decomposition is presented as sufficient for localizing and repairing errors, yet the manuscript provides no explicit test for cross-stage error propagation or new failure modes introduced by the repair loop itself. A controlled experiment that injects isolated stage errors and measures downstream impact after repair would be required to substantiate the load-bearing claim that stage-local repair is adequate.

Authors: The four-stage decomposition follows the structure of typical RCA agent traces observed across the two workflows we evaluate. Stage-wise auditing and counterfactual localization are intended to detect propagation by identifying which single stage correction yields the largest improvement. We acknowledge, however, that the manuscript lacks a controlled injection study. We will add such an experiment to §4 (or a new subsection), in which we synthetically corrupt individual stages (e.g., noisy evidence in EP or malformed hypotheses in HS) while keeping others correct, then measure both downstream propagation and the effectiveness of stage-specific repair. The same setup will be used to check for failure modes introduced by the replay loop itself. revision: yes
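The controlled injection study promised in the responses above could be prototyped along these lines. This is a sketch under strong assumptions: a deterministic toy pipeline in which each stage depends only on the previous stage's output, the clean run treated as ground truth, and an oracle repair that restores one stage's clean output. The `run` and `injection_study` helpers, the corruption functions, and the three-service example are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

STAGES = ["EP", "HS", "AS", "DR"]


def run(stage_fns: Dict[str, Callable[[Any], Any]], ep: Any,
        overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Run EP -> HS -> AS -> DR; `overrides` forces the output of chosen stages."""
    overrides = overrides or {}
    outputs = {"EP": overrides.get("EP", ep)}
    for prev, cur in zip(STAGES, STAGES[1:]):
        outputs[cur] = overrides[cur] if cur in overrides else stage_fns[cur](outputs[prev])
    return outputs


def injection_study(stage_fns, ep, corruptions) -> List[dict]:
    """Corrupt one stage at a time, then repair each stage at or after it in turn."""
    clean = run(stage_fns, ep)            # treated as ground truth for the study
    rows = []
    for injected, corrupt in corruptions.items():
        bad_value = corrupt(clean[injected])
        bad = run(stage_fns, ep, {injected: bad_value})
        for repaired in STAGES[STAGES.index(injected):]:
            # Oracle repair: restore one stage's clean output, keep the injected fault.
            fixed = run(stage_fns, ep, {injected: bad_value, repaired: clean[repaired]})
            rows.append({
                "injected": injected,
                "repaired": repaired,
                "propagated": bad["DR"] != clean["DR"],     # did the fault reach the report?
                "final_ok": fixed["DR"] == clean["DR"],     # does the repaired report match?
                "stale": [s for s in STAGES if fixed[s] != clean[s]],  # stages still wrong
            })
    return rows


stage_fns = {
    "HS": lambda ep: [(svc, v) for svc, v in ep.items() if v > 0.5],
    "AS": lambda hs: sorted(hs, key=lambda p: p[1], reverse=True),
    "DR": lambda ranked: ranked[0][0] if ranked else None,
}
corruptions = {
    "EP": lambda ep: {k: v for k, v in ep.items() if k != "db"},   # drop key evidence
    "AS": lambda ranked: list(reversed(ranked)),                   # invert the ranking
}
rows = injection_study(stage_fns, {"cache": 0.7, "web": 0.6, "db": 0.9}, corruptions)
for row in rows:
    print(row)
```

A row with `final_ok` true but a non-empty `stale` list is the symptom-masking case the referee describes: the final report is right, yet upstream stages still hold corrupted content. In this toy setup that happens whenever the repair is applied downstream of the injected stage.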

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external data

The paper introduces STAR as an engineering framework that decomposes RCA workflows into four stages (Evidence Package, Hypothesis Set, Analysis Structure, Decision Report) and implements stage-wise auditing, Fast/Slow Routing, counterfactual evaluation, and patch-and-replay repair on top of LangGraph. All reported improvements in root cause localization, fault classification, decisive-stage accuracy, and repair rounds are obtained from direct experiments on a public large-scale benchmark and a real-world production dataset using two RCA agent workflows and three foundation models. No equations, fitted parameters, or self-citation chains are used to derive the performance claims; the four-stage decomposition is presented as a design choice whose value is assessed by external evaluation rather than by construction from the framework's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The four-stage decomposition is treated as a modeling choice rather than derived.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR)

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: decisive stage localization via counterfactual candidate evaluation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

