OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices

Dan Pei; Jiamin Jiang; Jingfei Feng; Lei Tao; Qingliang Zhang; Shenglin Zhang; Xidao Wen; Yongqian Sun; Yu Luo

arxiv: 2510.24145 · v3 · submitted 2025-10-28 · 💻 cs.AI

Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

Pith reviewed 2026-05-18 03:39 UTC · model grok-4.3

classification 💻 cs.AI

keywords incident management · multi-agent system · microservices · self-evolving · observability data · automated diagnosis · system reliability

The pith

OpsAgent achieves state-of-the-art incident management in microservices through a self-evolving multi-agent system without requiring task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces OpsAgent to address the challenges of managing incidents in large microservice systems where manual review of vast observability data is impractical. The authors claim that a training-free processor can structure metrics, logs, and traces into text that supports effective multi-agent collaboration for diagnosis. They further propose a dual self-evolution process combining model updates and experience accumulation to enable the system to improve over time. Validation comes from superior benchmark results on OPENRCA and successful use in a production setting at Lenovo. If these elements hold, OpsAgent offers a deployable alternative that is generalizable across systems and sustainable without high ongoing costs.

Core claim

The central discovery is that OpsAgent, by using a training-free data processor to convert heterogeneous observability data into structured textual descriptions and a multi-agent collaboration framework for transparent inference, combined with a dual self-evolution mechanism for internal updates and external experience accumulation, delivers state-of-the-art performance on the OPENRCA benchmark while proving generalizable, interpretable, cost-efficient, and self-evolving in both experiments and real industrial deployment.

What carries the argument

Dual self-evolution mechanism that pairs internal model updates with external experience accumulation within a multi-agent framework supported by training-free data processing.
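
The external half of that mechanism, experience accumulation, can be illustrated with a small sketch. It is not the authors' implementation: the `ExperienceStore` class, its methods, and the sample incidents are hypothetical, and a bag-of-words cosine similarity stands in for whatever retrieval the paper's RAG component actually uses. The internal half (model updates) is not modeled here.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExperienceStore:
    """Hypothetical external memory holding lessons from past incidents."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Counter, str]] = []

    def add(self, symptoms: str, lesson: str) -> None:
        """Accumulate one lesson, indexed by the symptoms that led to it."""
        self.entries.append((tokenize(symptoms), lesson))

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return the k lessons whose symptoms best match the query."""
        q = tokenize(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [lesson for _, lesson in ranked[:k]]


# Made-up incidents, for illustration only.
store = ExperienceStore()
store.add("checkout latency spike redis connection timeout",
          "Check the Redis connection pool limits first.")
store.add("disk usage high on logging node",
          "Rotate logs and inspect retention settings.")
hints = store.retrieve("latency spike with redis timeout errors in checkout")
print(hints)
```

In a setup like this, the retrieved lesson would be injected into an agent's prompt before diagnosis, so the system can improve between model updates.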

Load-bearing premise

The training-free data processor reliably converts heterogeneous observability data into structured textual descriptions that preserve all diagnostic information needed for accurate multi-agent inference.
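
To make the premise concrete, here is a minimal sketch of what a training-free metric-to-text step could look like. The k-sigma rule, the `describe_metric` function, and the sample values are assumptions made for illustration; the page does not specify the paper's actual rules.

```python
from statistics import mean, pstdev
from typing import List, Optional


def describe_metric(name: str, baseline: List[float], window: List[float],
                    k: float = 3.0) -> Optional[str]:
    """Return a one-line description if the window deviates from the
    baseline by more than k standard deviations, otherwise None."""
    mu = mean(baseline)
    threshold = k * pstdev(baseline)
    # Pick the window value farthest from the baseline mean.
    peak = max(window, key=lambda v: abs(v - mu))
    deviation = peak - mu
    if abs(deviation) <= threshold:
        return None  # below threshold: nothing reaches the text
    direction = "above" if deviation > 0 else "below"
    return (f"{name}: peak {peak:.1f} is {abs(deviation):.1f} {direction} "
            f"baseline mean {mu:.1f} (threshold {threshold:.1f})")


# Made-up CPU readings, for illustration only.
baseline = [50.0, 52.0, 48.0, 51.0, 49.0]
print(describe_metric("cpu_usage", baseline, [51.0, 95.0, 90.0]))
print(describe_metric("cpu_usage", baseline, [50.0, 53.0]))
```

The second call returns `None`: under a rule of this kind, a small deviation never appears in the textual description, which is exactly the sort of loss the premise has to rule out.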

What would settle it

Demonstrating a case where key diagnostic information is lost in the textual conversion step, causing the multi-agent system to miss the correct root cause of an incident that would be identifiable from the original metrics, logs, and traces.

Figures

Figures reproduced from arXiv: 2510.24145 by Dan Pei, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Shenglin Zhang, Xidao Wen, Yongqian Sun, Yu Luo.

**Figure 1.** From lightweight LLM to MAS-based IM. OpsAgent turns a lightweight LLM into a deployable and sustainable IM system by incorporating (1) training-free data processor (Section III-B), (2) multi-agent collaboration (Section III-C), and (3) self-evolution mechanism (Section III-D).

**Figure 2.** Training-free Data Processor. The processor handles three types of observability data separately: metrics (left), logs (middle), and traces (right). Unlike DL-based IM methods that require large-scale data to learn feature distributions [3], [8], [12], [18], our data processor adopts a training-free approach.

**Figure 3.** Illustrative example of data descriptions.

**Figure 4.** Multi-agent Collaboration. Agents with predefined roles (via agent profile) cooperate under a structured workflow and cross-review mechanism to enhance reasoning from multiple perspectives. The Root Cause Report not only guides online incident mitigation but also feeds offline training, closing the loop for sustainable capability growth.

**Figure 5.** Self-evolution Mechanism. Internally, agents are fine-tuned via PPO training with a carefully designed reward model (top). Externally, a reflection process distills reusable knowledge into a task-specific knowledge base, which is later leveraged through RAG for knowledge injection (bottom).

**Figure 6.** Mean scores by dimension. Results for OpsAgent on the test set, trained with 60% of incident cases for self-evolution.

**Figure 7.** Score distributions by dimension. Results for OpsAgent on the test set, trained with 60% of incident cases for self-evolution. We assess outputs produced by Qwen2.5-14B-Instruct-1M on the test set, covering 133 incident cases. For each incident case, they independently reviewed the model’s root cause report and assigned 0–5 ratings on four dimensions, with 3 indicating a neutral (neither good nor poor) sco…
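
As a rough sketch of how per-dimension means like those in Figure 6 could be computed from 0–5 ratings: the dimension names and the ratings below are invented for illustration, since the page does not list the paper's four dimensions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_scores(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average 0-5 ratings per dimension across all rated reports."""
    by_dim = defaultdict(list)
    for rating in ratings:
        for dim, score in rating.items():
            if not 0 <= score <= 5:
                raise ValueError(f"score out of range for {dim}: {score}")
            by_dim[dim].append(score)
    return {dim: mean(scores) for dim, scores in by_dim.items()}


# Hypothetical dimension names and made-up ratings.
ratings = [
    {"accuracy": 4, "clarity": 5},
    {"accuracy": 2, "clarity": 3},
]
print(mean_scores(ratings))
```

A mean above 3 on a dimension would sit on the favorable side of the neutral score described in the caption.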

Original abstract

Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpsAgent puts together a training-free processor, multi-agent diagnosis, and dual self-evolution for microservice incidents, but the SOTA claims rest on unverified data conversion fidelity.

OpsAgent builds a multi-agent system that converts metrics, logs, and traces into text without any training, then has agents collaborate on diagnosis while adding internal model updates plus external experience to keep improving. The authors show results on the OPENRCA benchmark and report a production deployment at Lenovo. That combination of pieces is the main new angle for incident management work.

The practical focus stands out: they target real deployment costs, interpretability for on-call engineers, and long-term sustainability without constant retraining. The Lenovo case gives some external grounding that pure benchmark papers often lack. The architecture description is clear enough that someone could sketch an implementation from it.

The soft spot is the training-free processor. Because it is rule-based, any missed temporal links or low-signal anomalies in the raw data stay lost, and the paper does not appear to include a direct audit of how much diagnostic information survives the conversion. Without that check, the performance edge and the interpretability claims are harder to trust at face value. The experiments are described as comprehensive, but the abstract-level summary leaves the exact baselines, effect sizes, and statistical tests unclear, so a referee would need to see the full tables.

This paper is aimed at site-reliability teams and applied researchers who work on cloud monitoring tools. A reader looking for concrete ideas on running multi-agent setups in production would get usable takeaways from the design and the industrial example. It deserves a serious referee because the problem is real, the architecture is explicit, and the deployment evidence is worth checking even if the processor step needs more proof.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpsAgent, a lightweight self-evolving multi-agent system for incident management in large-scale microservice architectures. It consists of a training-free data processor that converts heterogeneous observability data (metrics, logs, traces) into structured textual descriptions, a multi-agent collaboration framework intended to make diagnostic inference transparent and auditable, and a dual self-evolution mechanism combining internal model updates with external experience accumulation. The authors report comprehensive experiments on the OPENRCA benchmark demonstrating state-of-the-art performance together with claims of generalizability, interpretability, cost-efficiency, and self-evolution, plus a production deployment at Lenovo.

Significance. If the empirical claims are substantiated with quantitative evidence, the work could offer a practically relevant advance in automated incident management by addressing generalization across systems, interpretability of decisions, and long-term sustainability through self-evolution. The training-free processor and multi-agent transparency are potentially attractive for industrial adoption where labeled data and high deployment costs are barriers. The real-world Lenovo deployment adds credibility if accompanied by concrete performance indicators.

major comments (3)

[Abstract and §4 (Experiments)] The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.
[§3.1 (Data Processor)] The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.
[§4.2–4.3 (Evaluation and Ablation)] The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. The current description leaves these comparisons unspecified.
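
For reference, the precision/recall/F1 figures requested above can be computed from sets of predicted and true root causes as in the sketch below. The `prf1` helper and the component names are hypothetical, and this is not necessarily how OPENRCA itself scores a diagnosis.

```python
from typing import Set, Tuple


def prf1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for predicted vs. true root causes."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up component names, for illustration only.
predicted = {"redis", "checkout-db", "gateway"}
actual = {"redis", "gateway", "auth", "cart"}
precision, recall, f1 = prf1(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting all three per baseline, rather than a single headline number, would show whether a method gains by predicting fewer but safer root causes or by covering more of them.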

minor comments (2)

[§3.1] Clarify the exact template or rule set used by the data processor to generate textual descriptions; an example input-output pair would improve reproducibility.
[§4] Define the precise metrics used to quantify 'interpretability' and 'cost-efficiency' (e.g., token usage, latency, human audit time).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and substantiation of the empirical claims.

Referee: [Abstract and §4 (Experiments)] The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.

Authors: We agree that the abstract would benefit from explicit quantitative support for the SOTA claim. In the revision we will insert the key OPENRCA F1 score, baseline comparisons, and a brief note on the evaluation protocol. Section 4 already contains the full set of tables with precision/recall/F1, baseline names, and ablation results; we will add statistical significance tests (paired t-tests or Wilcoxon) and a dedicated paragraph clarifying how generalizability (cross-system transfer), interpretability (human audit of agent traces), cost-efficiency (token and latency measurements), and self-evolution (performance delta after experience accumulation) were quantified.
revision: yes

Referee: [§3.1 (Data Processor)] The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.

Authors: We accept that an explicit fidelity audit would strengthen the claim. Although the processor performs a direct, one-to-one mapping of each metric, log line, and span without summarization or filtering, we will add a quantitative evaluation in the revised §3.1 (or an appendix). This will report (i) correlation coefficients between original time-series and reconstructed signals from the textual descriptions, (ii) recall of injected low-amplitude anomalies, and (iii) preservation of known causal links on a sample of OPENRCA incidents.
revision: yes

Referee: [§4.2–4.3 (Evaluation and Ablation)] The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. Current description leaves these comparisons unspecified.

Authors: The manuscript already reports these comparisons in §4.2–4.3, but we agree the presentation can be tightened. We will reorganize the section to include (a) a single consolidated table listing precision, recall, and F1 for OpsAgent and every named baseline on OPENRCA, and (b) a dedicated ablation table (plus accompanying figure) that isolates the data-processor component from the multi-agent collaboration framework, reporting incremental gains when each module is added or removed.
revision: yes
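
One of the audit items promised in the rebuttal, recall of injected low-amplitude anomalies, can be sketched as follows. The sketch assumes, hypothetically, that a k-sigma rule decides which deviations reach the textual description; the function names, threshold, and values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def is_flagged(baseline: List[float], value: float, k: float = 3.0) -> bool:
    """True if the value deviates from the baseline mean by more than k sigma."""
    return abs(value - mean(baseline)) > k * pstdev(baseline)


def injected_anomaly_recall(baseline: List[float], amplitudes: List[float],
                            k: float = 3.0) -> float:
    """Fraction of injected offsets that the rule still surfaces."""
    mu = mean(baseline)
    hits = sum(is_flagged(baseline, mu + amp, k) for amp in amplitudes)
    return hits / len(amplitudes)


# Made-up baseline and injected offsets, for illustration only.
baseline = [50.0, 52.0, 48.0, 51.0, 49.0]
amplitudes = [2.0, 4.0, 6.0, 20.0]
print(injected_anomaly_recall(baseline, amplitudes))
```

Sweeping the injected amplitudes downward would locate the point at which anomalies stop surviving the conversion, which is the quantity the referee's fidelity concern turns on.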

Circularity Check

0 steps flagged

No circularity: claims rest on empirical benchmark results and system description

The paper presents OpsAgent as a multi-agent system with a training-free data processor and dual self-evolution mechanism, but advances no mathematical derivations, equations, or fitted parameters that reduce to their own inputs. Central claims of SOTA performance, generalizability, interpretability, and real-world deployability are supported by experiments on the OPENRCA benchmark and a production deployment at Lenovo. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core results; the architecture is described directly and evaluated externally. This is a standard empirical systems paper whose validity hinges on observable performance metrics rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified assumption that heterogeneous observability data can be losslessly turned into text and that multi-agent collaboration plus self-evolution will produce reliable, generalizable diagnoses. No free parameters or invented physical entities are mentioned.

axioms (2)

domain assumption: Heterogeneous observability data from microservices can be converted into structured textual descriptions without model training while preserving diagnostic value.
This is the foundation of the training-free data processor described in the abstract.

domain assumption: Multi-agent collaboration produces transparent and auditable diagnostic inferences superior to single-model approaches.
Invoked to justify the multi-agent framework for interpretability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: training-free data processor ... converts heterogeneous observability data into structured textual descriptions

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: dual self-evolution mechanism that integrates internal model updates with external experience accumulation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
cs.AI · 2026-05 · unverdicted · novelty 7.0
SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
cs.AI · 2026-05 · unverdicted · novelty 5.0
SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.

Position: agentic AI orchestration should be Bayes-consistent
cs.AI · 2026-05 · unverdicted · novelty 4.0
Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

Openrca: Can large language models locate the root cause of software failures?

J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang, “Openrca: Can large language models locate the root cause of software failures?” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[2]

Automatic root cause analysis via large language models for cloud incidents,

Y . Chen, H. Xie, M. Ma, Y . Kang, X. Gao, L. Shi, Y . Cao, X. Gao, H. Fan, M. Wenet al., “Automatic root cause analysis via large language models for cloud incidents,” inProceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 674–688

work page 2024
[3]

Art: A unified unsupervised framework for incident management in mi- croservice systems,

Y . Sun, B. Shi, M. Mao, M. Ma, S. Xia, S. Zhang, and D. Pei, “Art: A unified unsupervised framework for incident management in mi- croservice systems,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1183–1194

work page 2024
[4]

Incident report of google cloud outage,

“Incident report of google cloud outage,” https://status.cloud.google. com/incidents/ow5i3PPK96RduMcb1SsW, 2025

work page 2025
[5]

Failure diagnosis in microservice systems: A comprehensive survey and analysis,

S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y . Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,”ACM Transactions on Software Engineering and Methodology, 2024

work page 2024
[6]

Automap: Diagnose your microservice-based web applications automatically,

M. Ma, J. Xu, Y . Wang, P. Chen, Z. Zhang, and P. Wang, “Automap: Diagnose your microservice-based web applications automatically,” in Proceedings of The Web Conference 2020, 2020, pp. 246–258

work page 2020
[7]

Localizing failure root causes in a microservice through causality inference,

Y . Meng, S. Zhang, Y . Sun, R. Zhang, Z. Hu, Y . Zhang, C. Jia, Z. Wang, and D. Pei, “Localizing failure root causes in a microservice through causality inference,” in2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 2020, pp. 1–10

work page 2020
[8]

Eadro: An end- to-end troubleshooting framework for microservices on multi-source data,

C. Lee, T. Yang, Z. Chen, Y . Su, and M. R. Lyu, “Eadro: An end- to-end troubleshooting framework for microservices on multi-source data,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1750–1762

work page 2023
[9]

Giving every modality a voice in microservice failure diagnosis via multimodal adaptive optimization,

L. Tao, S. Zhang, Z. Jia, J. Sun, M. Ma, Z. Li, Y . Sun, C. Yang, Y . Zhang, and D. Pei, “Giving every modality a voice in microservice failure diagnosis via multimodal adaptive optimization,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1107–1119

work page 2024
[10]

Mulan: multi-modal causal structure learning and root cause analysis for microservice systems,

L. Zheng, Z. Chen, J. He, and H. Chen, “Mulan: multi-modal causal structure learning and root cause analysis for microservice systems,” in Proceedings of the ACM Web Conference 2024, 2024, pp. 4107–4116

work page 2024
[11]

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,

G. Yu, P. Chen, Y . Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 553–565

work page 2023
[12]

Robust failure diagnosis of microservice system through multimodal data,

S. Zhang, P. Jin, Z. Lin, Y . Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jinet al., “Robust failure diagnosis of microservice system through multimodal data,”IEEE Transactions on Services Computing, vol. 16, no. 6, pp. 3851–3864, 2023

work page 2023
[13]

Recommending root-cause and mitigation steps for cloud incidents using large language models,

T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1737–1749

work page 2023
[14]

Automated root causing of cloud incidents using in-context learning with gpt-4,

X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y . Kang, and S. Ra- jmohan, “Automated root causing of cloud incidents using in-context learning with gpt-4,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 266–277

work page 2024
[15]

Large language models can provide accurate and interpretable incident triage,

Z. Wang, J. Li, M. Ma, Z. Li, Y . Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Linet al., “Large language models can provide accurate and interpretable incident triage,” in2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, pp. 523–534

work page 2024
[16]

The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,

Y . Han, Q. Du, Y . Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943

work page 2024
[17]

mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,

W. Zhang, H. Guo, J. Yang, Z. Tian, Y . Zhang, Y . Chaoran, Z. Li, T. Li, X. Shi, L. Zhenget al., “mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 4017–4033

work page 2024
[18]

Interpretable failure localization for microservice systems based on graph autoencoder,

Y . Sun, Z. Lin, B. Shi, S. Zhang, S. Ma, P. Jin, Z. Zhong, L. Pan, Y . Guo, and D. Pei, “Interpretable failure localization for microservice systems based on graph autoencoder,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

work page 2025
[19]

Mapcoder: Multi-agent code generation for competitive problem solving,

M. A. Islam, M. E. Ali, and M. R. Parvez, “Mapcoder: Multi-agent code generation for competitive problem solving,” inAnnual Meeting of the Association of Computational Linguistics 2024. Association for Computational Linguistics (ACL), 2024, pp. 4912–4944

work page 2024
[20]

Codes: Natural language to code repository via multi-layer sketch,

D. Zan, A. Yu, W. Liu, D. Chen, B. Shen, W. Li, Y . Yao, Y . Gong, X. Chen, B. Guanet al., “Codes: Natural language to code repository via multi-layer sketch,”CoRR, 2024

work page 2024
[21]

A pair programming framework for code generation via multi-plan exploration and feedback- driven refinement,

H. Zhang, W. Cheng, Y . Wu, and W. Hu, “A pair programming framework for code generation via multi-plan exploration and feedback- driven refinement,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1319–1331

work page 2024
[22]

Axnav: Replaying accessibility tests from natural language,

M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y . Jiang, and J. Nichols, “Axnav: Replaying accessibility tests from natural language,” inPro- ceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–16

work page 2024
[23]

Large language model-powered smart contract vulnerability detection: New perspec- 11 tives,

S. Hu, T. Huang, F. ˙Ilhan, S. F. Tekin, and L. Liu, “Large language model-powered smart contract vulnerability detection: New perspec- 11 tives,” in2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA). IEEE, 2023, pp. 297–306

work page 2023
[24]

arXiv preprint arXiv:2405.03256 , year =

D. Jin, Z. Jin, X. Chen, and C. Wang, “Mare: Multi-agents col- laboration framework for requirements engineering,”arXiv preprint arXiv:2405.03256, 2024

work page arXiv 2024
[25]

Elicitron: A large language model agent-based simulation framework for design requirements elicitation,

M. Ataei, H. Cheong, D. Grandi, Y . Wang, N. Morris, and A. Tessier, “Elicitron: A large language model agent-based simulation framework for design requirements elicitation,”Journal of Computing and Infor- mation Science in Engineering, vol. 25, no. 2, p. 021012, 2025

work page 2025
[26]

D-bot: Database diagnosis system using large language models,

X. Zhou, G. Li, Z. Sun, Z. Liu, W. Chen, J. Wu, J. Liu, R. Feng, and G. Zeng, “D-bot: Database diagnosis system using large language models,”Proceedings of the VLDB Endowment, vol. 17, no. 10, pp. 2514–2527, 2024

work page 2024
[27]

Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,

C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Liet al., “Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

work page 2025
[28]

Microservices: yesterday, today, and tomor- row,

N. Dragoni, S. Giallorenzo, A. L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, and L. Safina, “Microservices: yesterday, today, and tomor- row,”Present and ulterior software engineering, pp. 195–216, 2017

work page 2017
[29]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Retrieval-augmented generation for large language models: A survey,

Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, Q. Guo, M. Wanget al., “Retrieval-augmented generation for large language models: A survey,”CoRR, 2023

work page 2023
[32]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, “From local to global: A graph rag approach to query-focused summarization,”arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Memorag: Boosting long context processing with global memory- enhanced retrieval augmentation,

H. Qian, Z. Liu, P. Zhang, K. Mao, D. Lian, Z. Dou, and T. Huang, “Memorag: Boosting long context processing with global memory- enhanced retrieval augmentation,” inProceedings of the ACM on Web Conference 2025, 2025, pp. 2366–2377

work page 2025
[34]

Hy- bridrag: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction,

B. Sarmah, D. Mehta, B. Hall, R. Rao, S. Patel, and S. Pasquali, “Hy- bridrag: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction,” inProceedings of the 5th ACM International Conference on AI in Finance, 2024, pp. 608– 616

work page 2024
[35] C. Wu, N. Zhao, L. Wang, X. Yang, S. Li, M. Zhang, X. Jin, X. Wen, X. Nie, W. Zhang et al., “Identifying root-cause metrics for incident diagnosis in online service systems,” in 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021, pp. 91–102.
[36] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE International Conference on Web Services (ICWS). IEEE, 2017, pp. 33–40.
[37] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[38] Y. Sun, Y. Luo, X. Wen, Y. Yuan, X. Nie, S. Zhang, T. Liu, and X. Luo, “Trioxpert: An automated incident management framework for microservice system,” arXiv preprint arXiv:2506.10043, 2025.
[39] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[40] S. T. A. C. J. M. J. Y. Jennifer Mace, Jelena Oertel, “SRE book, chapter 9: Incident response,” https://sre.google/workbook/incident-response/
[41] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: NLG evaluation using GPT-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 2511–2522.
[42] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
[43] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[44] D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring LLM-based agents for root cause analysis,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 208–219.
[45] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
[46] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
[47] D. Wang, Z. Chen, Y. Fu, Y. Liu, and H. Chen, “Incremental causal graph learning for online root cause analysis,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2269–2278.
[48] Z. Li, N. Zhao, M. Li, X. Lu, L. Wang, D. Chang, X. Nie, L. Cao, W. Zhang, K. Sui et al., “Actionable and interpretable fault localization for recurring failures in online service systems,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 996–1008.
[49] P. Wang, J. Xu, M. Ma, W. Lin, D. Pan, Y. Wang, and P. Chen, “Cloudranger: Root cause identification for cloud native systems,” in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2018, pp. 492–502.
[50] M. Ma, W. Lin, D. Pan, and P. Wang, “Ms-rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications,” in 2019 IEEE International Conference on Web Services (ICWS). IEEE, 2019, pp. 60–67.
[51] X. Li, P. Chen, L. Jing, Z. He, and G. Yu, “Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,” IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 2762–2780, 2022.
[52] Y. Sui, Y. Zhang, J. Sun, T. Xu, S. Zhang, Z. Li, Y. Sun, F. Guo, J. Shen, Y. Zhang et al., “Logkg: Log failure diagnosis through knowledge graph,” IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3493–3507, 2023.
[53] Y. Xie, K. Yang, and P. Luo, “Logm: Log analysis for multiple components of Hadoop platform,” IEEE Access, vol. 9, pp. 73522–73532, 2021.
[54] X. Zhang, Y. Xu, S. Qin, S. He, B. Qiao, Z. Li, H. Zhang, X. Li, Y. Dang, Q. Lin et al., “Onion: identifying incident-indicating logs for cloud systems,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1253–1263.
[55] X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He, “Latent error prediction and fault localization for microservice applications by learning from system trace logs,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 683–694.
[56] G. Yu, Z. Huang, and P. Chen, “Tracerank: Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems,” Journal of Software: Evolution and Process, vol. 35, no. 10, p. e2413, 2023.
[57] Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y. Wu, L. Jiang, L. Yan, Z. Wang et al., “Practical root cause localization for microservice systems via trace analysis,” in 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQoS). IEEE, 2021, pp. 1–10.
[58] J. Yang, Y. Guo, Y. Chen, and Y. Zhao, “Tracenet: Operation aware root cause localization of microservice system anomalies,” in 2023 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2023, pp. 758–763.

[1] J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang, “Openrca: Can large language models locate the root cause of software failures?” in The Thirteenth International Conference on Learning Representations, 2025.

[2] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen et al., “Automatic root cause analysis via large language models for cloud incidents,” in Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 674–688.

[3] Y. Sun, B. Shi, M. Mao, M. Ma, S. Xia, S. Zhang, and D. Pei, “Art: A unified unsupervised framework for incident management in microservice systems,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1183–1194.

[4] “Incident report of Google Cloud outage,” https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW, 2025.

[5] S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y. Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,” ACM Transactions on Software Engineering and Methodology, 2024.

[6] M. Ma, J. Xu, Y. Wang, P. Chen, Z. Zhang, and P. Wang, “Automap: Diagnose your microservice-based web applications automatically,” in Proceedings of The Web Conference 2020, 2020, pp. 246–258.

[7] Y. Meng, S. Zhang, Y. Sun, R. Zhang, Z. Hu, Y. Zhang, C. Jia, Z. Wang, and D. Pei, “Localizing failure root causes in a microservice through causality inference,” in 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 2020, pp. 1–10.

[8] C. Lee, T. Yang, Z. Chen, Y. Su, and M. R. Lyu, “Eadro: An end-to-end troubleshooting framework for microservices on multi-source data,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1750–1762.

[9] L. Tao, S. Zhang, Z. Jia, J. Sun, M. Ma, Z. Li, Y. Sun, C. Yang, Y. Zhang, and D. Pei, “Giving every modality a voice in microservice failure diagnosis via multimodal adaptive optimization,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1107–1119.

[10] L. Zheng, Z. Chen, J. He, and H. Chen, “Mulan: multi-modal causal structure learning and root cause analysis for microservice systems,” in Proceedings of the ACM Web Conference 2024, 2024, pp. 4107–4116.

[11] G. Yu, P. Chen, Y. Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 553–565.

[12] S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jin et al., “Robust failure diagnosis of microservice system through multimodal data,” IEEE Transactions on Services Computing, vol. 16, no. 6, pp. 3851–3864, 2023.

[13] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1737–1749.

[14] X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y. Kang, and S. Rajmohan, “Automated root causing of cloud incidents using in-context learning with GPT-4,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 266–277.

[15] Z. Wang, J. Li, M. Ma, Z. Li, Y. Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Lin et al., “Large language models can provide accurate and interpretable incident triage,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, pp. 523–534.

[16] Y. Han, Q. Du, Y. Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943.

[17] W. Zhang, H. Guo, J. Yang, Z. Tian, Y. Zhang, Y. Chaoran, Z. Li, T. Li, X. Shi, L. Zheng et al., “mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 4017–4033.

[18] Y. Sun, Z. Lin, B. Shi, S. Zhang, S. Ma, P. Jin, Z. Zhong, L. Pan, Y. Guo, and D. Pei, “Interpretable failure localization for microservice systems based on graph autoencoder,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025.

[19] M. A. Islam, M. E. Ali, and M. R. Parvez, “Mapcoder: Multi-agent code generation for competitive problem solving,” in Annual Meeting of the Association of Computational Linguistics 2024. Association for Computational Linguistics (ACL), 2024, pp. 4912–4944.

[20] D. Zan, A. Yu, W. Liu, D. Chen, B. Shen, W. Li, Y. Yao, Y. Gong, X. Chen, B. Guan et al., “Codes: Natural language to code repository via multi-layer sketch,” CoRR, 2024.

[21] H. Zhang, W. Cheng, Y. Wu, and W. Hu, “A pair programming framework for code generation via multi-plan exploration and feedback-driven refinement,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1319–1331.

[22] M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y. Jiang, and J. Nichols, “Axnav: Replaying accessibility tests from natural language,” in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–16.

[23] S. Hu, T. Huang, F. İlhan, S. F. Tekin, and L. Liu, “Large language model-powered smart contract vulnerability detection: New perspectives,” in 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA). IEEE, 2023, pp. 297–306.

[24] D. Jin, Z. Jin, X. Chen, and C. Wang, “Mare: Multi-agents collaboration framework for requirements engineering,” arXiv preprint arXiv:2405.03256, 2024.

[25] M. Ataei, H. Cheong, D. Grandi, Y. Wang, N. Morris, and A. Tessier, “Elicitron: A large language model agent-based simulation framework for design requirements elicitation,” Journal of Computing and Information Science in Engineering, vol. 25, no. 2, p. 021012, 2025.

[26] X. Zhou, G. Li, Z. Sun, Z. Liu, W. Chen, J. Wu, J. Liu, R. Feng, and G. Zeng, “D-bot: Database diagnosis system using large language models,” Proceedings of the VLDB Endowment, vol. 17, no. 10, pp. 2514–2527, 2024.

[27] C. Pei, Z. Wang, F. Liu, Z. Li, Y. Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li et al., “Flow-of-action: SOP enhanced LLM-based multi-agent system for root cause analysis,” in Companion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431.

[28] N. Dragoni, S. Giallorenzo, A. L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, and L. Safina, “Microservices: yesterday, today, and tomorrow,” Present and Ulterior Software Engineering, pp. 195–216, 2017.

[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
