arxiv: 2510.24145 · v3 · submitted 2025-10-28 · 💻 cs.AI

OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices

Pith reviewed 2026-05-18 03:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords incident management · multi-agent system · microservices · self-evolving · observability data · automated diagnosis · system reliability

The pith

OpsAgent achieves state-of-the-art incident management in microservices through a self-evolving multi-agent system without requiring task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces OpsAgent to address the challenges of managing incidents in large microservice systems where manual review of vast observability data is impractical. The authors claim that a training-free processor can structure metrics, logs, and traces into text that supports effective multi-agent collaboration for diagnosis. They further propose a dual self-evolution process combining model updates and experience accumulation to enable the system to improve over time. Validation comes from superior benchmark results on OPENRCA and successful use in a production setting at Lenovo. If these elements hold, OpsAgent offers a deployable alternative that is generalizable across systems and sustainable without high ongoing costs.

Core claim

The central claim is that OpsAgent delivers state-of-the-art performance on the OPENRCA benchmark by combining three components: a training-free data processor that converts heterogeneous observability data into structured textual descriptions, a multi-agent collaboration framework for transparent inference, and a dual self-evolution mechanism for internal updates and external experience accumulation. The authors further claim that the system is generalizable, interpretable, cost-efficient, and self-evolving in both experiments and real industrial deployment.

What carries the argument

Dual self-evolution mechanism that pairs internal model updates with external experience accumulation within a multi-agent framework supported by training-free data processing.
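
As a rough illustration of the external half of this mechanism, the sketch below stores lessons from past incidents and retrieves the closest one for a new symptom. Everything in it is hypothetical: the `ExperienceStore` class, its method names, and the token-overlap (Jaccard) ranking are stand-ins chosen for brevity, not the retrieval method used in the paper, which this page does not specify.

```python
class ExperienceStore:
    """Toy experience base: (symptom tokens, lesson) pairs ranked by Jaccard overlap.

    Hypothetical stand-in for a knowledge base queried through RAG.
    """

    def __init__(self):
        self._entries = []

    def add(self, symptom, lesson):
        # Store the symptom as a lowercase token set alongside the lesson text.
        self._entries.append((set(symptom.lower().split()), lesson))

    def retrieve(self, query, k=1):
        q = set(query.lower().split())

        def jaccard(tokens):
            union = q | tokens
            return len(q & tokens) / len(union) if union else 0.0

        # Highest-overlap entries first; ties keep insertion order.
        ranked = sorted(self._entries, key=lambda e: jaccard(e[0]), reverse=True)
        return [lesson for _, lesson in ranked[:k]]


store = ExperienceStore()
store.add("high cpu on cart service", "check cart autoscaling limits")
store.add("connection timeout to database", "inspect db connection pool")
print(store.retrieve("database connection timeout spikes"))
```

A production system would more plausibly rank by embedding similarity; token overlap is used here only so the sketch runs without external dependencies.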

Load-bearing premise

The training-free data processor reliably converts heterogeneous observability data into structured textual descriptions that preserve all diagnostic information needed for accurate multi-agent inference.

What would settle it

Demonstrating a case where key diagnostic information is lost in the textual conversion step, causing the multi-agent system to miss the correct root cause of an incident that would be identifiable from the original metrics, logs, and traces.
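
A minimal sketch of how such a loss could arise, assuming a hypothetical z-score rule for turning a metric series into text. The `describe_metric` function, the threshold of 3.0, and the sample values are illustrative assumptions, not the paper's processor.

```python
from statistics import mean, pstdev


def describe_metric(name, values, baseline_len, z_threshold=3.0):
    """Render a metric series as one sentence using a z-score rule (hypothetical)."""
    baseline = values[:baseline_len]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1e-9  # guard against a perfectly flat baseline
    # Flag samples after the baseline window whose deviation exceeds the threshold.
    flagged = [
        i
        for i, v in enumerate(values[baseline_len:], start=baseline_len)
        if abs(v - mu) / sigma > z_threshold
    ]
    if not flagged:
        return f"{name}: no anomaly (baseline mean {mu:.2f})"
    return f"{name}: anomaly at samples {flagged[0]}-{flagged[-1]} (baseline mean {mu:.2f})"


baseline = [100, 102, 98, 101, 99, 100]

# A large spike survives the conversion to text.
print(describe_metric("cpu_usage", baseline + [150], baseline_len=6))

# A sustained +3 shift stays under the threshold and is described as normal,
# so any downstream reader of the text never sees it.
print(describe_metric("cpu_usage", baseline + [103, 103, 103, 103], baseline_len=6))
```

Under these assumptions the second call reports no anomaly even though the series has shifted, which is the kind of case the test above would need to look for.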

Figures

Figures reproduced from arXiv: 2510.24145 by Dan Pei, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Shenglin Zhang, Xidao Wen, Yongqian Sun, Yu Luo.

Figure 1: From lightweight LLM to MAS-based IM. OpsAgent turns a lightweight LLM into a deployable and sustainable IM system by incorporating (1) training-free data processor (Section III-B), (2) multi-agent collaboration (Section III-C), and (3) self-evolution mechanism (Section III-D).
Figure 2: Training-free Data Processor. The processor handles three types of observability data separately: metrics (left), logs (middle), and traces (right). Unlike DL-based IM methods that require large-scale data to learn feature distributions [3], [8], [12], [18], our data processor adopts a training-free approach.
Figure 3: Illustrative example of data descriptions.
Figure 4: Multi-agent Collaboration. Agents with predefined roles (via agent profile) cooperate under a structured workflow and cross-review mechanism to enhance reasoning from multiple perspectives. The Root Cause Report not only guides online incident mitigation but also feeds offline training, closing the loop for sustainable capability growth.
Figure 5: Self-evolution Mechanism. Internally, agents are fine-tuned via PPO training with a carefully designed reward model (top). Externally, a reflection process distills reusable knowledge into a task-specific knowledge base, which is later leveraged through RAG for knowledge injection (bottom).
Figure 6: Mean scores by dimension. Results for OpsAgent on the test set, trained with 60% of incident cases for self-evolution.
Figure 7: Score distributions by dimension. Results for OpsAgent on the test set, trained with 60% of incident cases for self-evolution. We assess outputs produced by Qwen2.5-14B-Instruct-1M on the test set, covering 133 incident cases. For each incident case, they independently reviewed the model’s root cause report and assigned 0–5 ratings on four dimensions, with 3 indicating a neutral (neither good nor poor) sco…
Original abstract

Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpsAgent, a lightweight self-evolving multi-agent system for incident management in large-scale microservice architectures. It consists of a training-free data processor that converts heterogeneous observability data (metrics, logs, traces) into structured textual descriptions, a multi-agent collaboration framework intended to make diagnostic inference transparent and auditable, and a dual self-evolution mechanism combining internal model updates with external experience accumulation. The authors report comprehensive experiments on the OPENRCA benchmark demonstrating state-of-the-art performance together with claims of generalizability, interpretability, cost-efficiency, and self-evolution, plus a production deployment at Lenovo.

Significance. If the empirical claims are substantiated with quantitative evidence, the work could offer a practically relevant advance in automated incident management by addressing generalization across systems, interpretability of decisions, and long-term sustainability through self-evolution. The training-free processor and multi-agent transparency are potentially attractive for industrial adoption where labeled data and high deployment costs are barriers. The real-world Lenovo deployment adds credibility if accompanied by concrete performance indicators.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.
  2. [§3.1] §3.1 (Data Processor): The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.
  3. [§4.2–4.3] §4.2–4.3 (Evaluation and Ablation): The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. Current description leaves these comparisons unspecified.
minor comments (2)
  1. [§3.1] Clarify the exact template or rule set used by the data processor to generate textual descriptions; an example input-output pair would improve reproducibility.
  2. [§4] Define the precise metrics used to quantify 'interpretability' and 'cost-efficiency' (e.g., token usage, latency, human audit time).
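
The precision/recall/F1 comparison requested in major comment 3 could be computed along the following lines. This is a sketch under stated assumptions: each incident is reduced to a set of predicted root-cause components and a set of ground-truth components, and scores are micro-averaged across incidents. The `micro_prf` name and the component labels are hypothetical, and OPENRCA's official scoring may differ.

```python
def micro_prf(predictions, ground_truth):
    """Micro-averaged precision, recall, and F1 over per-incident label sets."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, ground_truth):
        tp += len(pred & truth)   # correctly named root causes
        fp += len(pred - truth)   # named but wrong
        fn += len(truth - pred)   # true root causes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three illustrative incidents (labels are made up for the example).
preds = [{"cart"}, {"db", "cache"}, set()]
truth = [{"cart"}, {"db"}, {"auth"}]
p, r, f1 = micro_prf(preds, truth)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Running the same function once per baseline and once per ablated variant would give the consolidated table the referee asks for.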

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and substantiation of the empirical claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the SOTA claim. In the revision we will insert the key OPENRCA F1 score, baseline comparisons, and a brief note on the evaluation protocol. Section 4 already contains the full set of tables with precision/recall/F1, baseline names, and ablation results; we will add statistical significance tests (paired t-tests or Wilcoxon) and a dedicated paragraph clarifying how generalizability (cross-system transfer), interpretability (human audit of agent traces), cost-efficiency (token and latency measurements), and self-evolution (performance delta after experience accumulation) were quantified. revision: yes

  2. Referee: [§3.1] §3.1 (Data Processor): The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.

    Authors: We accept that an explicit fidelity audit would strengthen the claim. Although the processor performs a direct, one-to-one mapping of each metric, log line, and span without summarization or filtering, we will add a quantitative evaluation in the revised §3.1 (or an appendix). This will report (i) correlation coefficients between original time-series and reconstructed signals from the textual descriptions, (ii) recall of injected low-amplitude anomalies, and (iii) preservation of known causal links on a sample of OPENRCA incidents. revision: yes

  3. Referee: [§4.2–4.3] §4.2–4.3 (Evaluation and Ablation): The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. Current description leaves these comparisons unspecified.

    Authors: The manuscript already reports these comparisons in §4.2–4.3, but we agree the presentation can be tightened. We will reorganize the section to include (a) a single consolidated table listing precision, recall, and F1 for OpsAgent and every named baseline on OPENRCA, and (b) a dedicated ablation table (plus accompanying figure) that isolates the data-processor component from the multi-agent collaboration framework, reporting incremental gains when each module is added or removed. revision: yes
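
Two of the fidelity measures promised in response 2 can be sketched as follows, assuming the original series and a series reconstructed from the textual description are available as equal-length lists, and that injected and recovered anomalies are identified by sample index. The function names and sample data are hypothetical, and the rebuttal does not say how reconstruction would be done.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)


def anomaly_recall(injected, recovered):
    """Fraction of injected anomaly indices that also appear in the recovered set."""
    injected, recovered = set(injected), set(recovered)
    if not injected:
        return 1.0
    return len(injected & recovered) / len(injected)


original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [2.0, 4.0, 6.0, 8.0]  # same shape, different scale
print(pearson(original, reconstructed))
print(anomaly_recall(injected=[3, 7, 9], recovered=[7, 9, 12]))
```

A high correlation alone would not show that low-amplitude anomalies survive, which is why the recall measure is reported separately.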

Circularity Check

0 steps flagged

No circularity: claims rest on empirical benchmark results and system description

full rationale

The paper presents OpsAgent as a multi-agent system with a training-free data processor and dual self-evolution mechanism, but advances no mathematical derivations, equations, or fitted parameters that reduce to their own inputs. Central claims of SOTA performance, generalizability, interpretability, and real-world deployability are supported by experiments on the OPENRCA benchmark and a production deployment at Lenovo. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core results; the architecture is described directly and evaluated externally. This is a standard empirical systems paper whose validity hinges on observable performance metrics rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified assumption that heterogeneous observability data can be losslessly turned into text and that multi-agent collaboration plus self-evolution will produce reliable, generalizable diagnoses. No free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption: Heterogeneous observability data from microservices can be converted into structured textual descriptions without model training while preserving diagnostic value.
    This is the foundation of the training-free data processor described in the abstract.
  • domain assumption: Multi-agent collaboration produces transparent and auditable diagnostic inferences superior to single-model approaches.
    Invoked to justify the multi-agent framework for interpretability.

pith-pipeline@v0.9.0 · 5761 in / 1415 out tokens · 29449 ms · 2026-05-18T03:39:07.888827+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.

  2. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 5.0

    SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.

  3. Position: agentic AI orchestration should be Bayes-consistent

    cs.AI 2026-05 unverdicted novelty 4.0

    Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    Openrca: Can large language models locate the root cause of software failures?

    J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang, “Openrca: Can large language models locate the root cause of software failures?” in The Thirteenth International Conference on Learning Representations, 2025

  2. [2]

    Automatic root cause analysis via large language models for cloud incidents

    Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen et al., “Automatic root cause analysis via large language models for cloud incidents,” in Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 674–688

  3. [3]

    Art: A unified unsupervised framework for incident management in microservice systems

    Y. Sun, B. Shi, M. Mao, M. Ma, S. Xia, S. Zhang, and D. Pei, “Art: A unified unsupervised framework for incident management in microservice systems,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1183–1194

  4. [4]

    Incident report of google cloud outage

    “Incident report of google cloud outage,” https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW, 2025

  5. [5]

    Failure diagnosis in microservice systems: A comprehensive survey and analysis

    S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y. Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,” ACM Transactions on Software Engineering and Methodology, 2024

  6. [6]

    Automap: Diagnose your microservice-based web applications automatically

    M. Ma, J. Xu, Y. Wang, P. Chen, Z. Zhang, and P. Wang, “Automap: Diagnose your microservice-based web applications automatically,” in Proceedings of The Web Conference 2020, 2020, pp. 246–258

  7. [7]

    Localizing failure root causes in a microservice through causality inference

    Y. Meng, S. Zhang, Y. Sun, R. Zhang, Z. Hu, Y. Zhang, C. Jia, Z. Wang, and D. Pei, “Localizing failure root causes in a microservice through causality inference,” in 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, 2020, pp. 1–10

  8. [8]

    Eadro: An end-to-end troubleshooting framework for microservices on multi-source data

    C. Lee, T. Yang, Z. Chen, Y. Su, and M. R. Lyu, “Eadro: An end-to-end troubleshooting framework for microservices on multi-source data,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1750–1762

  9. [9]

    Giving every modality a voice in microservice failure diagnosis via multimodal adaptive optimization

    L. Tao, S. Zhang, Z. Jia, J. Sun, M. Ma, Z. Li, Y. Sun, C. Yang, Y. Zhang, and D. Pei, “Giving every modality a voice in microservice failure diagnosis via multimodal adaptive optimization,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1107–1119

  10. [10]

    Mulan: multi-modal causal structure learning and root cause analysis for microservice systems

    L. Zheng, Z. Chen, J. He, and H. Chen, “Mulan: multi-modal causal structure learning and root cause analysis for microservice systems,” in Proceedings of the ACM Web Conference 2024, 2024, pp. 4107–4116

  11. [11]

    Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

    G. Yu, P. Chen, Y. Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 553–565

  12. [12]

    Robust failure diagnosis of microservice system through multimodal data

    S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jin et al., “Robust failure diagnosis of microservice system through multimodal data,” IEEE Transactions on Services Computing, vol. 16, no. 6, pp. 3851–3864, 2023

  13. [13]

    Recommending root-cause and mitigation steps for cloud incidents using large language models

    T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 1737–1749

  14. [14]

    Automated root causing of cloud incidents using in-context learning with gpt-4

    X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y. Kang, and S. Rajmohan, “Automated root causing of cloud incidents using in-context learning with gpt-4,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 266–277

  15. [15]

    Large language models can provide accurate and interpretable incident triage

    Z. Wang, J. Li, M. Ma, Z. Li, Y. Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Lin et al., “Large language models can provide accurate and interpretable incident triage,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2024, pp. 523–534

  16. [16]

    The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier

    Y. Han, Q. Du, Y. Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 931–943

  17. [17]

    mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture

    W. Zhang, H. Guo, J. Yang, Z. Tian, Y. Zhang, Y. Chaoran, Z. Li, T. Li, X. Shi, L. Zheng et al., “mabc: Multi-agent blockchain-inspired collaboration for root cause analysis in micro-services architecture,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 4017–4033

  18. [18]

    Interpretable failure localization for microservice systems based on graph autoencoder

    Y. Sun, Z. Lin, B. Shi, S. Zhang, S. Ma, P. Jin, Z. Zhong, L. Pan, Y. Guo, and D. Pei, “Interpretable failure localization for microservice systems based on graph autoencoder,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–28, 2025

  19. [19]

    Mapcoder: Multi-agent code generation for competitive problem solving

    M. A. Islam, M. E. Ali, and M. R. Parvez, “Mapcoder: Multi-agent code generation for competitive problem solving,” in Annual Meeting of the Association of Computational Linguistics 2024. Association for Computational Linguistics (ACL), 2024, pp. 4912–4944

  20. [20]

    Codes: Natural language to code repository via multi-layer sketch

    D. Zan, A. Yu, W. Liu, D. Chen, B. Shen, W. Li, Y. Yao, Y. Gong, X. Chen, B. Guan et al., “Codes: Natural language to code repository via multi-layer sketch,” CoRR, 2024

  21. [21]

    A pair programming framework for code generation via multi-plan exploration and feedback-driven refinement

    H. Zhang, W. Cheng, Y. Wu, and W. Hu, “A pair programming framework for code generation via multi-plan exploration and feedback-driven refinement,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 1319–1331

  22. [22]

    Axnav: Replaying accessibility tests from natural language

    M. Taeb, A. Swearngin, E. Schoop, R. Cheng, Y. Jiang, and J. Nichols, “Axnav: Replaying accessibility tests from natural language,” in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–16

  23. [23]

    Large language model-powered smart contract vulnerability detection: New perspectives

    S. Hu, T. Huang, F. İlhan, S. F. Tekin, and L. Liu, “Large language model-powered smart contract vulnerability detection: New perspectives,” in 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA). IEEE, 2023, pp. 297–306

  24. [24]

    Mare: Multi-agents collaboration framework for requirements engineering

    D. Jin, Z. Jin, X. Chen, and C. Wang, “Mare: Multi-agents collaboration framework for requirements engineering,” arXiv preprint arXiv:2405.03256, 2024

  25. [25]

    Elicitron: A large language model agent-based simulation framework for design requirements elicitation

    M. Ataei, H. Cheong, D. Grandi, Y. Wang, N. Morris, and A. Tessier, “Elicitron: A large language model agent-based simulation framework for design requirements elicitation,” Journal of Computing and Information Science in Engineering, vol. 25, no. 2, p. 021012, 2025

  26. [26]

    D-bot: Database diagnosis system using large language models

    X. Zhou, G. Li, Z. Sun, Z. Liu, W. Chen, J. Wu, J. Liu, R. Feng, and G. Zeng, “D-bot: Database diagnosis system using large language models,” Proceedings of the VLDB Endowment, vol. 17, no. 10, pp. 2514–2527, 2024

  27. [27]

    Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis

    C. Pei, Z. Wang, F. Liu, Z. Li, Y. Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li et al., “Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis,” in Companion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

  28. [28]

    Microservices: yesterday, today, and tomorrow

    N. Dragoni, S. Giallorenzo, A. L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, and L. Safina, “Microservices: yesterday, today, and tomorrow,” Present and ulterior software engineering, pp. 195–216, 2017

  29. [29]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  30. [30]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  31. [31]

    Retrieval-augmented generation for large language models: A survey

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang et al., “Retrieval-augmented generation for large language models: A survey,” CoRR, 2023

  32. [32]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, “From local to global: A graph rag approach to query-focused summarization,” arXiv preprint arXiv:2404.16130, 2024

  33. [33]

    Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation

    H. Qian, Z. Liu, P. Zhang, K. Mao, D. Lian, Z. Dou, and T. Huang, “Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation,” in Proceedings of the ACM on Web Conference 2025, 2025, pp. 2366–2377

  34. [34]

    Hybridrag: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction

    B. Sarmah, D. Mehta, B. Hall, R. Rao, S. Patel, and S. Pasquali, “Hybridrag: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction,” in Proceedings of the 5th ACM International Conference on AI in Finance, 2024, pp. 608–616

  35. [35]

    Identifying root-cause metrics for incident diagnosis in online service systems

    C. Wu, N. Zhao, L. Wang, X. Yang, S. Li, M. Zhang, X. Jin, X. Wen, X. Nie, W. Zhang et al., “Identifying root-cause metrics for incident diagnosis in online service systems,” in 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021, pp. 91–102

  36. [36]

    Drain: An online log parsing approach with fixed depth tree

    P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE international conference on web services (ICWS). IEEE, 2017, pp. 33–40

  37. [37]

    Term-weighting approaches in automatic text retrieval

    G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988

  38. [38]

    Trioxpert: An automated incident management framework for microservice system

    Y. Sun, Y. Luo, X. Wen, Y. Yuan, X. Nie, S. Zhang, T. Liu, and X. Luo, “Trioxpert: An automated incident management framework for microservice system,” arXiv preprint arXiv:2506.10043, 2025

  39. [39]

    Chain-of-thought prompting elicits reasoning in large language models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

  40. [40]

    Sre book, chapter 9: Incident response

    S. T. A. C. J. M. J. Y. Jennifer Mace, Jelena Oertel, “Sre book, chapter 9: Incident response,” https://sre.google/workbook/incident-response/

  41. [41]

    G-eval: Nlg evaluation using gpt-4 with better human alignment

    Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: Nlg evaluation using gpt-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 2511–2522

  42. [42]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. [Online]. Available: http://arxiv.org/abs/1908.10084

  43. [43]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  44. [44]

    Exploring llm-based agents for root cause analysis

    D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring llm-based agents for root cause analysis,” in Companion proceedings of the 32nd ACM international conference on the foundations of software engineering, 2024, pp. 208–219

  45. [45]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023

  46. [46]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  47. [47]

    Incremental causal graph learning for online root cause analysis,

    D. Wang, Z. Chen, Y. Fu, Y. Liu, and H. Chen, “Incremental causal graph learning for online root cause analysis,” in Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2023, pp. 2269–2278

  48. [48]

    Actionable and interpretable fault localization for recurring failures in online service systems,

    Z. Li, N. Zhao, M. Li, X. Lu, L. Wang, D. Chang, X. Nie, L. Cao, W. Zhang, K. Sui et al., “Actionable and interpretable fault localization for recurring failures in online service systems,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 996–1008

  49. [49]

    Cloudranger: Root cause identification for cloud native systems,

    P. Wang, J. Xu, M. Ma, W. Lin, D. Pan, Y. Wang, and P. Chen, “Cloudranger: Root cause identification for cloud native systems,” in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 2018, pp. 492–502

  50. [50]

    Ms-rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications,

    M. Ma, W. Lin, D. Pan, and P. Wang, “Ms-rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications,” in 2019 IEEE International Conference on Web Services (ICWS). IEEE, 2019, pp. 60–67

  51. [51]

    Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,

    X. Li, P. Chen, L. Jing, Z. He, and G. Yu, “Swisslog: Robust anomaly detection and localization for interleaved unstructured logs,” IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 2762–2780, 2022

  52. [52]

    Logkg: Log failure diagnosis through knowledge graph,

    Y. Sui, Y. Zhang, J. Sun, T. Xu, S. Zhang, Z. Li, Y. Sun, F. Guo, J. Shen, Y. Zhang et al., “Logkg: Log failure diagnosis through knowledge graph,” IEEE Transactions on Services Computing, vol. 16, no. 5, pp. 3493–3507, 2023

  53. [53]

    Logm: Log analysis for multiple components of hadoop platform,

    Y. Xie, K. Yang, and P. Luo, “Logm: Log analysis for multiple components of hadoop platform,” IEEE Access, vol. 9, pp. 73522–73532, 2021

  54. [54]

    Onion: identifying incident-indicating logs for cloud systems,

    X. Zhang, Y. Xu, S. Qin, S. He, B. Qiao, Z. Li, H. Zhang, X. Li, Y. Dang, Q. Lin et al., “Onion: identifying incident-indicating logs for cloud systems,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1253–1263

  55. [55]

    Latent error prediction and fault localization for microservice applications by learning from system trace logs,

    X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He, “Latent error prediction and fault localization for microservice applications by learning from system trace logs,” in Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, 2019, pp. 683–694

  56. [56]

    Tracerank: Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems,

    G. Yu, Z. Huang, and P. Chen, “Tracerank: Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems,” Journal of Software: Evolution and Process, vol. 35, no. 10, p. e2413, 2023

  57. [57]

    Practical root cause localization for microservice systems via trace analysis,

    Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y. Wu, L. Jiang, L. Yan, Z. Wang et al., “Practical root cause localization for microservice systems via trace analysis,” in 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 2021, pp. 1–10

  58. [58]

    Tracenet: Operation aware root cause localization of microservice system anomalies,

    J. Yang, Y. Guo, Y. Chen, and Y. Zhao, “Tracenet: Operation aware root cause localization of microservice system anomalies,” in 2023 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2023, pp. 758–763.