OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices
Pith reviewed 2026-05-18 03:39 UTC · model grok-4.3
The pith
OpsAgent achieves state-of-the-art incident management in microservices through a self-evolving multi-agent system without requiring task-specific training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that OpsAgent delivers state-of-the-art performance on the OPENRCA benchmark by combining three components: a training-free data processor that converts heterogeneous observability data into structured textual descriptions, a multi-agent collaboration framework for transparent inference, and a dual self-evolution mechanism for internal updates and external experience accumulation. The paper further reports that the system is generalizable, interpretable, cost-efficient, and self-evolving in both experiments and real industrial deployment.
What carries the argument
A dual self-evolution mechanism that pairs internal model updates with external experience accumulation, operating within a multi-agent framework supported by training-free data processing.
Load-bearing premise
The training-free data processor reliably converts heterogeneous observability data into structured textual descriptions that preserve all diagnostic information needed for accurate multi-agent inference.
What would settle it
A case in which the textual conversion step loses key diagnostic information, causing the multi-agent system to miss a root cause that is identifiable from the original metrics, logs, and traces.
Original abstract
Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpsAgent, a lightweight self-evolving multi-agent system for incident management in large-scale microservice architectures. It consists of a training-free data processor that converts heterogeneous observability data (metrics, logs, traces) into structured textual descriptions, a multi-agent collaboration framework intended to make diagnostic inference transparent and auditable, and a dual self-evolution mechanism combining internal model updates with external experience accumulation. The authors report comprehensive experiments on the OPENRCA benchmark demonstrating state-of-the-art performance together with claims of generalizability, interpretability, cost-efficiency, and self-evolution, plus a production deployment at Lenovo.
Significance. If the empirical claims are substantiated with quantitative evidence, the work could offer a practically relevant advance in automated incident management by addressing generalization across systems, interpretability of decisions, and long-term sustainability through self-evolution. The training-free processor and multi-agent transparency are potentially attractive for industrial adoption where labeled data and high deployment costs are barriers. The real-world Lenovo deployment adds credibility if accompanied by concrete performance indicators.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.
- [§3.1] §3.1 (Data Processor): The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.
- [§4.2–4.3] §4.2–4.3 (Evaluation and Ablation): The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. The current description leaves these comparisons unspecified.
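To make the requested precision/recall/F1 comparison concrete, the following is a minimal sketch of how micro-averaged scores could be computed over a set of incidents. It assumes each incident has a set of predicted and a set of ground-truth root-cause labels; the function name `micro_prf` and the label strings are hypothetical and do not come from the paper or from the OPENRCA tooling.

```python
from typing import Iterable, List, Tuple


def micro_prf(
    pairs: Iterable[Tuple[List[str], List[str]]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over incidents.

    Each element of `pairs` is (predicted_labels, true_labels) for one
    incident. Labels are treated as sets, so duplicates are ignored.
    This is an illustrative scoring rule, not the benchmark's official one.
    """
    tp = fp = fn = 0
    for predicted, actual in pairs:
        pred, true = set(predicted), set(actual)
        tp += len(pred & true)   # predicted and correct
        fp += len(pred - true)   # predicted but wrong
        fn += len(true - pred)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical incidents: (predicted root causes, true root causes).
    incidents = [
        (["cart"], ["cart"]),
        (["db", "cache"], ["db"]),
        (["auth"], ["gateway"]),
    ]
    print(micro_prf(incidents))
```

Running the same function once per system (OpsAgent, each baseline, each ablated variant) would yield the rows of the comparison table the comment asks for.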
minor comments (2)
- [§3.1] Clarify the exact template or rule set used by the data processor to generate textual descriptions; an example input-output pair would improve reproducibility.
- [§4] Define the precise metrics used to quantify 'interpretability' and 'cost-efficiency' (e.g., token usage, latency, human audit time).
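As an illustration of the kind of input-output pair the first minor comment asks for, here is a minimal sketch of a rule-based, training-free conversion from a metric window to a textual description. Everything in it is an assumption made for illustration: the function name `describe_metric`, the k-sigma rule, the output template, and the service and metric names are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def describe_metric(
    service: str,
    metric: str,
    baseline: List[float],
    window: List[float],
    k: float = 3.0,
) -> str:
    """Render one metric window as a single structured text line.

    The window is flagged as anomalous when its most extreme value deviates
    from the baseline mean by more than k baseline standard deviations.
    Both the rule and the template are hypothetical.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    # Value in the window that lies farthest from the baseline mean.
    peak = max(window, key=lambda v: abs(v - mu))
    if abs(peak - mu) > k * sigma:
        direction = "above" if peak > mu else "below"
        status = f"anomalous ({direction} baseline)"
    else:
        status = "normal"
    return (
        f"[metric] service={service} name={metric} "
        f"baseline_mean={mu:.2f} peak={peak:.2f} status={status}"
    )


if __name__ == "__main__":
    # Hypothetical input: latency samples in milliseconds.
    baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
    window = [101.0, 180.0, 175.0]
    print(describe_metric("checkout", "latency_ms", baseline, window))
```

A sketch like this also shows where information can be lost: only the peak survives in the text, so the shape and timing of the deviation would have to be carried by additional fields.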
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and substantiation of the empirical claims.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The abstract asserts SOTA performance on OPENRCA plus generalizability, interpretability, cost-efficiency, and self-evolution, yet supplies no quantitative metrics, baseline comparisons, statistical significance tests, or details on how these properties were measured. Without these, the central empirical claim cannot be evaluated.
Authors: We agree that the abstract would benefit from explicit quantitative support for the SOTA claim. In the revision we will insert the key OPENRCA F1 score, baseline comparisons, and a brief note on the evaluation protocol. Section 4 already contains the full set of tables with precision/recall/F1, baseline names, and ablation results; we will add statistical significance tests (paired t-tests or Wilcoxon) and a dedicated paragraph clarifying how generalizability (cross-system transfer), interpretability (human audit of agent traces), cost-efficiency (token and latency measurements), and self-evolution (performance delta after experience accumulation) were quantified. revision: yes
-
Referee: [§3.1] §3.1 (Data Processor): The training-free, rule-based processor is presented as reliably converting heterogeneous observability data into structured text while preserving all diagnostic information. No quantitative fidelity audit (e.g., preservation of temporal correlations, low-amplitude anomalies, or cross-signal causal links) is reported; any information loss is irrecoverable downstream and directly affects the validity of all subsequent performance and interpretability claims.
Authors: We accept that an explicit fidelity audit would strengthen the claim. Although the processor performs a direct, one-to-one mapping of each metric, log line, and span without summarization or filtering, we will add a quantitative evaluation in the revised §3.1 (or an appendix). This will report (i) correlation coefficients between original time-series and reconstructed signals from the textual descriptions, (ii) recall of injected low-amplitude anomalies, and (iii) preservation of known causal links on a sample of OPENRCA incidents. revision: yes
-
Referee: [§4.2–4.3] §4.2–4.3 (Evaluation and Ablation): The manuscript must provide explicit tables or figures showing precision/recall/F1 against named baselines on OPENRCA, together with ablation results isolating the contribution of the data processor versus the multi-agent framework. Current description leaves these comparisons unspecified.
Authors: The manuscript already reports these comparisons in §4.2–4.3, but we agree the presentation can be tightened. We will reorganize the section to include (a) a single consolidated table listing precision, recall, and F1 for OpsAgent and every named baseline on OPENRCA, and (b) a dedicated ablation table (plus accompanying figure) that isolates the data-processor component from the multi-agent collaboration framework, reporting incremental gains when each module is added or removed. revision: yes
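The fidelity audit proposed in the response to the §3.1 comment could be sketched roughly as follows. The sketch covers two of the three proposed measurements: the correlation between an original time series and the signal reconstructed from its textual description, and the recall of injected low-amplitude anomalies. The function names, the index-based representation of anomalies, and the sample values are assumptions for illustration, not the authors' evaluation code.

```python
from math import sqrt
from typing import Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between an original and a reconstructed series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def anomaly_recall(injected: Set[int], mentioned: Set[int]) -> float:
    """Fraction of injected anomaly positions that the text still mentions.

    `injected` holds the time indices where anomalies were inserted and
    `mentioned` holds the indices recoverable from the textual description.
    """
    if not injected:
        raise ValueError("at least one injected anomaly is required")
    return len(injected & mentioned) / len(injected)


if __name__ == "__main__":
    # Hypothetical original samples and values parsed back from the text.
    original = [1.0, 2.0, 3.0, 4.0]
    reconstructed = [1.1, 2.1, 3.1, 4.1]
    print(pearson(original, reconstructed))
    # Hypothetical injected anomalies versus those the description reports.
    print(anomaly_recall({3, 7, 12}, {3, 12, 20}))
```

A correlation near 1 together with a low anomaly recall would indicate that the conversion preserves overall trends while dropping exactly the low-amplitude signals the referee is concerned about.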
Circularity Check
No circularity: claims rest on empirical benchmark results and system description
The paper presents OpsAgent as a multi-agent system with a training-free data processor and dual self-evolution mechanism, but advances no mathematical derivations, equations, or fitted parameters that reduce to their own inputs. Central claims of SOTA performance, generalizability, interpretability, and real-world deployability are supported by experiments on the OPENRCA benchmark and a production deployment at Lenovo. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core results; the architecture is described directly and evaluated externally. This is a standard empirical systems paper whose validity hinges on observable performance metrics rather than any definitional or self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Heterogeneous observability data from microservices can be converted into structured textual descriptions without model training while preserving diagnostic value.
- domain assumption: Multi-agent collaboration produces transparent and auditable diagnostic inferences superior to single-model approaches.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "training-free data processor ... converts heterogeneous observability data into structured textual descriptions"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "dual self-evolution mechanism that integrates internal model updates with external experience accumulation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
  SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.
- SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
  SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.
- Position: agentic AI orchestration should be Bayes-consistent
  Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.