Pith · machine review for the scientific record

arxiv: 2604.26805 · v2 · submitted 2026-04-29 · 💻 cs.AI · cs.MA

Recognition: 2 Lean theorem links

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords LLM agents · agentic framework · system operations · skill arrangement · root cause analysis · online systems · self-evolving · alert management

The pith

Bian Que enables LLM agents to handle online system operations by using flexible Skill Arrangement to precisely select relevant data and knowledge for each event.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to demonstrate that LLM-based agents can be made practical for operating and maintaining large-scale online systems, such as search engines, by solving the key problem of orchestration. Instead of feeding all data and knowledge indiscriminately, which causes dilution and hallucination, or manually mapping every event to its inputs, which is intractable under dozens of daily releases, the framework provides a way to define and arrange Skills that match the right information to each situation. This matters because manual effort in monitoring, responding to alerts, and analyzing root causes is a major burden, and reliable automation could free up engineers significantly. The approach combines abstracting operations into standard patterns, auto-generating and optimizing Skills, and a self-evolving mechanism driven by corrections.

Core claim

Bian Que abstracts routine O&M actions into three patterns—release interception, proactive inspection, and alert root cause analysis—and introduces the flexible Skill Arrangement where each Skill specifies the exact data and knowledge needed for a context. These Skills are generated and updated automatically by LLM agents and can be refined by engineers through natural language, with a self-evolving mechanism that uses correction signals to distill knowledge and refine Skills further.
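The self-evolving mechanism described above can be pictured as a small update loop. The sketch below is an illustration under assumed interfaces, not the paper's code: the names `CorrectionSignal`, `Skill`, and `knowledge_base` are hypothetical. Each correction drives the two parallel pathways the paper names, distilling a lesson into shared knowledge and patching the Skill that produced the error.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the dual-pathway self-evolution. All names are
# illustrative assumptions, not Bian Que's actual interfaces.

@dataclass
class Skill:
    name: str
    data_sources: list                       # e.g. ["p99_latency", "error_logs"]
    knowledge: list = field(default_factory=list)

@dataclass
class CorrectionSignal:
    event_id: str
    wrong_diagnosis: str
    corrected_diagnosis: str
    skill_used: str

def evolve(signal: CorrectionSignal, skills: dict, knowledge_base: list):
    """Each correction signal enables two parallel evolutionary pathways."""
    lesson = (f"event {signal.event_id}: prefer '{signal.corrected_diagnosis}' "
              f"over '{signal.wrong_diagnosis}'")
    # Pathway 1: distill the corrected event from memory into knowledge.
    knowledge_base.append(lesson)
    # Pathway 2: targeted refinement of the Skill that produced the error.
    skills[signal.skill_used].knowledge.append(lesson)
```

In this sketch both pathways consume the same correction, which is what makes the mechanism "unified"; how the paper actually encodes lessons is not shown in the abstract.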

What carries the argument

The flexible Skill Arrangement: each predefined Skill explicitly defines the requisite data and operational knowledge for its specific context, allowing automatic generation and updates by agents and iterative optimization via natural-language instructions.
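As a rough illustration of what such an arrangement could look like, the sketch below models a Skill as a bundle of event pattern, data sources, and knowledge, with a selector that returns only the matching subset rather than every signal. The field names, example Skills, and the exact-match rule are all assumptions for illustration; the paper's actual Skill format is not reproduced here.

```python
from dataclasses import dataclass

# Illustrative only: a Skill bundles the exact data and knowledge for one
# operational context. Names and matching logic are assumed, not the paper's.

@dataclass(frozen=True)
class Skill:
    name: str
    event_pattern: str        # which operational events this Skill covers
    data_sources: tuple       # exact signals to fetch (metrics/logs/changes)
    knowledge: tuple          # handbook rules / distilled experience

SKILL_POOL = [
    Skill("latency_rca", "alert:latency",
          ("p99_latency", "recent_changes"),
          ("check releases in the last 2h first",)),
    Skill("release_check", "release:search",
          ("diff_metrics",),
          ("block release if error rate doubles",)),
]

def arrange(event_type: str):
    """Return only the Skills matching this event, so the LLM reasons over
    a precise subset instead of all signals (avoiding dilution)."""
    return [s for s in SKILL_POOL if s.event_pattern == event_type]
```

A production system would presumably use a richer matcher than string equality, but the point of the abstraction is the same: the event-to-(data, knowledge) mapping lives in the Skill, not in the prompt.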

Load-bearing premise

LLM agents using the flexible Skill Arrangement can reliably select the precise data and knowledge for each event without dilution or hallucination, while Skills can be automatically generated and iteratively optimized with minimal ongoing human curation.

What would settle it

A sustained live deployment in which root-cause analysis accuracy drops below 80 percent or alert reduction falls short of 50 percent would indicate the framework does not deliver as claimed.

Figures

Figures reproduced from arXiv: 2604.26805 by Ben Chen, Bochao Liu, Chenyi Lei, Hongen Wan, Junpeng Zhuang, Shuo Yang, Xiao Liang, Xinyuan Jiang, Yang Zhao, Yao Wu, Yufei Ma, Zhipeng Qian, Zihan Liang.

Figure 1. Overview of the BIAN QUE architecture. Operational events from the OPS platform (top) are dispatched to a matching Agent, which invokes one or more matched Skills to assemble the relevant data (system signals: logs, metrics, change events) and knowledge (domain knowledge distilled from case memory, seeded by operational handbooks) for the LLM to reason over; the resulting diagnosis is returned to the OPS p…
Figure 2. Agent Matrix and Skill Pool. Each Agent (top) implements one canonical pattern; Skills …
Figure 3. Flexible Skill lifecycle. New Skills are generated from seed configurations with validation …
Figure 4. Day-level alert-analysis accuracy on pro…
Original abstract

Operating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Bian Que, an agentic LLM framework for online system O&M that abstracts operations into three patterns (release interception, proactive inspection, alert root cause analysis). Its core is the flexible Skill Arrangement mechanism, in which Skills explicitly bundle relevant data (metrics/logs/changes) and knowledge (handbooks/practitioner experience) for each event; Skills are LLM-generated, LLM-updated, and iteratively refined via natural-language instructions from engineers. A unified self-evolving loop distills corrections into memory and Skill updates. Deployed on KuaiShou’s e-commerce search engine, the system is reported to reduce alert volume by 75%, achieve 80% RCA accuracy, cut MTTR by >50%, and reach 99.0% offline pass rate. Code is released at https://github.com/benchen4395/BianQue_Assistant.

Significance. If the deployment claims can be substantiated with transparent methodology, baselines, and error analysis, the work would provide concrete evidence that LLM agents can be orchestrated reliably enough for production O&M at scale, directly addressing the data/knowledge selection bottleneck that currently limits such systems. The open-sourced code is a clear strength for reproducibility. The flexible Skill design and self-evolution loop are conceptually appealing and could generalize beyond the reported deployment.

major comments (3)
  1. [Abstract and §4] Deployment Results: The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. Because the central claim is that the framework itself produces these gains, the absence of this information is load-bearing and prevents attribution to the Skill Arrangement rather than other factors.
  2. [§3.2] Flexible Skill Arrangement: The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period. This directly bears on the weakest assumption: that the agent reliably maps events to the precise data/knowledge subset without dilution or heavy curation.
  3. [§4.3] Offline Evaluation: The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines. This leaves the offline validation of the Skill mechanism and self-evolution loop unanchored and weakens the supporting evidence for the deployment claims.
minor comments (2)
  1. [Abstract] The abstract is unusually dense with performance numbers; moving the quantitative claims to a dedicated results paragraph or table would improve readability.
  2. [§3] Notation for “Skill” is introduced without an explicit formal definition or pseudocode; a small diagram or boxed definition in §3 would clarify the interface between LLM generation and engineer NL optimization.
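One concrete way to supply the error bars requested in major comment 1 is a bootstrap confidence interval over per-incident RCA outcomes. The sketch below uses synthetic outcomes (80 correct out of 100) purely to illustrate the computation; it is not the paper's data, and the incident counts are assumptions.

```python
import random

# Bootstrap CI for a reported accuracy figure, from per-incident 0/1 outcomes.
# The outcome list is synthetic (80 correct of 100), for illustration only.

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample incidents with replacement, recompute accuracy each time.
    stats = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

outcomes = [1] * 80 + [0] * 20          # 100 incidents, 80 diagnosed correctly
low, high = bootstrap_ci(outcomes)      # roughly 0.72–0.88 at this sample size
```

At 100 incidents the interval around 80% is wide (on the order of ±8 points), which is exactly why the referee's request for the sample size and window matters: the same headline number is far more informative over thousands of incidents than over dozens.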

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's recognition of the potential of the flexible Skill design and self-evolution loop for production O&M. We address each major comment below and will incorporate revisions to improve transparency and substantiation of the results.

Point-by-point responses
  1. Referee: [Abstract and §4] The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. This prevents attribution to the Skill Arrangement.

    Authors: We agree that additional context is needed to strengthen attribution of the gains to the framework. In the revised manuscript, we will expand §4 (and update the abstract accordingly) to describe the pre-deployment baseline system, the data collection window (pre- and post-deployment periods), the precise definition of alert volume and counting methodology, and any available statistical measures such as variance in MTTR where computable from logs. While certain production details remain subject to internal confidentiality, we will provide sufficient information to allow readers to evaluate the role of Skill Arrangement versus other factors. revision: yes

  2. Referee: [§3.2] The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period.

    Authors: This point is well-taken, as quantitative indicators of Skill reliability would better support the claims. We will revise §3.2 to include deployment-derived statistics: the volume and frequency of LLM-generated Skill updates, observed rates of selection errors or hallucinations (mitigated by the self-evolving loop), and the number and nature of human interventions via natural-language instructions. These will be summarized in a new table or paragraph based on production logs, quantifying the human effort and the loop's effectiveness without revealing proprietary details. revision: yes

  3. Referee: [§4.3] The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines.

    Authors: We acknowledge the need for more detail to anchor the offline results. In the revised §4.3, we will specify the test-set size, the exact definition of a 'pass' (e.g., successful task completion per pattern criteria), the distribution of evaluated tasks across release interception, proactive inspection, and alert RCA, and direct comparisons against non-agent baselines such as direct LLM prompting without Skills or rule-based approaches. This will better substantiate the offline validation of the Skill mechanism and self-evolution. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical deployment results independent of internal definitions or self-citations

full rationale

The paper describes an agentic O&M framework (unified paradigm, flexible Skill Arrangement, self-evolving mechanism) and reports concrete deployment metrics from KuaiShou (75% alert reduction, 80% RCA accuracy, >50% MTTR reduction, 99% offline pass rate). No equations, fitted parameters, or 'predictions' appear that reduce by construction to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claims rest on observed system outcomes rather than any derivation chain that loops back to its own definitions or fitted data. This is a standard engineering/deployment paper with externally measurable results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the premise that LLM orchestration via Skills avoids hallucination and that self-evolution improves performance without introducing new instabilities.

axioms (1)
  • domain assumption LLM agents can precisely select relevant data and knowledge for each operational event when guided by predefined Skills.
    This assumption underpins the claim that the framework avoids dilution and hallucination.
invented entities (1)
  • Skill no independent evidence
    purpose: A predefined package that explicitly defines the requisite data and operational knowledge for a specific operational context.
    Core new abstraction introduced to enable flexible arrangement and automatic updates.

pith-pipeline@v0.9.0 · 5656 in / 1320 out tokens · 45296 ms · 2026-05-12T01:46:11.170687+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 9 internal anchors

  1. [1]

    Openclaw

    OpenClaw. Openclaw. https://github.com/openclaw/openclaw, 2026. Open-source personal AI assistant, version 2026.3.8, accessed 2026-03-09

  2. [2]

    Claude code overview

    Anthropic. Claude code overview. https://code.claude.com/docs/en/overview, 2026. Official documentation, accessed 2026-03-10

  3. [3]

    Harness engineering: leveraging codex in an agent-first world

    OpenAI. Harness engineering: leveraging codex in an agent-first world. Engineering blog, 2026. URL https://openai.com/index/harness-engineering/. Published: 2026-02-. Accessed: 2026-03-13
  6. [6]

    Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

    Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4966–4974, 2024

  7. [7]

    A survey of aiops for failure management in the era of large language models.arXiv preprint arXiv:2406.11213, 2024

    Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S Yu, and Ying Li. A survey of aiops for failure management in the era of large language models.arXiv preprint arXiv:2406.11213, 2024

  8. [8]

    OneSearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search

    Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, et al. Onesearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search. arXiv preprint arXiv:2509.03236, 2025

  9. [9]

    Onesearch-v2: The latent reasoning enhanced self- distillation generative search framework.arXiv preprint arXiv:2603.24422, 2026

    Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, et al. Onesearch-v2: The latent reasoning enhanced self- distillation generative search framework.arXiv preprint arXiv:2603.24422, 2026

  10. [10]

    Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion

    Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, and Han Li. Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 1788–1797, 2025

  11. [11]

    CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

    Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, and Xiaoshuai Sun. Csmcir: Cot-enhanced symmetric alignment with memory bank for composed image retrieval.arXiv preprint arXiv:2601.03728, 2026

  12. [12]

    OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

    Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. Onerec: Unifying retrieve and rank with generative recommender and iterative preference alignment.arXiv preprint arXiv:2502.18965, 2025

  13. [13]

    A survey of aiops methods for failure management.ACM Transactions on Intelligent Systems and Technology (TIST), 12(6):1–45, 2021

    Paolo Notaro, Jorge Cardoso, and Michael Gerndt. A survey of aiops methods for failure management.ACM Transactions on Intelligent Systems and Technology (TIST), 12(6):1–45, 2021

  14. [14]

    Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009, 2025

    Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009, 2025

  15. [15]

    Stalled, biased, and confused: Uncovering reasoning failures in llms for cloud-based root cause analysis.arXiv preprint arXiv:2601.22208, 2026

    Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, and Krzysztof Czarnecki. Stalled, biased, and confused: Uncovering reasoning failures in llms for cloud-based root cause analysis. arXiv preprint arXiv:2601.22208, 2026

  16. [16]

    Empowering aiops: Leveraging large language models for it operations management.arXiv preprint arXiv:2501.12461, 2025

    Arthur Vitui and Tse-Hsun Chen. Empowering aiops: Leveraging large language models for it operations management.arXiv preprint arXiv:2501.12461, 2025

  17. [17]

    Anomaly detection in univariate time-series: A survey on the state-of-the-art.arXiv preprint arXiv:2004.00433, 2020

    Mohammad Braei and Sebastian Wagner. Anomaly detection in univariate time-series: A survey on the state-of-the-art.arXiv preprint arXiv:2004.00433, 2020

  18. [18]

    Drain: An online log parsing approach with fixed depth tree

    Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International Conference on Web Services (ICWS), pages 33–40. IEEE, 2017

  19. [19]

    Predicting node failure in cloud service systems

    Qingwei Lin, Ken Hsieh, Yingnong Dang, Hongyu Zhang, Kaixin Sui, Yong Xu, Jian-Guang Lou, Chenggang Li, Youjiang Wu, Randolph Yao, et al. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 480–490, 2018

  20. [20]

    Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey.ACM Computing Surveys (CSUR), 55(3):1–39, 2022

    Jacopo Soldani and Antonio Brogi. Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey.ACM Computing Surveys (CSUR), 55(3):1–39, 2022

  21. [21]

    Llm-based event log analysis techniques: A survey.arXiv preprint arXiv:2502.00677, 2025

    Siraaj Akhtar, Saad Khan, and Simon Parkinson. Llm-based event log analysis techniques: A survey.arXiv preprint arXiv:2502.00677, 2025

  22. [22]

    Retrieval augmented generation-based incident resolution recommendation system for it support.arXiv preprint arXiv:2409.13707, 2024

    Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz, Hans Florian, and Salim Roukos. Retrieval augmented generation-based incident resolution recommendation system for it support.arXiv preprint arXiv:2409.13707, 2024

  23. [23]

    Exploring llm-based agents for root cause analysis

    Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 208–219, 2024

  24. [24]

    Raglog: Log anomaly detection using retrieval augmented generation

    Jonathan Pan, Wong Swee Liang, and Yuan Yidi. Raglog: Log anomaly detection using retrieval augmented generation. In 2024 IEEE World Forum on Public Safety Technology (WFPST), pages 169–174. IEEE, 2024

  25. [25]

    A survey of aiops in the era of large language models.ACM Computing Surveys, 58(2):1–35, 2025

    Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip Yu, and Ying Li. A survey of aiops in the era of large language models.ACM Computing Surveys, 58(2):1–35, 2025

  26. [26]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  27. [27]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

  28. [28]

    Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224,

    Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions.arXiv preprint arXiv:2306.02224, 2023

  29. [29]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  30. [30]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  31. [31]

    TaskWeaver: A code-first agent framework.arXiv preprint arXiv:2311.17541,

    Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. Taskweaver: A code-first agent framework.arXiv preprint arXiv:2311.17541, 2023

  32. [32]

    IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, and Wenwu Ou. Ig-search: Step-level information gain rewards for search-augmented reasoning. arXiv preprint arXiv:2604.15148, 2026

  33. [33]

    The anatomy of an agent harness

    LangChain. The anatomy of an agent harness. Engineering blog, 2026. URL https://blog.langchain.com/the-anatomy-of-an-agent-harness/. Published: 2026-03-10. Accessed: 2026-03-12

  34. [34]

    MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

    Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, et al. Mcp-flow: Facilitating llm agents to master real-world, diverse and scaling mcp tools.arXiv preprint arXiv:2510.24284, 2025

  35. [35]

    Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance

    Zhuoxuan Jiang, Tianyang Zhang, Haotian Zhang, Yinong Xun, Yang Liu, Dehua Feng, Wen Si, and Shaohua Zhang. Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 471–482. Springer, 2025

  36. [36]

    Aoi: Turning failed trajectories into training signals for autonomous cloud diagnosis.arXiv preprint arXiv:2603.03378, 2026

    Pei Yang, Wanyi Chen, Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Bill Shi, Lynn Ai, et al. Aoi: Turning failed trajectories into training signals for autonomous cloud diagnosis.arXiv preprint arXiv:2603.03378, 2026

  37. [37]

    Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  38. [38]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  39. [39]

    Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036...

  40. [40]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

  41. [41]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  42. [42]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023

  43. [43]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

  44. [44]

    Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

  45. [45]

    A survey on self-evolution of large language models

    Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024

  46. [46]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents.arXiv preprint arXiv:2602.02474, 2026

  47. [47]

    Live-evo: Online evolution of agentic memory from continuous feedback.arXiv preprint arXiv:2602.02369, 2026

    Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, and Huazheng Wang. Live-evo: Online evolution of agentic memory from continuous feedback.arXiv preprint arXiv:2602.02369, 2026

  48. [48]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory.arXiv preprint arXiv:2511.20857, 2025. 15