Recognition: 2 Lean theorem links
SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support
Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3
The pith
SkillForge uses an iterative loop of failure analysis and skill rewriting to let cloud support agents improve their skills automatically, eventually surpassing expert-authored versions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that the combination of domain-contextualized skill creation and a three-stage self-optimization pipeline enables progressive improvement in skill quality across multiple rounds of deployment and feedback, as demonstrated in experiments where evolved skills outperformed expert references on real cloud support tasks.
What carries the argument
The self-evolution loop formed by the Failure Analyzer, Skill Diagnostician, and Skill Optimizer, which processes batches of execution traces to identify deficiencies and generate improved skill versions.
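As a concrete illustration, here is a minimal Python sketch of how the three stages could compose into one refinement cycle. The component names follow the paper; the `Trace` fields, the function signatures, and the `llm` helper are assumptions made for the sketch, not the authors' implementation.

```python
# Hypothetical sketch of the three-stage self-evolution loop; signatures and
# the `llm` helper are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Trace:
    ticket_id: str
    transcript: str
    reference_response: str  # expert response from the historical ticket
    actual_response: str     # agent response produced under the current skill

def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def failure_analyzer(traces: list[Trace]) -> list[str]:
    # Diagnose execution failures in batch, one report per divergent trace.
    return [
        llm(f"Compare the agent response to the expert reference and describe "
            f"the failure:\n{t.transcript}\nReference: {t.reference_response}\n"
            f"Actual: {t.actual_response}")
        for t in traces
    ]

def skill_diagnostician(skill_md: str, failure_reports: list[str]) -> str:
    # Map aggregated failure patterns to deficiencies in the skill document.
    return llm(f"Given SKILL.md:\n{skill_md}\nand these failure reports:\n"
               f"{failure_reports}\npinpoint the underlying skill "
               f"deficiencies and propose targeted fixes.")

def skill_optimizer(skill_md: str, diagnosis: str) -> str:
    # Rewrite the skill to eliminate the diagnosed deficiencies.
    return llm(f"Rewrite this skill to address the diagnosis.\n"
               f"Skill:\n{skill_md}\nDiagnosis:\n{diagnosis}")

def evolve(skill_md: str, rounds: int, collect_traces) -> str:
    # One creation-evaluation-refinement cycle per round of deployment feedback.
    for _ in range(rounds):
        traces = collect_traces(skill_md)  # deploy skill, gather ticket traces
        reports = failure_analyzer(traces)
        diagnosis = skill_diagnostician(skill_md, reports)
        skill_md = skill_optimizer(skill_md, diagnosis)
    return skill_md
```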
If this is right
- The Domain-Contextualized Skill Creator generates initial skills with higher consistency to expert responses than generic methods.
- The self-evolution loop improves performance from diverse starting points including expert, domain-created, and generic skills.
- Skills continue to improve with each round of feedback from real ticket executions.
- This approach was validated across five scenarios spanning 1,883 tickets and 3,737 tasks.
Where Pith is reading between the lines
- This suggests that manual skill curation may become less necessary as automated loops accumulate enough feedback data.
- Similar self-evolution mechanisms could be adapted for agent skills in other high-volume operational domains like IT helpdesks or customer service.
- If the loop runs long enough, it might produce skills tailored to specific company or region ticket patterns beyond general domain knowledge.
Load-bearing premise
The stages that analyze failures and rewrite skills can correctly identify the actual deficiencies without missing issues or introducing new problems that degrade performance.
What would settle it
Running the evolution loop for several rounds on a held-out set of tickets and observing no improvement in success rates or quality metrics compared to the initial skills would disprove the claim.
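A minimal harness for that test might look like the following, assuming an `evaluate` function that scores a skill on a ticket set and an `evolve_one_round` function that refines it on training tickets only; both are stand-ins, not the paper's tooling.

```python
# Sketch of the falsification test: if no evolved round beats the initial
# skill on held-out tickets, the self-evolution claim is disconfirmed.
from typing import Callable

def disproves_claim(initial_skill: str,
                    held_out_tickets: list,
                    evolve_one_round: Callable[[str], str],
                    evaluate: Callable[[str, list], float],
                    rounds: int = 3) -> bool:
    """True if no evolved round improves on the initial skill out-of-sample."""
    baseline = evaluate(initial_skill, held_out_tickets)
    skill, best = initial_skill, baseline
    for _ in range(rounds):
        skill = evolve_one_round(skill)  # refinement never sees held-out data
        best = max(best, evaluate(skill, held_out_tickets))
    return best <= baseline
```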
read the original abstract
Deploying LLM-powered agents in enterprise scenarios such as cloud technical support demands high-quality, domain-specific skills. However, existing skill creators lack domain grounding, producing skills poorly aligned with real-world task requirements. Moreover, once deployed, there is no systematic mechanism to trace execution failures back to skill deficiencies and drive targeted refinements, leaving skill quality stagnant despite accumulating operational evidence. We introduce SkillForge, a self-evolving framework that closes an end-to-end creation-evaluation-refinement loop. To produce well-aligned initial skills, a Domain-Contextualized Skill Creator grounds skill synthesis in knowledge bases and historical support tickets. To enable continuous self-optimization, a three-stage pipeline -- Failure Analyzer, Skill Diagnostician, and Skill Optimizer -- automatically diagnoses execution failures in batch, pinpoints the underlying skill deficiencies, and rewrites the skill to eliminate them. This cycle runs iteratively, allowing skills to self-improve with every round of deployment feedback. Evaluated on five real-world cloud support scenarios spanning 1,883 tickets and 3,737 tasks, experiments show that: (1) the Domain-Contextualized Skill Creator produces substantially better initial skills than the generic skill creator, as measured by consistency with expert-authored reference responses from historical tickets; and (2) the self-evolution loop progressively improves skill quality from diverse starting points (including expert-authored, domain-created, and generic skills) across successive rounds, demonstrating that automated evolution can surpass manually curated expert knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SkillForge, a framework for creating and iteratively refining domain-specific skills for LLM-powered agents in cloud technical support. It proposes a Domain-Contextualized Skill Creator that grounds initial skills in knowledge bases and historical tickets, followed by a three-stage self-evolution pipeline (Failure Analyzer, Skill Diagnostician, Skill Optimizer) that diagnoses failures from execution traces and rewrites skills. Experiments across five real-world scenarios involving 1,883 tickets and 3,737 tasks report that the creator yields better initial skills than generic methods (by consistency with expert references) and that the evolution loop progressively improves quality from diverse seeds, ultimately surpassing manually curated expert skills.
Significance. If the self-evolution pipeline produces generalizable improvements rather than distribution-specific fitting, the work would offer a practical mechanism for continuous, automated skill maintenance in enterprise agent deployments, reducing dependence on static expert curation and enabling adaptation to accumulating operational data.
major comments (3)
- [Abstract and evaluation] The headline claim that the self-evolution loop surpasses expert-authored skills rests on progressive improvement measured against the same 1,883 historical tickets used for skill creation and reference responses. No results are reported on temporally or distributionally held-out tickets, nor are there ablations measuring whether rewrites introduce new failure modes or regress on previously solved cases.
- [Method, three-stage pipeline] The Failure Analyzer, Skill Diagnostician, and Skill Optimizer are presented as reliably extracting true deficiencies from traces and producing targeted rewrites, yet the manuscript supplies no quantitative checks (e.g., regression rates, performance on novel failure types, or inter-annotator agreement with human experts) that these components avoid overfitting to the historical ticket distribution or creating spurious new deficiencies.
- [Abstract] The reported positive results cite consistency with expert-authored references on 1,883 tickets and 3,737 tasks but omit baselines, exact metric definitions beyond the consistency measure, statistical tests, variance across scenarios, and explicit controls for data leakage between skill synthesis and evaluation.
minor comments (2)
- [Method] The description of the Domain-Contextualized Skill Creator would benefit from an explicit pseudocode or diagram showing how knowledge-base retrieval is combined with ticket examples during synthesis.
- [Experiments] Tables or figures summarizing skill quality across evolution rounds should report per-scenario breakdowns and standard deviations rather than aggregate trends only.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications, defenses, and proposed revisions to the manuscript where appropriate.
read point-by-point responses
Referee: [Abstract and evaluation] The headline claim that the self-evolution loop surpasses expert-authored skills rests on progressive improvement measured against the same 1,883 historical tickets used for skill creation and reference responses. No results are reported on temporally or distributionally held-out tickets, nor are there ablations measuring whether rewrites introduce new failure modes or regress on previously solved cases.
Authors: We acknowledge that using the same historical tickets for both skill grounding and evaluation introduces potential concerns about generalization. Our design reflects the practical setting where refinement occurs on observed operational data, and the progressive gains from diverse seeds (including expert-authored skills) provide evidence of improvement beyond simple memorization. In the revision, we will add an ablation tracking performance on previously solved tasks across rounds to check for regressions or new failure modes. We will also expand the limitations section to explicitly discuss the lack of temporally held-out data and its implications.
revision: partial
Referee: [Method, three-stage pipeline] The Failure Analyzer, Skill Diagnostician, and Skill Optimizer are presented as reliably extracting true deficiencies from traces and producing targeted rewrites, yet the manuscript supplies no quantitative checks (e.g., regression rates, performance on novel failure types, or inter-annotator agreement with human experts) that these components avoid overfitting to the historical ticket distribution or creating spurious new deficiencies.
Authors: We agree that additional quantitative safeguards would strengthen confidence in the pipeline. The revised manuscript will include new ablations reporting regression rates (re-evaluating prior tasks post-evolution; see the first sketch after these responses) and breakdowns of performance on novel vs. recurring failure types extracted from traces. We will also add a qualitative expert review of sampled diagnostic and rewrite outputs. Resource constraints prevented formal inter-annotator agreement metrics, but the added review addresses reliability.
revision: yes
Referee: [Abstract] The reported positive results cite consistency with expert-authored references on 1,883 tickets and 3,737 tasks but omit baselines, exact metric definitions beyond the consistency measure, statistical tests, variance across scenarios, and explicit controls for data leakage between skill synthesis and evaluation.
Authors: The full paper provides the requested details: baselines include generic and non-domain creators; the consistency metric is defined as the alignment rate with expert reference responses; results report means and standard deviations across the five scenarios with paired t-tests (see the second sketch after these responses); and evaluation tasks are distinct from synthesis instances to mitigate leakage. The abstract is space-constrained, but we will revise it to briefly reference these elements and add an explicit leakage-control paragraph to the evaluation section.
revision: yes
Authors (addendum): We do not have access to temporally or distributionally held-out tickets in the current dataset, so we cannot report results on completely unseen data.
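The regression-rate ablation promised in the second response has a simple form. This sketch assumes per-task pass/fail results keyed by task id, which is a guessed data shape rather than the authors' format:

```python
# Illustrative regression-rate computation: the share of previously solved
# tasks that a newly evolved skill now fails. Field names are assumptions.
def regression_rate(prev_results: dict[str, bool],
                    new_results: dict[str, bool]) -> float:
    """Fraction of tasks solved before round k that fail after round k."""
    previously_solved = [t for t, ok in prev_results.items() if ok]
    if not previously_solved:
        return 0.0
    regressed = sum(1 for t in previously_solved
                    if not new_results.get(t, False))
    return regressed / len(previously_solved)
```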
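The consistency metric and paired t-test described in the third response could be computed roughly as follows; `aligned` stands in for whatever judgment function (e.g., an LLM-as-judge call) decides whether a response matches the expert reference, which this page does not specify.

```python
# Sketch of per-task alignment scoring for two skill versions, compared with
# scipy's standard paired t-test. `aligned` is an assumed judgment function.
from scipy.stats import ttest_rel

def alignment_scores(responses, references, aligned) -> list[float]:
    """1.0 where the agent response aligns with the expert reference, else 0.0."""
    return [1.0 if aligned(r, ref) else 0.0
            for r, ref in zip(responses, references)]

def compare_skills(scores_a: list[float], scores_b: list[float]):
    """Mean alignment rates for both versions plus the paired-test p-value."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return (sum(scores_a) / len(scores_a),
            sum(scores_b) / len(scores_b),
            p_value)
```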
Circularity Check
No significant circularity in the self-evolution claim or evaluation chain
full rationale
The paper describes an empirical framework whose core results are measured improvements in skill performance against expert-authored reference responses drawn from the 1,883 historical tickets. These tickets function as an external grounding corpus for both initial skill creation and outcome evaluation, rather than the reported gains being defined in terms of the optimizer's own outputs. No equations, fitted parameters, or self-citations are shown to reduce the progressive improvement result to a tautology or to the input traces by construction. The three-stage pipeline is presented as a procedural mechanism whose reliability is tested experimentally, not presupposed mathematically.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: historical support tickets and knowledge bases accurately reflect the distribution of real customer issues and high-quality expert responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "three-stage pipeline — Failure Analyzer, Skill Diagnostician, and Skill Optimizer — automatically diagnoses execution failures in batch, pinpoints the underlying skill deficiencies, and rewrites the skill"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "self-evolution loop progressively improves skill quality from diverse starting points across successive rounds"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
  SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
Reference graph
Works this paper leans on
- [1] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents, 2024.
- [2] Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Mandal, Xiaohua Jing, Chenyu Zhao, Jiahao Li, Sheryn Tai, Jom Dora, Tingting Liu, Longfei Li, Guoyao Xu, Yunlong Zhang, Rodrigo Fonseca, Saravan Rajmohan, and Thomas Moscibroda. Automatic root cause analysis via large language models for cloud incidents, 2024.
- [3] Zelin Wang, Zhaoyang Shen, Chao Ma, Jiaming Zhang, Fuyuan Zhou, Hongtao Zhang, Jianpeng Yao, Kai Liu, Kunyi Li, Qiyu Liao, Liuqing Shen, Jianhui Deng, Bing Guo, Ye Li, Hongfeng Jiang, Juntao Wang, Guangbo Yang, and Yang Chen. RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. In Proceedings of the 33rd ACM In…, 2024.
- [4] Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. D-Bot: Database diagnosis system using large language models. Proceedings of the VLDB Endowment, 17(11), 2024.
- [5] Barry Zhang, Keith Lazuka, and Mahesh Murag. Equipping agents for the real world with agent skills. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, October 2025.
- [6] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
- [7] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
- [8] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. ICLR 2024 (Spotlight).
- [9] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [10] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [11] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024.
- [12] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. ICLR 2023.
- [13] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers, 2023.
- [14] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. Nature, 2024.
- [15] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks, 2026.
- [16] Cen Zhao, Tiantian Zhang, Hanchen Su, Yufeng Zhang, Shaowei Su, Mingzhi Xu, Yu Liu, Wei Han, Jeremy Werner, Claire Na Cheng, and Yashar Mehdad. Agent-in-the-loop: A data flywheel for continuous improvement in LLM-based customer support. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1919–193…, 2025.
- [17] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. Symbolic learning enables self-evolving agents, 2024.
- [18] Viktor Axelsen et al. MemSkill: Learning and evolving memory skills for self-evolving agents, 2026.
- [19] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems, 2024. ICLR 2025 Outstanding Paper.
- [20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [21] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models. NeurIPS 2023.
- [22] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, et al. AgentBench: Evaluating LLMs as agents, 2023. ICLR 2024.
- [23] Sayash Kapoor, Benedikt Ströbl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024.
- [24] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023. ICLR 2024.
- [25] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. NeurIPS 2023 Workshop.
- [26] Zhixin Zhang, Jian Yang, Yifan Yu, Jiayi Zhang, and Zhoujun Li. SkillX: Automatically constructing skill knowledge bases for complex agent tasks, 2026.
- [27] Jinyu Xiang, Tao Wang, Qi Zhang, and Xuanjing Huang. AgentSkillOS: Organizing LLM-based agent skills via capability trees and DAG orchestration, 2026.
- [28] Yixiao Wang, Qi Liu, and Enhong Chen. PolySkill: Polymorphic skill abstraction for cross-domain agent generalization. ICLR 2026.
- [29] Zihao Wang, Shaofei Cai, Anji Liu, and Yitao Liang. AgentFactory: Automatically accumulating executable sub-agents via progressive skill refinement, 2026.
- [30] Yuxuan Jiang et al. TARSE: Test-time adaptation via retrievable skills and experiences for LLM agents, 2026.
- [31] Xunjian Yin et al. Gödel agent: A self-referential framework for agents recursively self-improvement, 2024.
- [32] Jialu Zhang, Xiangru Tang, Junyu Luo, Yilun Zhao, Arman Cohan, and Mark Gerstein. Self-evolving agent skills via co-evolutionary verification, 2026.
- [33] Zihao Wang, Shaofei Cai, Anji Liu, and Yitao Liang. Dual-track knowledge distillation: Learning skills from success and guardrails from failure, 2026.
- [34] Qiushi Sun et al. AutoAgent: Evolving cognition and elastic memory for LLM-based agents, 2026.
- [35] Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, and Zhonghai Wu. A survey of AIOps for failure management in the era of large language models, 2024.
discussion (0)