MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Fuxin Jiang; Huawei Lin; Jie Song; Peng Li; Tieying Zhang

arXiv: 2605.27366 · v1 · pith:GSWXHLDXnew · submitted 2026-05-26 · 💻 cs.AI · cs.CL · cs.LG · cs.MA

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

Pith reviewed 2026-06-29 17:38 UTC · model grok-4.3


keywords: LLM agents · skill creation · skill memory · agent evolution · skill evaluation · self-improving agents · SkillsBench


The pith

Agents improve task performance by managing skills through a full lifecycle of creation, memory, evaluation and refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MUSE-Autoskill, a framework that lets LLM agents create skills when needed, store them with accumulated experience across tasks, organize them for selection, evaluate them using unit tests and runtime feedback, and refine them continuously. This unified lifecycle treats skills as reusable, long-lived assets instead of isolated static items. The approach aims to enable ongoing improvement in solving complex tasks. Experiments on SkillsBench indicate gains in success rates, efficiency, skill reuse, and transfer to other agents.
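The evaluate-then-refine part of this lifecycle can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the `Skill` dataclass, the `evaluate` and `lifecycle_step` functions, and the `propose_fix` callback (a stand-in for whatever LLM call would generate an edit) are all hypothetical names, and the acceptance rule (keep an edit only if it fails fewer unit tests) is one plausible choice rather than anything the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Skill:
    """Hypothetical skill record: code, unit tests, and a per-skill log."""
    name: str
    run: Callable[[int], int]
    tests: List[Tuple[int, int]]  # (input, expected output) pairs
    memory: List[str] = field(default_factory=list)  # skill-level experience log


def evaluate(skill: Skill) -> List[Tuple[int, int]]:
    """Return the unit tests that the skill currently fails."""
    failures = []
    for arg, expected in skill.tests:
        try:
            ok = skill.run(arg) == expected
        except Exception:
            ok = False  # a crash counts as a failed test
        if not ok:
            failures.append((arg, expected))
    return failures


def lifecycle_step(skill: Skill, propose_fix) -> Skill:
    """One evaluate/refine pass; propose_fix stands in for an LLM-generated edit."""
    failures = evaluate(skill)
    skill.memory.append(f"evaluated: {len(failures)} failing test(s)")
    if failures:
        candidate = propose_fix(skill, failures)
        trial = Skill(skill.name, candidate, skill.tests, skill.memory)
        # Assumed guardrail: accept the edit only if it fails fewer tests.
        if len(evaluate(trial)) < len(failures):
            skill.run = candidate
            skill.memory.append("refined: accepted candidate edit")
        else:
            skill.memory.append("refined: rejected candidate edit")
    return skill


# Usage: a buggy "double" skill is repaired by a hand-written candidate fix.
buggy = Skill("double", lambda x: x + 2, tests=[(2, 4), (3, 6)])
fixed = lifecycle_step(buggy, lambda skill, failures: (lambda x: x * 2))
```

The point of the sketch is the shape of the loop: test results gate every edit, and each pass leaves a trace in the skill's own log so later tasks can draw on it.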

Core claim

By unifying skill creation on demand, skill-level memory for experience accumulation, management for organization and selection, evaluation through unit tests and runtime feedback, and refinement, agents can continuously evolve their capabilities as long-lived, experience-aware, and testable assets.

What carries the argument

The skill lifecycle of creation, memory, management, evaluation, and refinement, with skill-level memory that accumulates experience for each skill across tasks.

If this is right

Skills become reusable across multiple tasks instead of being recreated each time.
Runtime feedback and unit tests drive ongoing refinement of individual skills.
Experience accumulated in skill-level memory improves adaptation when the same skill is reused.
Skills developed by one agent can transfer effectively to other agents.
Task solving shows higher success rates and lower resource use over repeated interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could reduce reliance on retraining base models by shifting improvement to the skill layer.
Similar lifecycle structures might apply to other reusable components such as plans or tool sets.
Longer-running agent deployments would test whether refinement continues to yield gains without external intervention.

Load-bearing premise

Agents can create skills on demand and refine them meaningfully using only unit tests and runtime feedback without further human oversight.

What would settle it

Experiments on SkillsBench that compare the lifecycle-managed approach against static skills show no gains in task success rate, efficiency, reuse, or cross-agent transfer.

Original abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MUSE-Autoskill sketches a unified lifecycle plus skill-level memory for LLM agents, but the mechanisms for on-demand creation and test-driven refinement stay too vague to back the performance claims.

The core idea here is a single loop where agents create skills when needed, store them with per-skill experience logs, pick the right ones for new tasks, test them with unit tests and runtime signals, and edit them over time. The skill-level memory is the clearest addition over prior work that treated skills as one-off static items.

That memory component is a reasonable move. Accumulating task-specific outcomes for each skill could in principle improve selection and adaptation without retraining the whole agent. The paper also correctly flags that isolated skill creation leaves reusability and reliability on the table.
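To make the idea of outcome-driven selection concrete, here is a minimal sketch of how accumulated per-skill outcomes might feed skill choice. Everything in it is an assumption made for illustration: the `SkillMemory` class, its method names, the skill names in the usage example, and the smoothed success-rate ranking are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SkillMemory:
    """Hypothetical per-skill experience store keyed by skill name."""

    def __init__(self) -> None:
        # skill name -> list of (task_id, success) outcomes
        self._log: Dict[str, List[Tuple[str, bool]]] = defaultdict(list)

    def record(self, skill: str, task_id: str, success: bool) -> None:
        """Append one task outcome to the skill's log."""
        self._log[skill].append((task_id, bool(success)))

    def success_rate(self, skill: str) -> float:
        """Laplace-smoothed success rate; an unseen skill scores 0.5."""
        outcomes = self._log.get(skill, [])
        wins = sum(1 for _, ok in outcomes if ok)
        return (wins + 1) / (len(outcomes) + 2)

    def select(self, candidates: List[str]) -> str:
        """Pick the candidate with the best smoothed track record."""
        return max(candidates, key=self.success_rate)


# Usage with made-up skill names and task ids.
memory = SkillMemory()
memory.record("parse_csv", "task-1", True)
memory.record("parse_csv", "task-2", True)
memory.record("parse_csv", "task-3", False)
memory.record("regex_extract", "task-4", False)
best = memory.select(["parse_csv", "regex_extract", "unseen_skill"])
```

The smoothing keeps a skill with a single failure from being ranked below every untested alternative forever, which is one simple way to balance reuse against trying new skills.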

The soft spot is exactly the one in the stress-test note. The abstract never spells out the decision rules or prompt patterns that turn a failed unit test into a concrete skill edit, nor how the agent decides a new skill is worth creating in the first place. Without those pieces, any reported lift on SkillsBench could just be the base LLM plus extra scaffolding rather than the lifecycle itself. The experiments are described only at the level of “initial evidence,” with no numbers, baselines, or error analysis visible.

This is for people already building agent systems who want a high-level template for skill management. A reader could borrow the memory and lifecycle framing even if they end up implementing the details differently.

I would send it to peer review. The topic matters and the framing is coherent on its own terms, but the authors need to show the actual creation and refinement procedures plus the quantitative results before the claims can be assessed.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MUSE-Autoskill, a skill-centric framework for LLM agents that enables continuous capability improvement via a unified lifecycle of skill creation, memory storage, management, evaluation, and refinement. It introduces skill-level memory to accumulate cross-task experience for each skill and claims that experiments on SkillsBench provide initial evidence of gains in task success, efficiency, reuse, and cross-agent transfer when skills are treated as long-lived, experience-aware, and testable assets.

Significance. If the experimental claims hold after details are supplied, the work could meaningfully advance autonomous agent research by shifting from static, isolated skills to dynamically managed ones with built-in evaluation and memory. The skill-level memory construct is a clear conceptual contribution that addresses long-term adaptation, and the unified lifecycle framing offers a coherent organizing principle for future systems.

major comments (2)

[Abstract, §3] Abstract and §3 (Framework): the central claim that unit tests plus runtime feedback drive meaningful continuous refinement without human oversight rests on unspecified mechanisms; no concrete procedure, prompt template, decision rule, or guardrail is given for translating test failures into skill edits, so observed gains cannot be attributed to the proposed lifecycle rather than base LLM capabilities.
[Abstract, §4] Abstract and §4 (Experiments): the statement that 'experiments on SkillsBench provide initial evidence' of improved success, efficiency, reuse, and transfer supplies no methods, baselines, quantitative results, error analysis, or ablation, rendering it impossible to evaluate whether the data supports the load-bearing claim that lifecycle-managed skills are responsible for the gains.

minor comments (2)

[§3.1] Notation for skill-level memory is introduced without a formal definition or pseudocode, which would aid reproducibility.
[Abstract] The abstract uses 'initial evidence' without clarifying the scale of SkillsBench or number of tasks/agents evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential conceptual contributions of the skill-level memory and unified lifecycle. We address the two major comments below and will incorporate the requested details into the revised manuscript.

Referee: [Abstract, §3] Abstract and §3 (Framework): the central claim that unit tests plus runtime feedback drive meaningful continuous refinement without human oversight rests on unspecified mechanisms; no concrete procedure, prompt template, decision rule, or guardrail is given for translating test failures into skill edits, so observed gains cannot be attributed to the proposed lifecycle rather than base LLM capabilities.

Authors: We agree that the current description of the refinement process in §3 is high-level and lacks the requested implementation specifics. In the revision we will add the concrete prompt templates for generating skill edits from unit-test failures and runtime feedback, the decision rules that determine when and how an edit is applied, and the guardrails that keep the process autonomous. These additions will make it possible to attribute observed gains to the lifecycle rather than base LLM behavior. revision: yes
Referee: [Abstract, §4] Abstract and §4 (Experiments): the statement that 'experiments on SkillsBench provide initial evidence' of improved success, efficiency, reuse, and transfer supplies no methods, baselines, quantitative results, error analysis, or ablation, rendering it impossible to evaluate whether the data supports the load-bearing claim that lifecycle-managed skills are responsible for the gains.

Authors: We accept that the experimental reporting in §4 is insufficient for rigorous evaluation. The revised manuscript will expand this section with complete methods, the full set of baselines, all quantitative results (including means, variances, and statistical tests), error analysis, and ablation studies that isolate the contribution of skill-level memory and the evaluation-refinement loop. This will directly address whether the lifecycle is responsible for the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework claims rest on empirical evaluation rather than self-referential reduction.

The paper describes a proposed agent framework (MUSE-Autoskill) with a skill lifecycle and reports experimental results on SkillsBench showing improvements in success, efficiency, reuse, and transfer. No derivation chain, equations, predictions, or first-principles results are present that reduce by construction to fitted inputs, self-citations, or renamed known results. The central claims are supported by external benchmark evaluation of the implemented system, with no load-bearing self-citation chains or self-definitional loops identified in the abstract or framework description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based on abstract only; limited information available. The framework relies on domain assumptions about agent capabilities for on-demand creation and feedback-driven refinement. Introduces the new concept of skill-level memory without independent evidence provided.

axioms (2)

domain assumption: Skills can be created on demand by the agent.
Central to the creation stage of the proposed lifecycle.
domain assumption: Unit tests and runtime feedback suffice to drive continuous skill refinement.
Assumed for the evaluation and refinement stages to enable long-term improvement.

invented entity (1)

skill-level memory (no independent evidence)
purpose: Accumulates experience for each skill across tasks to enable better reuse and adaptation
New postulated component introduced to support the memory part of the lifecycle.

pith-pipeline@v0.9.1-grok · 5721 in / 1380 out tokens · 122480 ms · 2026-06-29T17:38:52.769481+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MUSE: A Unified Agentic Harness for MLLMs
cs.CV 2026-06 unverdicted novelty 6.0

MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and ver...

Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1] Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. EvoSkill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026.

[2] Anthropic. Agent Skills: Equipping Agents for the Real World. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills, 2025. Open standard released December 2025; https://github.com/anthropics/skills

[3] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024.

[4] Hongcheol Cho, Ryangkyung Kang, and Youngeun Kim. SkillRet: A large-scale benchmark for skill retrieval in LLM agents. arXiv preprint arXiv:2605.05726, 2026.

[5] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vie...

[6] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of LLM agents: A survey. CoRR, abs/2402.02716, 2024.

[7] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pages 1658–1677, Bangkok, Thailand, 2024. As...

[8] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net

[9] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.

[10] Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, et al. Omnigaia: Towards native omni-modal ai agents. CoRR, abs/2602.22897, 2026.

[11] Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, and Ravender Pal Singh. Agent-omni: Test-time multimodal reasoning via model coordination for understanding anything. CoRR, abs/2511.02834, 2025.

[12] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12:157–173, 2024.

[13] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Re...

[14] Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, and Stefan Feuerriegel. SkillGen: Verified inference-time agent skill synthesis. arXiv preprint arXiv:2605.10999, 2026.

[15] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sys...

[16] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net

[17] Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, et al. SkillOS: Learning skill curation for self-evolving agents. arXiv preprint arXiv:2605.06614, 2026.

[18] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. CoRR, abs/2310.08560, 2023.

[19] Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 3, pages 2:1–2:22, San Francisco, CA, 2023.

[20] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. In Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 2024.

[21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, 2023.

[22] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 5977–6043, 2025.

[23] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. CoRR, abs/2303.17580, 2023.

[24] Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi Gu, Xunliang Cai, Xiang Wang, and An Zhang. Skill1: Unified evolution of skill-augmented agents via reinforcement learning. arXiv preprint arXiv:2605.06130, 2026.

[25] Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, et al. Youtu-agent: Scaling agent productivity with automated generation and hybrid policy optimization. arXiv preprint arXiv:2512.24615, 2025.

[26] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, 2023.

[27] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024, 2024.

[28] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, et al. Openhands: An open platform for AI software developers as generalist agents. In...

[29] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. CoRR, abs/2308.08155, 2023.

[30] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net

[31] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026.

[32] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 2024.

[33] Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, and Yong Li. SkillMaster: Toward autonomous skill mastery in LLM agents. arXiv preprint arXiv:2605.08693, 2026.

[34] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 2023.

[36] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 13643–13658, 2024.

[37] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: LLM agents are experiential learners. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI, pages 19632–19642, Vancouver, Canada, 2024.

[38] Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. Lifelongagentbench: Evaluating llm agents as lifelong learners. arXiv preprint arXiv:2505.11942, 2025.
[39]

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo FR Ribeiro, Markus Dreyer, Sean Ammirati, and Chenyan Xiong. Skilllearnbench: Benchmarking continual learning methods for agent skill generation on real-world tasks.arXiv preprint arXiv:2604.20087, 2026. 19 A Selected Task List Table 7 lists all 51 selected SkillsBench tasks us...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

doc-only

Numbered list of invariants the implementation must preserve. ## Recommended tools and libraries - Concrete library names, CLI commands, or sandbox tools. ## Workflow Step-by-step procedure the agent should follow at runtime. Catalog routing.The frontmatter description field is the only piece of the skill that is surfaced eagerly: at the start of every ta...

2026

[1] [1]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Agent Skills: Equipping Agents for the Real World.https://www.anthropic.com/engineering/ equipping-agents-for-the-real-world-with-agent-skills , 2025

Anthropic. Agent Skills: Equipping Agents for the Real World.https://www.anthropic.com/engineering/ equipping-agents-for-the-real-world-with-agent-skills , 2025. Open standard released December 2025; https://github.com/anthropics/skills

2025

[3] [3]

Teaching large language models to self-debug

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In The TwelfthInternational Conference on Learning Representations, ICLR, Vienna, Austria, 2024

2024

[4] [4]

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

Hongcheol Cho, Ryangkyung Kang, and Youngeun Kim. Skillret: A large-scale benchmark for skill retrieval in llm agents.arXiv preprint arXiv:2605.05726, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, StevenKaShingYau, ZijuanLin, LiyangZhou, ChenyuRan, LingfengXiao, ChenglinWu, andJürgenSchmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe TwelfthInternational Conference on Learning Representations, ICLR 2024, Vie...

2024

[6] [6]

Understanding the planning of LLM agents: A survey

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of LLM agents: A survey.CoRR, abs/2402.02716, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pages 1658– 1677, Bangkok, Thailand, 2024. As...

2024

[8] [8]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe TwelfthInternational Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net

2024

[9] [9]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Omnigaia: Towards native omni-modal ai agents.CoRR, abs/2602.22897, 2026

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, et al. Omnigaia: Towards native omni-modal ai agents.CoRR, abs/2602.22897, 2026

work page arXiv 2026

[11] [11]

Agent-omni: Test-time multimodal reasoning via model coordination for understanding anything.CoRR, abs/2511.02834, 2025

Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, and Ravender Pal Singh. Agent-omni: Test-time multimodal reasoning via model coordination for understanding anything.CoRR, abs/2511.02834, 2025

work page arXiv 2025

[12] [12]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Trans.Assoc. Comput. Linguistics, 12:157–173, 2024. 17

2024

[13] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Re...

[14] Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, and Stefan Feuerriegel. SkillGen: Verified inference-time agent skill synthesis. arXiv preprint arXiv:2605.10999, 2026.

[15] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sys...

[16] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net.

[17] Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, et al. SkillOS: Learning skill curation for self-evolving agents. arXiv preprint arXiv:2605.06614, 2026.

[18] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. CoRR, abs/2310.08560, 2023.

[19] Joon Sung Park, Joseph C. O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, pages 2:1–2:22, San Francisco, CA, 2023.

[20] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 2024.

[21] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, 2023.

[22] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 5977–6043, 2025.

[23] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. CoRR, abs/2303.17580, 2023.

[24] Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi Gu, Xunliang Cai, Xiang Wang, and An Zhang. Skill1: Unified evolution of skill-augmented agents via reinforcement learning. arXiv preprint arXiv:2605.06130, 2026.

[25] Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, et al. Youtu-agent: Scaling agent productivity with automated generation and hybrid policy optimization. arXiv preprint arXiv:2512.24615, 2025.

[26] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, 2023.

[27] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Trans. Mach. Learn. Res., 2024, 2024.

[28] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, et al. OpenHands: An open platform for AI software developers as generalist agents. In...

[29] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. CoRR, abs/2308.08155, 2023.

[30] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024. OpenReview.net.

[31] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026.

[32] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 2024.

[33] Min Yang, Jinghua Piao, Xu Xia, Xiaochong Lan, Jiaju Chen, Yongshun Gong, and Yong Li. SkillMaster: Toward autonomous skill mastery in LLM agents. arXiv preprint arXiv:2605.08693, 2026.

[34] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR, Kigali, Rwanda, 2023.

[36] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 13643–13658, 2024.

[37] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI, pages 19632–19642, Vancouver, Canada, 2024.

[38] Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. LifelongAgentBench: Evaluating LLM agents as lifelong learners. arXiv preprint arXiv:2505.11942, 2025.

[39] Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo FR Ribeiro, Markus Dreyer, Sean Ammirati, and Chenyan Xiong. SkillLearnBench: Benchmarking continual learning methods for agent skill generation on real-world tasks. arXiv preprint arXiv:2604.20087, 2026.

A Selected Task List

Table 7 lists all 51 selected SkillsBench tasks us...

doc-only

Numbered list of invariants the implementation must preserve.

## Recommended tools and libraries
- Concrete library names, CLI commands, or sandbox tools.

## Workflow
Step-by-step procedure the agent should follow at runtime.

Catalog routing. The frontmatter description field is the only piece of the skill that is surfaced eagerly: at the start of every ta...
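The catalog-routing idea (only the frontmatter description is surfaced eagerly) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration rather than the paper's implementation: the `---` frontmatter delimiters, the `key: value` field syntax, the helper names `parse_frontmatter`, `build_catalog`, and `render_catalog`, and the choice to keep the skill body out of the catalog until a skill is selected.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: only what is shown to the agent up front."""
    name: str
    description: str


def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill document into (frontmatter fields, body).

    Assumes a simple '---'-delimited header of 'key: value' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the skill body.
            return fields, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


def build_catalog(skills: dict[str, str]) -> list[CatalogEntry]:
    """Build the eager catalog from skill name -> full document text."""
    catalog = []
    for name, text in skills.items():
        fields, _body = parse_frontmatter(text)  # body is deliberately dropped
        catalog.append(CatalogEntry(name, fields.get("description", "")))
    return catalog


def render_catalog(catalog: list[CatalogEntry]) -> str:
    """Render one line per skill, suitable for prepending to a task prompt."""
    return "\n".join(f"- {entry.name}: {entry.description}" for entry in catalog)


if __name__ == "__main__":
    demo = {
        "csv-clean": (
            "---\n"
            "name: csv-clean\n"
            "description: Clean and validate CSV files.\n"
            "---\n"
            "## Workflow\n"
            "1. Load the file.\n"
        )
    }
    print(render_catalog(build_catalog(demo)))
```

Under these assumptions the catalog stays small because each skill contributes one short line, while the workflow and tool sections remain in the skill document until they are needed.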
