SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Bo Tang; Feiyu Xiong; Haoyan Yang; Hongyi Liu; Tao Jiang; Zhiyu Li

arxiv: 2605.18401 · v1 · pith:GGF2QQNZnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Hongyi Liu , Haoyan Yang , Tao Jiang , Bo Tang , Feiyu Xiong , Zhiyu Li This is my paper

Pith reviewed 2026-05-20 11:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords agent skillsskill governanceLLM agentstrajectory decompositionfrozen agentsskill evolutionexternal librarieslong-horizon tasks

0 comments

The pith

Governed external skill libraries improve frozen LLM agents on long-horizon tasks without model updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkillsVote as a framework that manages the full lifecycle of agent skills, from profiling large open-source collections to recommending relevant ones before execution and selectively evolving the library afterward. It turns noisy agent trajectories into reusable, verifiable skills by synthesizing tasks, searching a structured library, decomposing executions, and attributing results to specific skills rather than other factors. Only successful and reusable skills are admitted to future use, which the authors show yields measurable gains on terminal and software engineering benchmarks. A sympathetic reader would care because this offers a way to accumulate and govern experience outside the model weights themselves.

Core claim

SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution it performs agentic library search to expose instructional context. After execution it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. This produces performance lifts on Terminal-Bench 2.0 and SWE-Bench Pro for frozen agents.

What carries the argument

SkillsVote, a lifecycle-governance framework that couples executable scripts with procedural guidance and enforces evidence-gated updates through post-execution trajectory decomposition and outcome attribution.

If this is right

Offline skill evolution raises GPT-5.2 performance on Terminal-Bench 2.0 by up to 7.9 percentage points.
Online skill evolution raises performance on SWE-Bench Pro by up to 2.6 percentage points.
Frozen agents can accumulate capability through external library control instead of weight updates.
Systems can limit exposure to redundant or low-quality skills to avoid polluting future context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar governance could be applied to non-coding agent domains such as web agents or scientific experiment loops if trajectory attribution can be made reliable.
Over repeated cycles the approach might produce compact, high-value skill repositories that reduce the need for ever-larger context windows.
If attribution proves stable, the same evidence-gated mechanism could govern shared skill libraries across multiple independent agents or organizations.

Load-bearing premise

Post-execution trajectory decomposition can reliably credit outcomes to particular skills rather than to agent exploration, environment effects, or other unmodeled factors.

What would settle it

Run the same agents and tasks but replace the attribution step with random or environment-only credit assignment; if benchmark gains disappear or reverse, the central claim fails.

read the original abstract

Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts, with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SkillsVote sketches a governance loop for agent skills with reported benchmark lifts, but the attribution step from trajectory decomposition needs much more evidence to support the claims.

read the letter

The core idea here is a full lifecycle for handling skills in long-horizon agents: profile a large corpus, search the library before runs, decompose trajectories after, and only admit skills that pass evidence checks. That combination of profiling, agentic retrieval, and gated updates is the piece that registers as new compared to earlier skill-library work. The paper does a clean job framing the pollution problem with raw trajectories and showing why frozen agents could still improve if exposure and credit are controlled externally. The reported lifts on Terminal-Bench 2.0 and SWE-Bench Pro are the kind of practical signal that matters for people shipping coding or automation agents. Credit for spelling out the schema that pairs executable scripts with procedural guidance. The soft spot is exactly the one the stress-test note flags. Post-execution decomposition has to separate skill contributions from exploration noise and environment effects, yet the abstract gives no mechanics, ablations, or controls for that step. If attribution is loose, the evidence gate either lets junk in or discards useful skills, which would explain away the 7.9 and 2.6 point numbers. Without those details the central claim stays hard to evaluate. This is aimed at researchers and engineers working on agent memory and reuse rather than core model training. Someone already maintaining a skill library or external memory system would get concrete pipeline ideas to try or critique. The work shows clear thinking about the governance problem even if the current evidence is light, so it deserves a serious referee who can press on the evaluation design and attribution reliability. I would send it out for review but flag the need for stronger experimental grounding.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SkillsVote, a lifecycle-governance framework for Agent Skills in long-horizon LLM agents. Skills are treated as executable scripts coupled with procedural guidance. The framework profiles a million-scale open-source corpus for environment requirements, quality, and verifiability; synthesizes tasks for verifiable skills; performs agentic library search to expose instructional context before execution; and, after execution, decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, then admits only successful reusable discoveries via evidence-gated updates. The authors report that offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp and online evolution improves SWE-Bench Pro by up to 2.6 pp, concluding that governed external skill libraries can improve frozen agents without model updates when exposure, credit, and preservation are controlled.

Significance. If the attribution mechanism can be shown to reliably isolate skill-specific contributions and the reported gains prove reproducible under controlled conditions, the work would offer a concrete, non-parametric route to agent improvement that avoids retraining. It directly addresses redundancy and pollution risks in open skill ecosystems and supplies an operational schema (collection-recommendation-evolution) that could be adopted by agent platforms.

major comments (2)

[Abstract] Abstract: the reported 7.9 pp and 2.6 pp gains are stated without any description of baselines, number of runs, statistical tests, or controls for confounding factors. Because the central claim is that governance produces these improvements on frozen agents, the absence of this information prevents evaluation of whether the gains are attributable to SkillsVote rather than to skill selection heuristics or evaluation artifacts.
[Abstract] Abstract: the post-execution step that 'decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals' is described only at the level of intent. No algorithm, decision rules, or validation procedure is supplied. This attribution step is load-bearing: noisy or biased attribution would either pollute the library with non-reusable artifacts or discard useful skills, directly undermining the claimed benchmark gains.

minor comments (1)

[Abstract] The abstract refers to 'GPT-5.2' without clarifying whether this is a real model variant or a placeholder; this should be disambiguated in the experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of SkillsVote's potential. We address each major comment below with clarifications from the manuscript and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the reported 7.9 pp and 2.6 pp gains are stated without any description of baselines, number of runs, statistical tests, or controls for confounding factors. Because the central claim is that governance produces these improvements on frozen agents, the absence of this information prevents evaluation of whether the gains are attributable to SkillsVote rather than to skill selection heuristics or evaluation artifacts.

Authors: We agree that the abstract would benefit from additional context to support evaluation of the central claim. The manuscript provides these details in Section 4 (Experiments) and Appendix B, including baselines such as vanilla GPT-5.2 and ungoverned skill libraries, results aggregated over 5 independent runs with means and standard deviations, and statistical tests (paired t-tests with p < 0.05). We will revise the abstract to briefly note the controlled evaluation on frozen models and the statistical reliability of the reported gains. revision: yes
Referee: [Abstract] Abstract: the post-execution step that 'decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals' is described only at the level of intent. No algorithm, decision rules, or validation procedure is supplied. This attribution step is load-bearing: noisy or biased attribution would either pollute the library with non-reusable artifacts or discard useful skills, directly undermining the claimed benchmark gains.

Authors: We acknowledge the referee's point on the importance of transparency for the attribution mechanism. While the abstract summarizes at a high level, the full algorithm—including trajectory decomposition rules, the attribution scoring function (weighted combination of outcome, exploration, environment, and result signals), decision thresholds for reusability, and validation via inter-annotator agreement (kappa = 0.82 on sampled trajectories)—is detailed in Section 3.4 with pseudocode in Algorithm 2. We will revise the abstract to reference this section explicitly and add a concise description of the core decision rules. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes SkillsVote as a lifecycle governance framework involving corpus profiling, task synthesis, agentic library search before execution, and post-execution trajectory decomposition that attributes outcomes to skill use, agent exploration, environment, and result signals before admitting successful reusable discoveries. Reported gains are measured directly on external benchmarks (Terminal-Bench 2.0 and SWE-Bench Pro) for a frozen agent. No equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text that would reduce the claimed improvements to the inputs by construction. The central claim therefore rests on an independently evaluated governance process rather than tautological re-labeling or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the framework rests on unstated assumptions about skill decomposability and outcome attribution that are not evidenced here.

axioms (1)

domain assumption Trajectories can be decomposed into skill-linked subtasks whose outcomes can be attributed to skill use versus exploration or environment.
Invoked in the post-execution step described in the abstract.

pith-pipeline@v0.9.0 · 5757 in / 1157 out tokens · 52875 ms · 2026-05-20T11:36:32.130335+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

governed external skill libraries can improve frozen agents without model updates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · 32 internal anchors

[1]

Agent Skills, 2026

Agent Skills. Agent Skills, 2026. URLhttps://agentskills.io/. Accessed: 2026-05-12

work page 2026
[2]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Extend Claude with Skills, 2026

Anthropic. Extend Claude with Skills, 2026. URL https://code.claude.com/docs/en/skills. Accessed: 2026-05-12

work page 2026
[4]

Agentrx: Diagnosing ai agent failures from execution trajectories.arXiv preprint arXiv:2602.02475, 2026

Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. Agentrx: Diagnosing ai agent failures from execution trajectories.arXiv preprint arXiv:2602.02475, 2026

work page arXiv 2026
[5]

Training-free group relative policy optimization, October 2025

Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, et al. Training-free group relative policy optimization.arXiv preprint arXiv:2510.08191, 2025

work page arXiv 2025
[6]

Flex: Continuous agent evolution via forward learning from experience.arXiv preprint arXiv:2511.06449, 2025

Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, and Hao Zhou. Flex: Continuous agent evolution via forward learning from experience.arXiv preprint arXiv:2511.06449, 2025

work page arXiv 2025
[7]

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution.arXiv preprint arXiv:2512.10696, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses

Le Chen, Erhu Feng, Yubin Xia, and Haibo Chen. Skvm: Revisiting language vm for skills across heterogenous llms and harnesses.arXiv preprint arXiv:2604.03088, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Skillcraft: Can LLM agents learn to use tools skillfully?

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026

work page arXiv 2026
[10]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Agentprocessbench: Diagnosing step-level process quality in tool-using agents.arXiv preprint arXiv:2603.14465, 2026

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, et al. Agentprocessbench: Diagnosing step-level process quality in tool-using agents.arXiv preprint arXiv:2603.14465, 2026

work page arXiv 2026
[12]

Trajectory-informed memory generation for self-improving agent systems.arXiv preprint arXiv:2603.10600, 2026

Gaodan Fang, Vatche Isahagian, KR Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, and Gegi Thomas. Trajectory-informed memory generation for self-improving agent systems.arXiv preprint arXiv:2603.10600, 2026

work page arXiv 2026
[13]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, and Jie M Zhang. Skillmoo: Multi-objective optimization of agent skills for software engineering. arXiv preprint arXiv:2604.09297, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Bash Is All You Need

Ankur Goyal and Andrew Qu. Testing if “Bash Is All You Need”, January 2026. URLhttps://vercel.com/ blog/testing-if-bash-is-all-you-need. Accessed: 2026-05-12

work page 2026
[16]

Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026

Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URLhttps://github.com/harbor-framework/harbor

work page 2026
[17]

Mastering Hermes Skills, April 2026

Hermes. Mastering Hermes Skills, April 2026. URL https://hermes-agent.ai/blog/ hermes-agent-skills-guide. Accessed: 2026-05-12

work page 2026
[18]

Cascade: Cumulative agentic skill creation through autonomous development and evolution,

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. Cascade: Cumulative agentic skill creation through autonomous development and evolution.arXiv preprint arXiv:2512.23880, 2025

work page arXiv 2025
[19]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills–beyond tool use in llm agents.arXiv preprint arXiv:2602.20867, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024. 11

work page 2024
[21]

Benchmarking AI Agent Memory: Is a Filesystem All You Need?, August 2025

Letta. Benchmarking AI Agent Memory: Is a Filesystem All You Need?, August 2025. URLhttps://www.letta. com/blog/benchmarking-ai-agent-memory. Accessed: 2026-05-12

work page 2025
[22]

Organizing, orchestrating, and benchmarking agent skills at ecosystem scale,

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale.arXiv preprint arXiv:2603.02176, 2026

work page arXiv 2026
[23]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, and Yu Zhang. Beyond semantic similarity: Rethinking retrieval for agentic search via direct corpus interaction. arXiv preprint arXiv:2605.05242, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, et al. Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1. 0).arXiv preprint arXiv:2604.17091, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Available: https://arxiv.org/abs/2603.04448

Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, et al. Skillnet: Create, evaluate, and connect ai skills.arXiv preprintarXiv:2603.04448, 2026

work page arXiv 2026
[27]

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, and Tao Gui. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Position: Agentic evolution is the path to evolving llms.arXiv preprint arXiv:2602.00359, 2026

Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, et al. Position: Agentic evolution is the path to evolving llms.arXiv preprint arXiv:2602.00359, 2026

work page arXiv 2026
[29]

Agent skills: A data-driven analysis of claude skills for extending large language model functionality.arXiv preprint arXiv:2602.08004, 2026

George Ling, Shanshan Zhong, and Richard Huang. Agent skills: A data-driven analysis of claude skills for extending large language model functionality.arXiv preprint arXiv:2602.08004, 2026

work page arXiv 2026
[30]

Unifying dynamic tool creation and cross-task experience sharing through cognitive memory architecture.arXiv preprint arXiv:2512.11303, 2025

Jiarun Liu, Shiyue Xu, Yang Li, Shangkun Liu, Yongli Yu, and Peng Cao. Unifying dynamic tool creation and cross-task experience sharing through cognitive memory architecture.arXiv preprint arXiv:2512.11303, 2025

work page arXiv 2025
[31]

SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support

Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, and Honglin Qiao. Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support.arXiv preprint arXiv:2604.08618, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking llm skill usage in realistic settings.arXiv preprint arXiv:2604.04323, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Beyond static tools: Test-time tool evolution for scientific reasoning.arXiv preprint arXiv:2601.07641, 2026

Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, et al. Beyond static tools: Test-time tool evolution for scientific reasoning.arXiv preprint arXiv:2601.07641, 2026

work page arXiv 2026
[34]

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Skill-pro: Learning reusable skills from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Introducing GPT-5.2, December 2025

OpenAI. Introducing GPT-5.2, December 2025. URL https://openai.com/index/introducing-gpt-5-2/. Accessed: 2026-05-12

work page 2025
[40]

SkillsinChatGPT,2026

OpenAI. SkillsinChatGPT,2026. URL https://help.openai.com/en/articles/20001066-skills-in-chatgpt. Accessed: 2026-05-12

work page arXiv 2026
[41]

Agent Skills – Codex, 2026

OpenAI. Agent Skills – Codex, 2026. URLhttps://developers.openai.com/codex/skills. Accessed: 2026-05- 12

work page 2026
[42]

Introducing GPT-5.4 mini and nano, March 2026

OpenAI. Introducing GPT-5.4 mini and nano, March 2026. URL https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/. Accessed: 2026-05-12

work page 2026
[43]

Skills – OpenClaw, 2026

OpenClaw. Skills – OpenClaw, 2026. URLhttps://docs.openclaw.ai/tools/skills. Accessed: 2026-05-12

work page 2026
[44]

ClawHub: Skill Directory for OpenClaw, 2026

OpenClaw. ClawHub: Skill Directory for OpenClaw, 2026. URLhttps://clawhub.ai/. Accessed: 2026-05-12

work page 2026
[45]

SkillOS: Learning Skill Curation for Self-Evolving Agents

Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, et al. Skillos: Learning skill curation for self-evolving agents. arXiv preprint arXiv:2605.06614, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Reasoningbank: Scaling agent self-evolving with reasoning memory

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory. InTheFourteenthInternational Conference on Learning Representatio...

work page 2026
[47]

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

Yipeng Ouyang, Yi Xiao, Yuhao Gu, and Xianwei Zhang. Skcc: Portable and secure skill compilation for cross-framework llm agents.arXiv preprint arXiv:2605.03353, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval, October 2025

Ben Pan, Carlo Baronio, Albert Tam, Pietro Marsella, Mokshit Jain, Daniel Chiu, Swyx, and Silas Alberti. Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval, October 2025. URL https://cognition.ai/blog/swe-grep. Accessed: 2026-05-12

work page 2025
[49]

We Removed 80% of Our Agent’s Tools, December 2025

Andrew Qu. We Removed 80% of Our Agent’s Tools, December 2025. URL https://vercel.com/blog/ we-removed-80-percent-of-our-agents-tools. Accessed: 2026-05-12

work page 2025
[50]

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi Gu, Xunliang Cai, Xiang Wang, and An Zhang. Skill1: Unified evolution of skill-augmented agents via reinforcement learning.arXiv preprint arXiv:2605.06130, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023
[52]

From Context to Skills: Can Language Models Learn from Context Skillfully?

Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, et al. From context to skills: Can language models learn from context skillfully? arXiv preprint arXiv:2604.27660, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Agent Skills Marketplace, 2026

SkillsMP. Agent Skills Marketplace, 2026. URLhttps://skillsmp.com/. Accessed: 2026-05-12

work page 2026
[54]

Codescout: An effective recipe for reinforcement learning of code search agents.arXiv preprint arXiv:2603.17829, 2026

Lintang Sutawika, Aditya Bharat Soni, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Zhou, Yilin Zhang, Leander Melroy Maben, Graham Neubig, et al. Codescout: An effective recipe for reinforcement learning of code search agents.arXiv preprint arXiv:2603.17829, 2026

work page arXiv 2026
[55]

Appworld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), p...

work page 2024
[56]

The Agent Skills Directory, 2026

Vercel. The Agent Skills Directory, 2026. URLhttps://skills.sh/. Accessed: 2026-05-12

work page 2026
[57]

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents.arXiv preprint arXiv:2604.04804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024. ISSN 2835-8856. URLhttps://openreview.net/forum?id=ehfRiF0R3a. 13

work page 2024
[59]

Reinforcement Learning for Self-Improving Agent with Skill Library

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Junjie Wang, Yiming Ren, and Haoyang Zhang. From procedural skills to strategy genes: Towards experience- driven test-time evolution.arXiv preprint arXiv:2604.15097, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

work page arXiv 2026
[62]

Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, et al. Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem. arXiv preprint arXiv:2512.24873, 2025

work page arXiv 2025
[63]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.arXiv preprint arXiv:2603.10165, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[64]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. InInternational Conference on Machine Learning, pages 63897–63911. PMLR, 2025

work page 2025
[65]

SkillRL: Evolving agents via recursive skill-augmented reinforcement learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. InICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving, 2026. URL https://openreview.net/forum?id=FYc2IygegR

work page 2026
[66]

Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026

Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026

work page arXiv 2026
[67]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advancesin Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advancesin Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024
[68]

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Binyan Xu, Dong Fang, Haitao Li, and Kehuan Zhang. From multi-agent to single-agent: When is skill distillation beneficial? arXiv preprint arXiv:2604.01608, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[69]

Autoskill: Experience-driven lifelong learning via skill self-evolution,

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution.arXiv preprint arXiv:2603.01145, 2026

work page arXiv 2026
[70]

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Coevoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[71]

Agentic context engineering: Evolving contexts for self-improving language models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. InThe FourteenthInternational Conference on Learning Representations, 2026. URLhttps://op...

work page 2026
[72]

Autogenesis: A Self-Evolving Agent Protocol

Wentao Zhang. Autogenesis: A self-evolving agent protocol.arXiv preprint arXiv:2604.15034, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[73]

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Experience compression spectrum: Unifying memory, skills, and rules in llm agents.arXiv preprint arXiv:2604.15877, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[74]

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, et al. Skillflow: Benchmarking lifelong skill discovery and evolution for autonomous agents.arXiv preprint arXiv:2604.17308, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[75]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

work page 2024
[76]

Synapse: Trajectory-as-exemplar prompting with memory for computer control

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. InInternational Conference on Learning Representations, volume 2024, pages 19036–19066, 2024. 14

work page 2024
[77]

Skillrouter: Retrieve-and-rerank skill selection for llm agents at scale,

YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuan Zhu, Baohua Dong, and Hangcheng Zhu. Skillrouter: Skill routing for llm agents at scale.arXiv preprint arXiv:2603.22455, 2026

work page arXiv 2026
[78]

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, et al. Externalization in llm agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[79]

Memento: Fine-tuning llm agents without fine-tuning llms

Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153, 2025

work page arXiv 2025
[80]

Memento-skills: Let agents design agents,

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents.arXiv preprint arXiv:2603.18743, 2026

work page arXiv 2026

Showing first 80 references.

[1] [1]

Agent Skills, 2026

Agent Skills. Agent Skills, 2026. URLhttps://agentskills.io/. Accessed: 2026-05-12

work page 2026

[2] [2]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Extend Claude with Skills, 2026

Anthropic. Extend Claude with Skills, 2026. URL https://code.claude.com/docs/en/skills. Accessed: 2026-05-12

work page 2026

[4] [4]

Agentrx: Diagnosing ai agent failures from execution trajectories.arXiv preprint arXiv:2602.02475, 2026

Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. Agentrx: Diagnosing ai agent failures from execution trajectories.arXiv preprint arXiv:2602.02475, 2026

work page arXiv 2026

[5] [5]

Training-free group relative policy optimization, October 2025

Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, et al. Training-free group relative policy optimization.arXiv preprint arXiv:2510.08191, 2025

work page arXiv 2025

[6] [6]

Flex: Continuous agent evolution via forward learning from experience.arXiv preprint arXiv:2511.06449, 2025

Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jinsong Su, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, and Hao Zhou. Flex: Continuous agent evolution via forward learning from experience.arXiv preprint arXiv:2511.06449, 2025

work page arXiv 2025

[7] [7]

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution.arXiv preprint arXiv:2512.10696, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses

Le Chen, Erhu Feng, Yubin Xia, and Haibo Chen. Skvm: Revisiting language vm for skills across heterogenous llms and harnesses.arXiv preprint arXiv:2604.03088, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Skillcraft: Can LLM agents learn to use tools skillfully?

Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026

work page arXiv 2026

[10] [10]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Agentprocessbench: Diagnosing step-level process quality in tool-using agents.arXiv preprint arXiv:2603.14465, 2026

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, et al. Agentprocessbench: Diagnosing step-level process quality in tool-using agents.arXiv preprint arXiv:2603.14465, 2026

work page arXiv 2026

[12] [12]

Trajectory-informed memory generation for self-improving agent systems.arXiv preprint arXiv:2603.10600, 2026

Gaodan Fang, Vatche Isahagian, KR Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, and Gegi Thomas. Trajectory-informed memory generation for self-improving agent systems.arXiv preprint arXiv:2603.10600, 2026

work page arXiv 2026

[13] [13]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, and Jie M Zhang. Skillmoo: Multi-objective optimization of agent skills for software engineering. arXiv preprint arXiv:2604.09297, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Bash Is All You Need

Ankur Goyal and Andrew Qu. Testing if “Bash Is All You Need”, January 2026. URLhttps://vercel.com/ blog/testing-if-bash-is-all-you-need. Accessed: 2026-05-12

work page 2026

[16] [16]

Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026

Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URLhttps://github.com/harbor-framework/harbor

work page 2026

[17] [17]

Mastering Hermes Skills, April 2026

Hermes. Mastering Hermes Skills, April 2026. URL https://hermes-agent.ai/blog/ hermes-agent-skills-guide. Accessed: 2026-05-12

work page 2026

[18] [18]

Cascade: Cumulative agentic skill creation through autonomous development and evolution,

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. Cascade: Cumulative agentic skill creation through autonomous development and evolution.arXiv preprint arXiv:2512.23880, 2025

work page arXiv 2025

[19] [19]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills–beyond tool use in llm agents.arXiv preprint arXiv:2602.20867, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024. 11

work page 2024

[21] [21]

Benchmarking AI Agent Memory: Is a Filesystem All You Need?, August 2025

Letta. Benchmarking AI Agent Memory: Is a Filesystem All You Need?, August 2025. URLhttps://www.letta. com/blog/benchmarking-ai-agent-memory. Accessed: 2026-05-12

work page 2025

[22] [22]

Organizing, orchestrating, and benchmarking agent skills at ecosystem scale,

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale.arXiv preprint arXiv:2603.02176, 2026

work page arXiv 2026

[23] [23]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, and Yu Zhang. Beyond semantic similarity: Rethinking retrieval for agentic search via direct corpus interaction. arXiv preprint arXiv:2605.05242, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, et al. Genericagent: A token-efficient self-evolving llm agent via contextual information density maximization (v1. 0).arXiv preprint arXiv:2604.17091, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[26] [26]

Available: https://arxiv.org/abs/2603.04448

Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, et al. Skillnet: Create, evaluate, and connect ai skills.arXiv preprintarXiv:2603.04448, 2026

work page arXiv 2026

[27] [27]

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, and Tao Gui. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

Position: Agentic evolution is the path to evolving llms.arXiv preprint arXiv:2602.00359, 2026

Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, et al. Position: Agentic evolution is the path to evolving llms.arXiv preprint arXiv:2602.00359, 2026

work page arXiv 2026

[29] [29]

Agent skills: A data-driven analysis of claude skills for extending large language model functionality.arXiv preprint arXiv:2602.08004, 2026

George Ling, Shanshan Zhong, and Richard Huang. Agent skills: A data-driven analysis of claude skills for extending large language model functionality.arXiv preprint arXiv:2602.08004, 2026

work page arXiv 2026

[30] [30]

Unifying dynamic tool creation and cross-task experience sharing through cognitive memory architecture.arXiv preprint arXiv:2512.11303, 2025

Jiarun Liu, Shiyue Xu, Yang Li, Shangkun Liu, Yongli Yu, and Peng Cao. Unifying dynamic tool creation and cross-task experience sharing through cognitive memory architecture.arXiv preprint arXiv:2512.11303, 2025

work page arXiv 2025

[31] [31]

SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support

Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, and Honglin Qiao. Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support.arXiv preprint arXiv:2604.08618, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking llm skill usage in realistic settings.arXiv preprint arXiv:2604.04323, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

Beyond static tools: Test-time tool evolution for scientific reasoning.arXiv preprint arXiv:2601.07641, 2026

Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, et al. Beyond static tools: Test-time tool evolution for scientific reasoning.arXiv preprint arXiv:2601.07641, 2026

work page arXiv 2026

[34] [34]

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Skill0: In-context agentic reinforcement learning for skill internalization.arXiv preprint arXiv:2604.02268, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[37] [37]

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Skill-pro: Learning reusable skills from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Introducing GPT-5.2, December 2025

OpenAI. Introducing GPT-5.2, December 2025. URL https://openai.com/index/introducing-gpt-5-2/. Accessed: 2026-05-12

work page 2025

[40] [40]

SkillsinChatGPT,2026

OpenAI. SkillsinChatGPT,2026. URL https://help.openai.com/en/articles/20001066-skills-in-chatgpt. Accessed: 2026-05-12

work page arXiv 2026

[41] [41]

Agent Skills – Codex, 2026

OpenAI. Agent Skills – Codex, 2026. URLhttps://developers.openai.com/codex/skills. Accessed: 2026-05- 12

work page 2026

[42] [42]

Introducing GPT-5.4 mini and nano, March 2026

OpenAI. Introducing GPT-5.4 mini and nano, March 2026. URL https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/. Accessed: 2026-05-12

work page 2026

[43] [43]

Skills – OpenClaw, 2026

OpenClaw. Skills – OpenClaw, 2026. URLhttps://docs.openclaw.ai/tools/skills. Accessed: 2026-05-12

work page 2026

[44] [44]

ClawHub: Skill Directory for OpenClaw, 2026

OpenClaw. ClawHub: Skill Directory for OpenClaw, 2026. URLhttps://clawhub.ai/. Accessed: 2026-05-12

work page 2026

[45] [45]

SkillOS: Learning Skill Curation for Self-Evolving Agents

Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra, Rui Meng, Chun-Liang Li, Yizhu Jiao, Kaiwen Zha, et al. Skillos: Learning skill curation for self-evolving agents. arXiv preprint arXiv:2605.06614, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Reasoningbank: Scaling agent self-evolving with reasoning memory

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister. Reasoningbank: Scaling agent self-evolving with reasoning memory. InTheFourteenthInternational Conference on Learning Representatio...

work page 2026

[47] [47]

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

Yipeng Ouyang, Yi Xiao, Yuhao Gu, and Xianwei Zhang. Skcc: Portable and secure skill compilation for cross-framework llm agents.arXiv preprint arXiv:2605.03353, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[48] [48]

Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval, October 2025

Ben Pan, Carlo Baronio, Albert Tam, Pietro Marsella, Mokshit Jain, Daniel Chiu, Swyx, and Silas Alberti. Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval, October 2025. URL https://cognition.ai/blog/swe-grep. Accessed: 2026-05-12

work page 2025

[49] [49]

We Removed 80% of Our Agent’s Tools, December 2025

Andrew Qu. We Removed 80% of Our Agent’s Tools, December 2025. URL https://vercel.com/blog/ we-removed-80-percent-of-our-agents-tools. Accessed: 2026-05-12

work page 2025

[50] [50]

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi Gu, Xunliang Cai, Xiang Wang, and An Zhang. Skill1: Unified evolution of skill-augmented agents via reinforcement learning.arXiv preprint arXiv:2605.06130, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[51] [51]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023

[52] [52]

From Context to Skills: Can Language Models Learn from Context Skillfully?

Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, et al. From context to skills: Can language models learn from context skillfully? arXiv preprint arXiv:2604.27660, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[53] [53]

Agent Skills Marketplace, 2026

SkillsMP. Agent Skills Marketplace, 2026. URLhttps://skillsmp.com/. Accessed: 2026-05-12

work page 2026

[54] [54]

Codescout: An effective recipe for reinforcement learning of code search agents.arXiv preprint arXiv:2603.17829, 2026

Lintang Sutawika, Aditya Bharat Soni, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Zhou, Yilin Zhang, Leander Melroy Maben, Graham Neubig, et al. Codescout: An effective recipe for reinforcement learning of code search agents.arXiv preprint arXiv:2603.17829, 2026

work page arXiv 2026

[55] [55]

Appworld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), p...

work page 2024

[56] [56]

The Agent Skills Directory, 2026

Vercel. The Agent Skills Directory, 2026. URLhttps://skills.sh/. Accessed: 2026-05-12

work page 2026

[57] [57]

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents.arXiv preprint arXiv:2604.04804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[58] [58]

Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.Transactionson Machine Learning Research, 2024. ISSN 2835-8856. URLhttps://openreview.net/forum?id=ehfRiF0R3a. 13

work page 2024

[59] [59]

Reinforcement Learning for Self-Improving Agent with Skill Library

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Junjie Wang, Yiming Ren, and Haoyang Zhang. From procedural skills to strategy genes: Towards experience- driven test-time evolution.arXiv preprint arXiv:2604.15097, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, et al. Memgovern: Enhancing code agents through learning from governed human experiences.arXiv preprint arXiv:2601.06789, 2026

work page arXiv 2026

[62] [62]

Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem

Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, et al. Let it flow: Agentic crafting on rock and roll, building the rome model within an open agentic learning ecosystem. arXiv preprint arXiv:2512.24873, 2025

work page arXiv 2025

[63] [63]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.arXiv preprint arXiv:2603.10165, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[64] [64]

Agent workflow memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. InInternational Conference on Machine Learning, pages 63897–63911. PMLR, 2025

work page 2025

[65] [65]

SkillRL: Evolving agents via recursive skill-augmented reinforcement learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. InICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving, 2026. URL https://openreview.net/forum?id=FYc2IygegR

work page 2026

[66] [66]

Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026

Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild.arXiv preprint arXiv:2603.17187, 2026

work page arXiv 2026

[67] [67]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advancesin Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advancesin Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024

[68] [68]

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Binyan Xu, Dong Fang, Haitao Li, and Kehuan Zhang. From multi-agent to single-agent: When is skill distillation beneficial? arXiv preprint arXiv:2604.01608, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[69] [69]

Autoskill: Experience-driven lifelong learning via skill self-evolution,

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution.arXiv preprint arXiv:2603.01145, 2026

work page arXiv 2026

[70] [70]

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Coevoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[71] [71]

Agentic context engineering: Evolving contexts for self-improving language models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. InThe FourteenthInternational Conference on Learning Representations, 2026. URLhttps://op...

work page 2026

[72] [72]

Autogenesis: A Self-Evolving Agent Protocol

Wentao Zhang. Autogenesis: A self-evolving agent protocol.arXiv preprint arXiv:2604.15034, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[73] [73]

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He. Experience compression spectrum: Unifying memory, skills, and rules in llm agents.arXiv preprint arXiv:2604.15877, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[74] [74]

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, et al. Skillflow: Benchmarking lifelong skill discovery and evolution for autonomous agents.arXiv preprint arXiv:2604.17308, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[75] [75]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

work page 2024

[76] [76]

Synapse: Trajectory-as-exemplar prompting with memory for computer control

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. InInternational Conference on Learning Representations, volume 2024, pages 19036–19066, 2024. 14

work page 2024

[77] [77]

Skillrouter: Retrieve-and-rerank skill selection for llm agents at scale,

YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuan Zhu, Baohua Dong, and Hangcheng Zhu. Skillrouter: Skill routing for llm agents at scale.arXiv preprint arXiv:2603.22455, 2026

work page arXiv 2026

[78] [78]

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, et al. Externalization in llm agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[79] [79]

Memento: Fine-tuning llm agents without fine-tuning llms

Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, et al. Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153, 2025

work page arXiv 2025

[80] [80]

Memento-skills: Let agents design agents,

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents.arXiv preprint arXiv:2603.18743, 2026

work page arXiv 2026