SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Bei Liu; Chong Luo; Dongdong Chen; Kai Qiu; Qi Dai; Qihao Yang; Weiquan Huang; Xuemei Gao; Xue Yang; Yan Li

SkillOpt optimizes agent skills by turning scored rollouts into bounded text edits that are kept only when they raise a held-out validation score.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:23 UTC pith:IG6F5LT4

Load-bearing objection: SkillOpt adds a controlled text optimizer with edit buffers and held-out acceptance, but the big performance claims rest on unreported experimental details.

arxiv 2605.23904 v2 pith:IG6F5LT4 submitted 2026-05-22 cs.AI cs.CL

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

classification cs.AI cs.CL

keywords: agent skills, text-space optimization, self-evolving agents, skill optimization, validation-driven editing, agent performance, controllable optimization, skill transfer

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agent skills are currently produced by hand-crafting, one-shot generation, or uncontrolled self-revision, none of which reliably improves under feedback in the way weight optimization does. The paper treats the skill as external state of a frozen agent and applies the same reproducibility discipline used for weights. A separate optimizer model converts scored rollouts into a small number of add, delete, or replace operations on one skill document; an edit is retained only if it strictly raises performance on a held-out validation set. Stability is maintained through a textual learning-rate budget, a buffer of rejected edits, and epoch-wise slow updates, with no extra model calls required at deployment. The resulting skills outperform human, one-shot, and prior evolution baselines on every one of the 52 evaluated combinations of model, benchmark, and execution harness.
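
To make the edit mechanism concrete, here is a minimal Python sketch of bounded add, delete, and replace operations on a skill document. The `Edit` type, the line-based addressing, and the `apply_edits` helper are illustrative assumptions; the paper is only described here as using a small number of such operations on one skill document, and its actual representation is not given.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position in the skill (assumed addressing scheme)
    text: Optional[str] = None  # new text for add/replace; unused for delete


def apply_edits(skill: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Return a new skill document with at most `budget` edits applied."""
    if len(edits) > budget:
        raise ValueError(f"{len(edits)} edits exceed budget {budget}")
    out = list(skill)  # copy, so the input document is left untouched
    # Apply from the highest index down so lower indices stay valid.
    for e in sorted(edits, key=lambda e: e.index, reverse=True):
        if e.op == "add":
            out.insert(e.index, e.text or "")
        elif e.op == "delete":
            del out[e.index]
        elif e.op == "replace":
            out[e.index] = e.text or ""
        else:
            raise ValueError(f"unknown op: {e.op}")
    return out


skill = ["Read the task.", "Answer quickly.", "Show your work."]
edits = [
    Edit("replace", 1, "Check units before answering."),
    Edit("add", 3, "State the final answer last."),
]
new_skill = apply_edits(skill, edits, budget=2)
```

The `budget` argument plays the role of the textual learning-rate budget in this sketch: a proposal that exceeds it is refused outright rather than truncated.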

Core claim

SkillOpt is the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is best or tied on all 52 evaluated cells and beats every per-cell competitor.

What carries the argument

The optimizer model that converts scored rollouts into a bounded set of add/delete/replace edits on the skill document, with acceptance conditioned strictly on improvement of the held-out validation score.
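
The acceptance rule described above can be sketched as a short loop. This is a toy illustration under stated assumptions: `propose` stands in for the optimizer model and `validate` for the held-out scorer, both hypothetical callables, and the rejected-edit buffer is reduced to a plain list handed back to the proposer.

```python
from typing import Callable, List, Tuple

Skill = str


def optimize_skill(
    skill: Skill,
    propose: Callable[[Skill, List[Skill]], Skill],  # hypothetical optimizer-model call
    validate: Callable[[Skill], float],              # hypothetical held-out scorer
    steps: int,
) -> Tuple[Skill, float, List[Skill]]:
    """Keep a candidate only if it strictly raises the validation score."""
    best_score = validate(skill)
    rejected: List[Skill] = []  # buffer of rejected candidates, visible to the proposer
    for _ in range(steps):
        candidate = propose(skill, rejected)
        score = validate(candidate)
        if score > best_score:  # strict improvement; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Toy stand-ins: the score is the fraction of target hints present in the skill.
HINTS = ["check units", "cite sources"]


def toy_validate(s: Skill) -> float:
    return sum(h in s for h in HINTS) / len(HINTS)


suggestions = iter([" be brief", " check units", " check units", " cite sources"])


def toy_propose(s: Skill, rejected: List[Skill]) -> Skill:
    return s + next(suggestions)


final, score, rejected = optimize_skill("Base skill.", toy_propose, toy_validate, steps=4)
```

In this toy run the first and third proposals are rejected (no gain, then a tie), which shows why the strictness of the comparison matters: a tie never replaces the current skill.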

Load-bearing premise

The held-out validation score used to accept or reject edits is a reliable, unbiased proxy for true generalization performance that does not itself require optimization or introduce selection effects.

What would settle it

An experiment in which SkillOpt produces no net gain over the initial skill on a fresh benchmark or model, or in which a non-SkillOpt baseline wins on the same validation set used for acceptance, would falsify the performance claim.

If this is right

Optimized skill documents retain value when transferred across model scales without further editing.
Skills trained inside one execution harness (Codex) continue to improve performance inside a different harness (Claude Code).
The same optimized skill lifts accuracy on a nearby math benchmark without additional optimization steps.
The method adds zero extra model calls at inference time on the target agent.
SkillOpt exceeds every listed competitor (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) inside every per-cell comparison.
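
"Best or tied" and "exceeds every competitor" are different per-cell conditions, and the difference is easy to state in code. The sketch below uses placeholder scores and generic baseline names invented for illustration; none of the numbers come from the paper.

```python
from typing import Dict


def best_or_tied(cell: Dict[str, float], method: str) -> bool:
    """True if `method` scores at least as high as every other entry in the cell."""
    return all(cell[method] >= v for k, v in cell.items() if k != method)


def strictly_best(cell: Dict[str, float], method: str) -> bool:
    """True if `method` scores strictly higher than every other entry in the cell."""
    return all(cell[method] > v for k, v in cell.items() if k != method)


# Placeholder scores keyed by (model, benchmark, harness); invented, not from the paper.
cells = {
    ("model-a", "bench-1", "chat"): {"SkillOpt": 0.71, "baseline_a": 0.66, "baseline_b": 0.64},
    ("model-a", "bench-2", "chat"): {"SkillOpt": 0.58, "baseline_a": 0.58, "baseline_b": 0.51},
}
tied_or_best = sum(best_or_tied(c, "SkillOpt") for c in cells.values())
strict_wins = sum(strictly_best(c, "SkillOpt") for c in cells.values())
```

A cell with a tie counts toward the first tally but not the second, so a per-cell results table is needed to tell which of the two claims a given cell supports.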

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of a dedicated optimizer model from the target agent creates a route to skill improvement that is independent of the base model's training.
Validation-driven text editing could be applied to other persistent artifacts such as multi-step plans or memory structures.
Transfer results suggest that skill documents may function as portable, model-agnostic modules rather than being tied to a single execution environment.
The requirement that every accepted edit must improve validation performance offers a concrete criterion for deciding when self-evolution should stop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SkillOpt adds a controlled text optimizer with edit buffers and held-out acceptance, but the big performance claims rest on unreported experimental details.

The core idea is to treat a skill document as external state and run a separate optimizer model that proposes bounded add/delete/replace edits, keeping only those that raise a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and epoch-wise meta updates are meant to stabilize the process without adding inference cost later.

That combination is new relative to the hand-crafted, one-shot, and loose self-revision baselines listed. The transfer tests across model scales and execution harnesses are a concrete strength; they show the resulting artifacts are not locked to one environment.

The soft spot is the lack of any experimental design, validation-set construction details, or statistical reporting in the abstract. The claim that SkillOpt wins or ties on all 52 cells therefore cannot be checked for selection effects on the validation score or for baseline implementation quality. If the validation prompts overlap with test distributions, the reported lifts could partly reflect that rather than skill improvement.
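
One inexpensive check a referee could request is a direct test for shared prompts between splits. The sketch below assumes each prompt carries a stable string ID and that splits are given as a name-to-IDs mapping; both are hypothetical conventions, not details from the paper. It catches exact reuse only and says nothing about distributional overlap.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def split_overlaps(splits: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the prompt IDs shared by each pair of splits (empty dict if disjoint)."""
    sets = {name: set(ids) for name, ids in splits.items()}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical split contents; "p5" is deliberately placed in two splits.
splits = {
    "train": ["p1", "p2", "p3"],
    "validation": ["p4", "p5"],
    "test": ["p5", "p6"],
}
leaks = split_overlaps(splits)
```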

This is for people building agent skills who want a more optimizer-like workflow. The structure is clear enough that a referee could usefully test the reproducibility of the gains and the validation protocol.

I would send it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper introduces SkillOpt as the first systematic controllable text-space optimizer for agent skills. A separate optimizer model converts scored rollouts into bounded add/delete/replace edits on a single skill document; edits are accepted only if they strictly improve a held-out validation score. Additional mechanisms include a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates for stability, with zero added inference cost at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is reported best or tied on all 52 (model, benchmark, harness) cells and outperforms human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with gains such as +23.5 points on GPT-5.5 in direct chat.

Significance. If the performance claims and generalization results hold after full experimental disclosure, SkillOpt would constitute a meaningful methodological advance by treating skill documents as optimizable external state with optimizer-like controls (bounded edits, validation-gated acceptance, learning-rate analogs). This could enable more reproducible skill evolution for agents and support transfer across models and harnesses without runtime overhead.

major comments (2)

[Abstract] Abstract and method description paragraph: the claim of superiority on all 52 cells with specific point gains (+23.5, +24.8, +19.1) is asserted without any description of experimental design, statistical tests, baseline re-implementations, variance estimates, or error analysis, rendering the central empirical claims impossible to evaluate.
[Method description paragraph] Method description paragraph: acceptance of every edit is conditioned solely on strict improvement of a held-out validation score, yet no information is supplied on validation-set construction, size, sampling independence from rollout prompts, or whether multiple candidate edits are evaluated against the same fixed instances (creating a multiple-testing risk). This directly undermines the claim that resulting skills generalize rather than overfit the acceptance criterion.
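
The multiple-testing risk in the second comment can be illustrated with a small simulation. It assumes, purely for illustration, that every candidate edit has the same true accuracy and that each candidate's per-instance outcomes on the fixed validation set are independent random draws; the set size and candidate count are arbitrary choices, not values from the paper.

```python
import random


def simulate_selection(n_val=50, n_candidates=40, true_acc=0.5, seed=0):
    """Accept-if-strictly-better on a fixed-size validation set when no edit truly helps."""
    rng = random.Random(seed)

    def noisy_score():
        # Every candidate has the same true accuracy; only sampling noise differs.
        return sum(rng.random() < true_acc for _ in range(n_val)) / n_val

    first = noisy_score()
    best, accepted = first, 0
    for _ in range(n_candidates):
        s = noisy_score()
        if s > best:  # the same strict-improvement rule as the acceptance gate
            best, accepted = s, accepted + 1
    return first, best, accepted


runs = [simulate_selection(seed=s) for s in range(200)]
mean_initial = sum(r[0] for r in runs) / len(runs)
mean_final = sum(r[1] for r in runs) / len(runs)
```

Under these assumptions the accepted validation score drifts upward even though no candidate is actually better, which is why the final comparison has to be made on data that played no part in acceptance.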

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our empirical reporting and validation procedures. We agree that the abstract and method description require expansion to allow proper evaluation of the claims. Below we respond to each major comment and commit to revisions that add the missing details without altering the core methodology.

Referee: [Abstract] Abstract and method description paragraph: the claim of superiority on all 52 cells with specific point gains (+23.5, +24.8, +19.1) is asserted without any description of experimental design, statistical tests, baseline re-implementations, variance estimates, or error analysis, rendering the central empirical claims impossible to evaluate.

Authors: We acknowledge that the abstract's brevity omitted key experimental context. The full manuscript (Section 4) specifies the six benchmarks, seven target models, three harnesses, baseline re-implementations (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill), and per-cell comparisons. We will revise the abstract to include a concise experimental summary and add a dedicated paragraph on statistical analysis (including 5-run variance estimates, standard errors, and significance testing via paired t-tests) to the experiments section. This will make the 52-cell results and point gains fully evaluable. Revision: yes.

Referee: [Method description paragraph] Method description paragraph: acceptance of every edit is conditioned solely on strict improvement of a held-out validation score, yet no information is supplied on validation-set construction, size, sampling independence from rollout prompts, or whether multiple candidate edits are evaluated against the same fixed instances (creating a multiple-testing risk). This directly undermines the claim that resulting skills generalize rather than overfit the acceptance criterion.

Authors: We agree the current method paragraph lacks these specifics. We will expand it to describe: (1) validation sets of 50 fixed, held-out prompts per benchmark sampled independently from rollout prompts with no overlap to test sets; (2) use of the identical validation instances for all candidate edits within an epoch to control multiple-testing risk; and (3) explicit confirmation that acceptance requires strict improvement on this fixed set. These additions will directly address concerns about overfitting versus generalization. Revision: yes.
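
For the paired t-tests over five runs mentioned in the first response, a minimal standard-library sketch looks like the following. The accuracy values are placeholders invented for illustration, and the pairing of run i of one method with run i of the other (for example by shared seed) is an assumption.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    if se == 0:
        raise ValueError("zero variance in differences; t is undefined")
    return mean(diffs) / se, len(diffs) - 1


# Placeholder accuracies for 5 matched runs (invented, not from the paper).
skillopt = [0.62, 0.65, 0.61, 0.66, 0.64]
baseline = [0.58, 0.60, 0.59, 0.61, 0.60]
t_stat, dof = paired_t(skillopt, baseline)
```

With five runs there are 4 degrees of freedom, for which the two-sided 5% critical value is about 2.776. Repeating the test across 52 cells would also call for a multiple-comparison correction, which this sketch does not attempt.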

Circularity Check

0 steps flagged

No circularity detected in claimed derivation

The paper describes an empirical optimization loop in which a separate optimizer proposes bounded edits to a skill document and accepts them only on strict improvement of a held-out validation score; final performance is then measured on separate benchmarks across models and harnesses. No equation, acceptance rule, or result is shown to reduce by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The validation score functions as an external selection filter rather than a self-referential target, and the reported gains are presented as direct empirical comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes the validation metric guides genuine improvement without further justification.

axioms (1)

domain assumption: Held-out validation score is a reliable proxy for generalization
Edits are accepted or rejected solely on this score; abstract provides no supporting evidence or robustness checks.

pith-pipeline@v0.9.1-grok · 5864 in / 1350 out tokens · 65379 ms · 2026-06-30T16:23:25.051298+00:00 · methodology

Original abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents
cs.RO 2026-06 unverdicted novelty 7.0

VASO is a verification-guided self-evolution framework for LLM robot skill contracts that reaches 97.2% formal-specification compliance on Jackal and quadcopter tasks using under 100 samples.
ACE-Brain-0.5: A Unified Embodied Foundational Model for Physical Agentic AI
cs.RO 2026-07 conditional novelty 6.0

A single 8B backbone unifies spatial perception, decision making, navigation/manipulation, and progress estimation with SSR+ merging, reporting gains on most spatial benchmarks and competitive action/progress results.
SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
cs.AI 2026-07 unverdicted novelty 6.0

SkillCoach introduces self-evolving rubrics derived from rollouts to evaluate and supervise four process dimensions of agentic skill-use separately from outcome success.
SoftSkill: Behavioral Compression for Contextual Adaptation
cs.AI 2026-06 unverdicted novelty 6.0

SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding 5.2-12.5 point gains over SkillOpt on SearchQA and LiveMath while using far fewer tokens.
A Framework for Evaluating Agentic Skills at Scale
cs.SE 2026-06 unverdicted novelty 6.0

The authors developed an evaluation framework that generates 1000 tasks from 500 real-world agent skills, applies instruction-following and goal-completion rubrics, and benchmarks 19 proprietary and open-source model ...
SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
cs.AI 2026-06 unverdicted novelty 6.0

Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.
Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI 2026-06 unverdicted novelty 6.0

SIGA is a coding-agent adapter using retrieval, procedural memory, and validation gates that raises success rate on GEOS from 0.720 to 0.789 while cutting variance 16x and matching expert quality in minutes instead of hours.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
cs.CL 2026-05 unverdicted novelty 6.0

SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
LemonHarness Technical Report
cs.AI 2026-06 unverdicted novelty 5.0

LemonHarness constrains LLM agent state changes to a defined workspace, supplies callable rule knowledge, and adds time awareness, yielding 84.49% and 86.52% accuracy on Terminal-Bench 2.0 with two GPT-5 backbones.
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
cs.LG 2026-06 unverdicted novelty 5.0

MAA formalizes alignability and comparability conditions and uses differential signals, EMA accumulation, and semantic identity merging to enable cross-batch operation-level evidence accumulation, outperforming batch-...
Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI 2026-06 unverdicted novelty 5.0

SIGA adapters let off-the-shelf coding agents produce complete, valid configurations for multiphysics simulators like GEOS in minutes rather than hours, with self-evolution further improving performance on held-out cases.
Governed Evolution of Agent Runtimes through Executable Operational Cognition
cs.SE 2026-05 unverdicted novelty 4.0

Introduces HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation in agent systems, modeling evolution as a bounded observable process over persistent operational memory.
Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models
cs.AI 2026-06 unverdicted novelty 3.0

ODYSSEY is a sheaf-theoretic framework for building verifiable foundation models as compositions of foundries via left and right Kan extensions.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 12 Pith papers · 23 internal anchors

[1]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

work page 2023
[3]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

work page 2024
[5]

Large language models as optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[6]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills–beyond tool use in llm agents.arXiv preprint arXiv:2602.20867, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support

Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, and Honglin Qiao. Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support.arXiv preprint arXiv:2604.08618, 2026. 17

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. Skillfoundry: Building self-evolving agent skill libraries from heterogeneous scientific resources.arXiv preprint arXiv:2604.03964, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl- Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Omni-math: A universal olympiad level mathematic benchmark for large language models,

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models,

work page
[15]

URLhttps://arxiv.org/abs/2410.07985

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Abstral: Automatic design of multi-agent systems through iterative refinement and topology optimization.arXiv preprint arXiv:2603.22791, 2026

Weijia Song, Jiashu Yue, and Zhe Pang. Abstral: Automatic design of multi-agent systems through iterative refinement and topology optimization.arXiv preprint arXiv:2603.22791, 2026

work page arXiv 2026
[17]

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, and Bryan Hooi. Evotest: Evolutionary test-time learning for self-improving agentic systems.arXiv preprint arXiv:2510.13220, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

AutoSkill: Experience-driven lifelong learning via skill self-evolution.arXiv preprint arXiv:2603.01145,

Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026

work page arXiv 2026
[19]

SkillX: Automatically Constructing Skill Knowledge Bases for Agents

Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents.arXiv preprint arXiv:2604.04804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification.arXiv preprint arXiv:2604.01687, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Reinforcement Learning for Self-Improving Agent with Skill Library

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

AutoRefine: From trajectories to reusable expertise for continual LLM agent refinement.arXiv preprint arXiv:2601.22758, 2026

Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. Autorefine: From trajectories to reusable expertise for continual llm agent refinement.arXiv preprint arXiv:2601.22758, 2026. 18

work page arXiv 2026
[26]

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Re- flexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

work page 2023
[29]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534– 46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534– 46594, 2023

work page 2023
[30]

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine.arXiv preprint arXiv:1704.05179, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

work page 2024
[32]

Officeqa pro: An enterprise benchmark for end-to-end grounded reasoning, 2026

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, et al. Officeqa pro: An enterprise benchmark for end-to-end grounded reasoning.arXiv preprint arXiv:2603.08655, 2026

work page arXiv 2026
[33]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

work page 2021
[34]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean

Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, and Nima Mesgarani. Livemathematicianbench: A live benchmark for mathematician-level reasoning with proof sketches, 2026. URLhttps://arxiv.org/abs/2604.01754

work page arXiv 2026
[35]

{ALFW}orld: Aligning text and embodied environments for interactive learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. {ALFW}orld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=0IOX0YcCdTn

work page 2021
[36] OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/

[37] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[38] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

[39] OpenAI. Codex: A cloud-based software engineering agent, 2025. URL https://openai.com/index/introducing-codex/. Accessed: 2026-05-06.

[40] Anthropic. Claude Code: An AI coding agent system, 2025. URL https://www.anthropic.com/claude-code. Accessed: 2026-05-06.

[41] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.

[42] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.

A Additional Method Details and Optimizer Prompts

This appendix give...

[43] Read ALL trajectories in the minibatch

[44] Identify the most prevalent, systematic failure patterns across them

[45] For each pattern, classify its failure type

[46] Propose skill edits that address the COMMON patterns, not individual edge cases

[47] Edits must be generalizable; do not hardcode task-specific values

[48] batch_size

Only patch gaps in the skill; do not duplicate existing content. You will be told the maximum number of edits (the budget L). Produce AT MOST L edits,

Algorithm 1 SkillOpt skill optimization

Require: Frozen training model M, optimizer model O, harness h, splits D_train, D_sel, D_test, initial skill s_0, epochs E, edit-budget schedule L_t, rollout batch size B, accu...

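The acceptance rule summarized at the top of this page (an edit is retained only if it strictly raises the held-out validation score, and rejected edits go into a buffer) can be sketched as below. This is a minimal illustration, not the paper's Algorithm 1: the function name `accept_if_better`, the `score_fn` callback, and the list used as a rejected-edit buffer are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def accept_if_better(
    skill: str,
    candidate: str,
    score_fn: Callable[[str], float],
    best_score: float,
    rejected: List[str],
) -> Tuple[str, float]:
    """Keep `candidate` only when it strictly beats `best_score`.

    `score_fn` is a hypothetical stand-in for evaluating a skill on the
    held-out selection split; `rejected` is an assumed buffer of rejected
    candidates.
    """
    cand_score = score_fn(candidate)
    if cand_score > best_score:  # strict improvement; ties are rejected
        return candidate, cand_score
    rejected.append(candidate)  # remember what did not help
    return skill, best_score
```

With a toy scorer, a candidate that ties the current best is discarded and the current skill is returned unchanged.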
[49] Deduplicate: keep the best-worded version of similar edits

[50] Resolve conflicts: if patches contradict on the same point, choose the one with stronger justification or synthesize both

[51] Preserve unique insights: include all non-redundant corrective edits

[52] Prevalent-pattern bias: edits appearing consistently across multiple patches address systematic failures; preserve them with HIGH priority. Edits from only one patch may be discarded if task-specific

[53] Independence: no two edits in the merged patch may target the same text region

[54] Support count: for each merged edit, estimate how many source patches support it

[55] PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:

{ "reasoning": "<summary of key consolidation decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "t...

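The edit operations and the protected-section rule above can be illustrated with a small sketch. It assumes each edit is a dictionary with `op`, `target`, and `content` keys and that `target` is matched as a literal substring; the prompt text is truncated, so these field semantics and the helper names `protected_span` and `apply_edit` are assumptions, not the paper's implementation.

```python
START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"


def protected_span(skill):
    """Return (start, end) character offsets of the protected block, or None."""
    i = skill.find(START)
    j = skill.find(END)
    if i == -1 or j == -1 or j < i:
        return None
    return i, j + len(END)


def apply_edit(skill, edit):
    """Apply one edit; return the skill unchanged if the edit is invalid
    or its target overlaps the protected section."""
    op = edit.get("op")
    target = edit.get("target", "")
    content = edit.get("content", "")
    if op == "append":
        # Appending adds text at the end, outside any protected block.
        return skill.rstrip("\n") + "\n" + content + "\n"
    pos = skill.find(target) if target else -1
    if pos == -1:
        return skill  # target not found: skip the edit
    span = protected_span(skill)
    if span and pos < span[1] and pos + len(target) > span[0]:
        return skill  # target touches the protected section: skip
    end = pos + len(target)
    if op == "replace":
        return skill[:pos] + content + skill[end:]
    if op == "delete":
        return skill[:pos] + skill[end:]
    if op == "insert_after":
        return skill[:end] + "\n" + content + skill[end:]
    return skill  # unknown op: skip
```

Skipping an invalid edit rather than raising keeps one bad edit from aborting a whole patch; a stricter variant could raise instead.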
[56] Deduplicate: keep only the most generalizable version of similar patterns

[57] Be conservative: success-driven patches reinforce existing behavior. Only include edits for patterns NOT already in the skill

[58] Prevalent-pattern bias: patterns seen across many successful trajectories are most worth encoding

[59] Support count: estimate how many source patches support each merged edit

[60] PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:

{ "reasoning": "<summary>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "<if needed>", "content...

[61] Failure-driven patches (corrective, high priority)

[62] Success-driven patches (reinforcement, lower priority)

Merge guidelines:

[63] FAILURE PATCHES TAKE PRIORITY: the primary goal of skill reflection is to fix failures. Failure-driven edits should be preserved unless they directly conflict with a well-supported success pattern

[64] Deduplicate: if a failure edit and success edit cover the same point, keep the failure version

[65] Preserve success insights: include success edits that cover patterns NOT addressed by failure edits

[66] Higher-level merges represent broader consensus: edits that survived previous merge rounds should be given priority

[67] Carry forward support_count and source_type for each edit

[68] PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers.

Respond ONLY with a valid JSON object:

{ "reasoning": "<summary of priority decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "...

[69] Systematic impact: edits that address widespread, recurring failure patterns across many tasks should rank highest. A rule that fixes 50% of failures beats one that fixes a single edge case

[70] Complementarity: edits that fill gaps in the current skill, not duplicate existing content, rank higher

[71] Generality: edits phrased as general principles rank higher than those tied to specific question types or entities

[72] Actionability: edits with clear, concrete guidance rank higher than vague advice.

You will be told how many edits to select (the budget). Respond ONLY with a valid JSON object:

{ "reasoning": "<brief justification for your ranking decisions>", "selected_indices": [<0-based indices of the top edits, in priority order>] }

C.2.7 Slow update: slow_update.md...

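The ranking response above returns 0-based `selected_indices` in priority order, capped by a budget. A minimal sketch of consuming such a response follows; the function name `select_edits` and the choice to silently drop out-of-range or repeated indices are assumptions for illustration, since the source does not say how malformed responses are handled.

```python
import json


def select_edits(edits, response_text, budget):
    """Return up to `budget` edits, in the ranker's priority order.

    `response_text` is assumed to be the raw JSON object from the ranking
    prompt; invalid or duplicate indices are skipped (an assumption).
    """
    indices = json.loads(response_text).get("selected_indices", [])
    chosen = []
    seen = set()
    for i in indices:
        if len(chosen) >= budget:
            break
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_index = isinstance(i, int) and not isinstance(i, bool)
        if is_index and 0 <= i < len(edits) and i not in seen:
            chosen.append(edits[i])
            seen.add(i)
    return chosen
```

Enforcing the budget on the caller's side means the cap holds even when the model returns more indices than it was asked for.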
[73] Previous epoch’s skill and current epoch’s skill, to see what changed

[74] Longitudinal comparison: the same 20 training tasks rolled out under both skills, categorized into regressions, persistent failures, improvements, and stable successes

[75] Previous slow update guidance, if any: the guidance written at the end of the last epoch.

## Your Process

[76] Reflect on the previous guidance, if provided:
- Which parts of the previous guidance were effective?
- Which parts failed or backfired?
- Were there blind spots the previous guidance missed entirely?

[77] When you encounter X, always do Y

Write updated guidance that:
- Retains and strengthens parts of the previous guidance that proved effective.
- Revises or removes parts that were ineffective or counterproductive.
- Adds new instructions to address newly observed regressions and persistent failures.

## Output Requirements

Write a strategic guidance block that will OVERWRITE the previous g...

[78] The previous epoch’s last-step skill

[79] The current epoch’s last-step skill.

[80] A longitudinal comparison on the SAME sampled tasks under those two skills

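The longitudinal comparison described above sorts the same tasks, rolled out under the previous and current skills, into four groups. The sketch below assumes each rollout reduces to a pass/fail boolean per task; the source does not state how outcomes are thresholded, and the function name `categorize` and the bucket keys are illustrative.

```python
from typing import Dict, List


def categorize(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, List[str]]:
    """Bucket tasks by outcome under the previous vs. current skill.

    `prev` and `curr` map task id -> success flag (an assumed binary
    outcome). Only tasks present in both runs are compared.
    """
    buckets: Dict[str, List[str]] = {
        "regressions": [],
        "persistent_failures": [],
        "improvements": [],
        "stable_successes": [],
    }
    for task in sorted(prev.keys() & curr.keys()):
        before, after = prev[task], curr[task]
        if before and not after:
            buckets["regressions"].append(task)
        elif not before and not after:
            buckets["persistent_failures"].append(task)
        elif not before and after:
            buckets["improvements"].append(task)
        else:
            buckets["stable_successes"].append(task)
    return buckets
```

Regressions and persistent failures are the two groups the slow-update prompt asks the new guidance to address.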
