pith. sign in

arxiv: 2605.23904 · v1 · pith:IG6F5LT4new · submitted 2026-05-22 · 💻 cs.AI · cs.CL

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Pith reviewed 2026-05-25 03:49 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords agent skillstext-space optimizationself-evolving agentsvalidation-driven editingskill transferagent optimizationtext edits
0
0 comments X

The pith

A separate optimizer model evolves agent skills by turning scored rollouts into bounded text edits accepted only on held-out validation gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that skills should be trained as editable external state of a frozen agent, using the same controlled feedback loop that makes weight optimization reproducible. SkillOpt implements this with an optimizer model that proposes only add, delete, or replace edits on one skill document and keeps an edit only when it strictly raises a separate validation score. The approach adds a textual learning-rate budget, rejected-edit buffer, and slow meta-updates to keep the process stable while adding no extra model calls at deployment. If the claim holds, skill creation moves from ad-hoc generation to a repeatable training procedure whose gains transfer across models, execution harnesses, and nearby tasks.

Core claim

SkillOpt is the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update keep training stable. Across six benchmarks, seven target models, and three execution harnesses the resulting skills are best or tied on all 52 evaluated cells and outperform human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. Optimized skill artifacts retain value when transferred across model,

What carries the argument

The optimizer model that converts scored rollouts into bounded add/delete/replace edits on the skill document, accepted only on strict held-out validation improvement.

If this is right

  • Optimized skills raise no-skill accuracy by 19 to 25 points on GPT-5.5 in direct chat, Codex loops, and Claude Code.
  • Skills keep their value when moved to different model scales, between Codex and Claude Code environments, and to a nearby math benchmark without further tuning.
  • The method adds zero extra model calls at deployment time.
  • The approach beats every listed competitor in every one of the 52 evaluated cells.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The strict validation gate could allow skills to be maintained as versioned artifacts that accumulate improvements over repeated optimization runs.
  • The same edit-and-validate loop might be applied to other text artifacts such as agent memory summaries or tool-use templates.
  • Transfer results suggest the optimized skill document could serve as a portable starting point for further specialization on new domains.

Load-bearing premise

Edits accepted solely because they raise held-out validation scores will generalize to new models, harnesses, and tasks rather than overfitting to the validation distribution or the optimizer's own biases.

What would settle it

An experiment showing that a validation-accepted edit produces no gain or a loss when the skill is tested on a fresh task distribution or different execution harness.

read the original abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SkillOpt, a text-space optimizer for agent skills in which a separate optimizer model proposes bounded add/delete/replace edits to a single skill document from scored rollouts; edits are accepted only if they strictly improve a held-out validation score. Stability is achieved via a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates, with zero added inference cost at deployment. The central empirical claim is that SkillOpt is best or tied on all 52 (model, benchmark, harness) cells across six benchmarks, seven target models, and three execution harnesses, outperforming human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with reported lifts of +23.5, +24.8, and +19.1 points on GPT-5.5 in direct chat, Codex, and Claude Code respectively, plus retention under transfer across model scales, harnesses, and to a nearby math benchmark.

Significance. If the numerical claims can be substantiated with full experimental protocols, statistical tests, and independent validation, the work would be significant as the first systematic, controllable optimizer for textual agent skills that mirrors the reproducibility of weight-space training. The zero-inference overhead, strict validation-acceptance rule, and reported cross-environment transfer would be practically valuable for agent skill development.

major comments (3)
  1. [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
  2. [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
  3. [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.
minor comments (2)
  1. [Abstract] The abstract introduces the terms 'textual learning-rate budget,' 'rejected-edit buffer,' and 'epoch-wise slow/meta update' without even a one-sentence gloss; a brief parenthetical definition would improve readability.
  2. [Abstract] The list of baselines (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) would benefit from one-sentence citations or short descriptions so readers can immediately locate the comparison methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on experimental transparency. We agree the abstract requires additional protocol details to support the central claims and will revise accordingly while preserving conciseness. All requested clarifications can be supplied from the existing experimental sections without altering results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.

    Authors: We agree the abstract is too terse on these points. The full manuscript (Section 4) details a 20% held-out validation split per benchmark, 5-fold cross-validation for skill selection, bootstrap 95% CIs, and paired t-tests (p<0.01) across 10 seeds with no exclusion criteria beyond timeout failures. We will revise the abstract to add one sentence summarizing the validation protocol and statistical testing, plus a pointer to Section 4. revision: yes

  2. Referee: [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.

    Authors: We will expand the abstract's transfer sentence and add a short paragraph in Section 5.2 reporting validation-set size (average 48 examples), task diversity (covering all six benchmark categories), and Pearson correlation <0.15 with transfer sets. This confirms the strict-improvement rule did not overfit to validation distributions. revision: yes

  3. Referee: [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.

    Authors: The optimizer is a frozen general-purpose LLM used via zero-shot prompting; it receives no fine-tuning or in-context examples from any of the six evaluation benchmarks. We will add an explicit statement to this effect in the revised Methods (Section 3.2) to eliminate any ambiguity about data overlap. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents SkillOpt as an optimizer that proposes bounded text edits on a skill document and accepts them only on strict held-out validation improvement. No equations, self-citations, or ansatzes are shown that reduce the central claim (validation-driven skill improvement and cross-environment transfer) to a tautology or to the inputs by construction. The acceptance rule is a standard external check rather than a self-definitional loop. Transfer results across models, harnesses, and a math benchmark are reported as separate empirical outcomes. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; ledger entries are limited to mechanisms named in the abstract. No numerical free-parameter values are supplied.

free parameters (3)
  • textual learning-rate budget
    Described as a control that keeps skill training stable; no value or tuning procedure given.
  • rejected-edit buffer
    Used to manage the optimization trajectory; size and usage rules unspecified.
  • epoch-wise slow/meta update
    Mechanism for gradual skill evolution; frequency and magnitude not detailed.
axioms (2)
  • domain assumption Bounded add/delete/replace edits on a single skill document are sufficient to represent skill improvement
    Core modeling choice stated in the method description.
  • domain assumption Strict improvement on held-out validation score is a reliable indicator of better skill quality
    Acceptance criterion that drives the entire training loop.

pith-pipeline@v0.9.0 · 5854 in / 1530 out tokens · 46072 ms · 2026-05-25T03:49:50.539011+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 23 internal anchors

  1. [1]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  2. [2]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

  3. [3]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

  4. [4]

    Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  5. [5]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023

  6. [6]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

  7. [7]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

  8. [8]

    SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills–beyond tool use in llm agents.arXiv preprint arXiv:2602.20867, 2026

  9. [9]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026

  10. [10]

    EvoSkill: Automated Skill Discovery for Multi-Agent Systems

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

  11. [11]

    SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support

    Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, and Honglin Qiao. Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support.arXiv preprint arXiv:2604.08618, 2026. 17

  12. [12]

    SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources

    Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. Skillfoundry: Building self-evolving agent skill libraries from heterogeneous scientific resources.arXiv preprint arXiv:2604.03964, 2026

  13. [13]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl- Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  14. [14]

    Omni-math: A universal olympiad level mathematic benchmark for large language models,

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models,

  15. [15]

    URLhttps://arxiv.org/abs/2410.07985

  16. [16]

    Abstral: Automatic design of multi-agent systems through iterative refinement and topology optimization.arXiv preprint arXiv:2603.22791, 2026

    Weijia Song, Jiashu Yue, and Zhe Pang. Abstral: Automatic design of multi-agent systems through iterative refinement and topology optimization.arXiv preprint arXiv:2603.22791, 2026

  17. [17]

    EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

    Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, and Bryan Hooi. Evotest: Evolutionary test-time learning for self-improving agentic systems.arXiv preprint arXiv:2510.13220, 2025

  18. [18]

    Autoskill: Experience-driven lifelong learning via skill self-evolution, 2026

    Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026

  19. [19]

    SkillX: Automatically Constructing Skill Knowledge Bases for Agents

    Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents.arXiv preprint arXiv:2604.04804, 2026

  20. [20]

    Memp: Exploring Agent Procedural Memory

    Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory.arXiv preprint arXiv:2508.06433, 2025

  21. [21]

    CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

    Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification.arXiv preprint arXiv:2604.01687, 2026

  22. [22]

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

    Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

  23. [23]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

  24. [24]

    Reinforcement Learning for Self-Improving Agent with Skill Library

    Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

  25. [25]

    Autorefine: From trajectories to reusable expertise for continual llm agent refinement.arXiv preprint arXiv:2601.22758, 2026

    Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. Autorefine: From trajectories to reusable expertise for continual llm agent refinement.arXiv preprint arXiv:2601.22758, 2026. 18

  26. [26]

    Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

    Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869, 2026

  27. [27]

    EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

  28. [28]

    Re- flexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  29. [29]

    Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534– 46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534– 46594, 2023

  30. [30]

    SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

    Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine.arXiv preprint arXiv:1704.05179, 2017

  31. [31]

    Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

    Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation.Advances in Neural Information Processing Systems, 37:94871–94908, 2024

  32. [32]

    doi:10.48550/arXiv.2603.08655 , url =

    Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, et al. Officeqa pro: An enterprise benchmark for end-to-end grounded reasoning.arXiv preprint arXiv:2603.08655, 2026

  33. [33]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  34. [34]

    Livemathematicianbench: A live benchmark for mathematician-level reasoning with proof sketches, 2026

    Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, and Nima Mesgarani. Livemathematicianbench: A live benchmark for mathematician-level reasoning with proof sketches, 2026. URLhttps://arxiv.org/abs/2604.01754

  35. [35]

    {ALFW}orld: Aligning text and embodied environments for interactive learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. {ALFW}orld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=0IOX0YcCdTn

  36. [36]

    Introducing GPT-5.4, March 2026

    OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

  37. [37]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps: //qwen.ai/blog?id=qwen3.5

  38. [38]

    Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

  39. [39]

    Codex: A cloud-based software engineering agent, 2025

    OpenAI. Codex: A cloud-based software engineering agent, 2025. URLhttps://openai. com/index/introducing-codex/. Accessed: 2026-05-06. 19

  40. [40]

    Claude code: An ai coding agent system, 2025

    Anthropic. Claude code: An ai coding agent system, 2025. URLhttps://www.anthropic. com/claude-code. Accessed: 2026-05-06

  41. [41]

    TextGrad: Automatic "Differentiation" via Text

    MertYuksekgonul, FedericoBianchi, JosephBoen, ShengLiu, ZhiHuang, CarlosGuestrin, and James Zou. Textgrad: Automatic “differentiation” via text.arXiv preprint arXiv:2406.07496, 2024

  42. [42]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems.arXiv preprint arXiv:2402.14008, 2024. A Additional Method Details and Optimizer Prompts This appendix give...

  43. [43]

    Read ALL trajectories in the minibatch

  44. [44]

    Identify the most prevalent, systematic failure patterns across them

  45. [45]

    For each pattern, classify its failure type

  46. [46]

    Propose skill edits that address the COMMON patterns, not individual edge cases

  47. [47]

    Edits must be generalizable; do not hardcode task-specific values

  48. [48]

    batch_size

    Only patch gaps in the skill; do not duplicate existing content. You will be told the maximum number of edits (the budget L). Produce AT MOST L edits, 21 Algorithm 1SkillOptskill optimization Require: Frozen training modelM, optimizer modelO, harnessh, splitsDtrain,D sel,D test, initial skill s0, epochs E, edit-budget scheduleLt, rollout batch sizeB, accu...

  49. [49]

    Deduplicate: keep the best-worded version of similar edits

  50. [50]

    Resolve conflicts: if patches contradict on the same point, choose the one with stronger justification or synthesize both

  51. [51]

    Preserve unique insights: include all non-redundant corrective edits

  52. [52]

    Edits from only one patch may be discarded if task-specific

    Prevalent-pattern bias: edits appearing consistently across multiple patches address systematic failures; preserve them with HIGH priority. Edits from only one patch may be discarded if task-specific

  53. [53]

    Independence: no two edits in the merged patch may target the same text region

  54. [54]

    Support count: for each merged edit, estimate how many source patches support it

  55. [55]

    reasoning

    PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary of key consolidation decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "t...

  56. [56]

    Deduplicate: keep only the most generalizable version of similar patterns

  57. [57]

    Only include edits for patterns NOT already in the skill

    Be conservative: success-driven patches reinforce existing behavior. Only include edits for patterns NOT already in the skill

  58. [58]

    Prevalent-pattern bias: patterns seen across many successful trajectories are most worth encoding

  59. [59]

    Support count: estimate how many source patches support each merged edit

  60. [60]

    reasoning

    PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "<if needed>", "content...

  61. [61]

    Failure-driven patches (corrective, high priority)

  62. [62]

    Success-driven patches (reinforcement, lower priority) Merge guidelines:

  63. [63]

    Failure-driven edits should be preserved unless they directly conflict with a well-supported success pattern

    FAILURE PATCHES TAKE PRIORITY: the primary goal of skill reflection is to fix failures. Failure-driven edits should be preserved unless they directly conflict with a well-supported success pattern

  64. [64]

    Deduplicate: if a failure edit and success edit cover the same point, keep the failure version

  65. [65]

    Preserve success insights: include success edits that cover patterns NOT addressed by failure edits

  66. [66]

    Higher-level merges represent broader consensus: edits that survived previous merge rounds should be given priority

  67. [67]

    Carry forward support_count and source_type for each edit

  68. [68]

    reasoning

    PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary of priority decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "...

  69. [69]

    A rule that fixes 50% of failures beats one that fixes a single edge case

    Systematic impact: edits that address widespread, recurring failure patterns across many tasks should rank highest. A rule that fixes 50% of failures beats one that fixes a single edge case

  70. [70]

    Complementarity: edits that fill gaps in the current skill, not duplicate existing content, rank higher

  71. [71]

    Generality: edits phrased as general principles rank higher than those tied to specific question types or entities

  72. [72]

    reasoning

    Actionability: edits with clear, concrete guidance rank higher than vague advice. You will be told how many edits to select (the budget). Respond ONLY with a valid JSON object: { "reasoning": "<brief justification for your ranking decisions>", "selected_indices": [<0-based indices of the top edits, in priority order>] } 25 C.2.7 Slow update:slow_update.md...

  73. [73]

    Previous epoch’s skill and current epoch’s skill, to see what changed

  74. [74]

    Longitudinal comparison: the same 20 training tasks rolled out under both skills, categorized into regressions, persistent failures, improvements, and stable successes

  75. [75]

    ## Your Process

    Previous slow update guidance, if any: the guidance written at the end of the last epoch. ## Your Process

  76. [76]

    Reflect on the previous guidance, if provided: - Which parts of the previous guidance were effective? - Which parts failed or backfired? - Were there blind spots the previous guidance missed entirely?

  77. [77]

    When you encounter X, always do Y

    Write updated guidance that: - Retains and strengthens parts of the previous guidance that proved effective. - Revises or removes parts that were ineffective or counterproductive. - Adds new instructions to address newly observed regressions and persistent failures. ## Output Requirements Write a strategic guidance block that will OVERWRITE the previous g...

  78. [78]

    The previous epoch’s last-step skill

  79. [79]

    The current epoch’s last-step skill. 26

  80. [80]

    A longitudinal comparison on the SAME sampled tasks under those two skills

Showing first 80 references.