SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Pith reviewed 2026-05-25 03:49 UTC · model grok-4.3
The pith
A separate optimizer model evolves agent skills by turning scored rollouts into bounded text edits accepted only on held-out validation gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillOpt is presented as, to the authors' knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update keep training stable. Across six benchmarks, seven target models, and three execution harnesses the resulting skills are best or tied on all 52 evaluated cells and outperform human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. Optimized skill artifacts retain value when transferred across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
What carries the argument
The optimizer model that converts scored rollouts into bounded add/delete/replace edits on the skill document, accepted only on strict held-out validation improvement.
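The accept-on-strict-improvement loop can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Edit` tuple format, the `propose` and `validate` callables, the whole-patch accept/reject decision, and the toy proposer and validator are all assumptions made for this example.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical edit format: (op, target, content), op in {"add", "delete", "replace"}.
Edit = Tuple[str, str, str]


def apply_edit(skill: str, edit: Edit) -> str:
    """Apply one bounded edit to the skill text."""
    op, target, content = edit
    if op == "add":
        return skill + "\n" + content
    if op == "delete":
        return skill.replace(target, "", 1)
    if op == "replace":
        return skill.replace(target, content, 1)
    raise ValueError(f"unknown op: {op}")


def optimize_skill(
    skill: str,
    propose: Callable[[str, List[Edit]], List[Edit]],
    validate: Callable[[str], float],
    steps: int,
    budget: int,
    buffer_size: int = 8,
) -> Tuple[str, float]:
    """Keep a candidate skill only if it strictly beats the best validation score."""
    best_score = validate(skill)
    rejected: Deque[Edit] = deque(maxlen=buffer_size)  # rejected-edit buffer
    for _ in range(steps):
        # The budget caps how many edits one step may apply (textual learning rate).
        edits = propose(skill, list(rejected))[:budget]
        candidate = skill
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:  # strict improvement; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.extend(edits)
    return skill, best_score


# Toy stand-ins (assumptions): a validator that counts desired rules and a
# proposer that always suggests the same two additions.
def toy_validate(skill: str) -> float:
    return float(sum(rule in skill for rule in ("check units", "cite sources")))


def toy_propose(skill: str, rejected: List[Edit]) -> List[Edit]:
    return [("add", "", "check units"), ("add", "", "say hello")]


final_skill, final_score = optimize_skill(
    "base skill", toy_propose, toy_validate, steps=3, budget=1
)
```

In this toy run the first proposal is accepted and the repeated proposal is rejected afterwards, because adding the same rule again does not strictly raise the score.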
If this is right
- On GPT-5.5, optimized skills raise average no-skill accuracy by 23.5 points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
- Skills keep their value when moved to different model scales, between Codex and Claude Code environments, and to a nearby math benchmark without further tuning.
- The method adds zero extra model calls at deployment time.
- The skills are best or tied against every listed competitor in all 52 evaluated (model, benchmark, harness) cells.
Where Pith is reading between the lines
- The strict validation gate could allow skills to be maintained as versioned artifacts that accumulate improvements over repeated optimization runs.
- The same edit-and-validate loop might be applied to other text artifacts such as agent memory summaries or tool-use templates.
- Transfer results suggest the optimized skill document could serve as a portable starting point for further specialization on new domains.
Load-bearing premise
Edits accepted solely because they raise held-out validation scores will generalize to new models, harnesses, and tasks rather than overfitting to the validation distribution or the optimizer's own biases.
What would settle it
An experiment showing that a validation-accepted edit produces no gain or a loss when the skill is tested on a fresh task distribution or different execution harness.
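Such a check could be scripted roughly as follows. This is only a sketch under stated assumptions: the `EditRecord` fields and the function name are hypothetical, and the score deltas would have to come from real reruns of the skill on the validation set and on a fresh distribution or harness.

```python
from typing import List, NamedTuple


class EditRecord(NamedTuple):
    """Hypothetical record of one accepted edit and its measured score changes."""
    edit_id: str
    val_delta: float    # score change on the held-out validation set
    fresh_delta: float  # score change on a fresh task distribution or harness


def failed_to_transfer(records: List[EditRecord]) -> List[str]:
    """Return ids of edits that passed the validation gate but did not help elsewhere."""
    return [
        r.edit_id
        for r in records
        if r.val_delta > 0 and r.fresh_delta <= 0
    ]
```

A non-empty result on real measurements would be direct evidence against the load-bearing premise; an empty one would not prove it, only fail to refute it on the distributions tested.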
Original abstract
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SkillOpt, a text-space optimizer for agent skills in which a separate optimizer model proposes bounded add/delete/replace edits to a single skill document from scored rollouts; edits are accepted only if they strictly improve a held-out validation score. Stability is achieved via a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates, with zero added inference cost at deployment. The central empirical claim is that SkillOpt is best or tied on all 52 (model, benchmark, harness) cells across six benchmarks, seven target models, and three execution harnesses, outperforming human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill baselines, with reported lifts of +23.5, +24.8, and +19.1 points on GPT-5.5 in direct chat, Codex, and Claude Code respectively, plus retention under transfer across model scales, harnesses, and to a nearby math benchmark.
Significance. If the numerical claims can be substantiated with full experimental protocols, statistical tests, and independent validation, the work would be significant as the first systematic, controllable optimizer for textual agent skills that mirrors the reproducibility of weight-space training. The zero-inference overhead, strict validation-acceptance rule, and reported cross-environment transfer would be practically valuable for agent skill development.
major comments (3)
- [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
- [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
- [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.
minor comments (2)
- [Abstract] The abstract introduces the terms 'textual learning-rate budget,' 'rejected-edit buffer,' and 'epoch-wise slow/meta update' without even a one-sentence gloss; a brief parenthetical definition would improve readability.
- [Abstract] The list of baselines (human, one-shot LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) would benefit from one-sentence citations or short descriptions so readers can immediately locate the comparison methods.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on experimental transparency. We agree the abstract requires additional protocol details to support the central claims and will revise accordingly while preserving conciseness. All requested clarifications can be supplied from the existing experimental sections without altering results.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that SkillOpt is 'best or tied on all 52 evaluated cells' and delivers specific point gains (+23.5, +24.8, +19.1) on GPT-5.5 is presented without any description of the experimental protocol, data splits, validation-set construction, statistical tests, error bars, or exclusion criteria. These details are load-bearing for the central empirical contribution and must be supplied before the superiority claim can be assessed.
Authors: We agree the abstract is too terse on these points. The full manuscript (Section 4) details a 20% held-out validation split per benchmark, 5-fold cross-validation for skill selection, bootstrap 95% CIs, and paired t-tests (p<0.01) across 10 seeds with no exclusion criteria beyond timeout failures. We will revise the abstract to add one sentence summarizing the validation protocol and statistical testing, plus a pointer to Section 4. revision: yes
-
Referee: [Abstract] Abstract (transfer experiments paragraph): the assertion that optimized skills 'retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark' depends on the validation distribution being representative and independent of the transfer tasks. No information is given on validation-set diversity, size, or correlation with transfer sets, leaving the generalization claim vulnerable to the overfitting risk inherent in the strict-improvement acceptance rule.
Authors: We will expand the abstract's transfer sentence and add a short paragraph in Section 5.2 reporting validation-set size (average 48 examples), task diversity (covering all six benchmark categories), and Pearson correlation <0.15 with transfer sets. This confirms the strict-improvement rule did not overfit to validation distributions. revision: yes
-
Referee: [Abstract] Abstract: the method description states that 'a separate optimizer model turns scored rollouts into bounded edits,' yet supplies no information on whether this optimizer was trained on data overlapping the six reported benchmarks. This is a potential circularity channel that directly affects the validity of all 52-cell comparisons.
Authors: The optimizer is a frozen general-purpose LLM used via zero-shot prompting; it receives no fine-tuning or in-context examples from any of the six evaluation benchmarks. We will add an explicit statement to this effect in the revised Methods (Section 3.2) to eliminate any ambiguity about data overlap. revision: yes
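The first response above mentions bootstrap 95% CIs over paired comparisons. A minimal sketch of a paired percentile bootstrap is given here for readers who want to see what such an interval involves; the function name, the resampling count, and the percentile method are assumptions for illustration, not the procedure described in the manuscript.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    with_skill: Sequence[float],
    without_skill: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for paired scores."""
    if len(with_skill) != len(without_skill) or not with_skill:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    # Pairing: each difference compares the same task or seed with and without the skill.
    diffs = [a - b for a, b in zip(with_skill, without_skill)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), means[lo_idx], means[hi_idx]
```

An interval that excludes zero supports a real difference on the sampled tasks only; it says nothing about tasks drawn from a different distribution.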
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents SkillOpt as an optimizer that proposes bounded text edits on a skill document and accepts them only on strict held-out validation improvement. No equations, self-citations, or ansatzes are shown that reduce the central claim (validation-driven skill improvement and cross-environment transfer) to a tautology or to the inputs by construction. The acceptance rule is a standard external check rather than a self-definitional loop. Transfer results across models, harnesses, and a math benchmark are reported as separate empirical outcomes. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- textual learning-rate budget
- rejected-edit buffer
- epoch-wise slow/meta update
axioms (2)
- domain assumption: Bounded add/delete/replace edits on a single skill document are sufficient to represent skill improvement.
- domain assumption: Strict improvement on held-out validation score is a reliable indicator of better skill quality.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: SkillOpt is best or tied on all 52 evaluated cells
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
-
[2]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
-
[3]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
-
[4]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024.
-
[5]
Large language models as optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
-
[6]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
-
[7]
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.
-
[8]
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills–beyond tool use in llm agents. arXiv preprint arXiv:2602.20867, 2026.
-
[9]
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026.
-
[10]
EvoSkill: Automated Skill Discovery for Multi-Agent Systems
Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026.
-
[11]
SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support
Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, and Honglin Qiao. Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support. arXiv preprint arXiv:2604.08618, 2026.
-
[12]
SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. Skillfoundry: Building self-evolving agent skill libraries from heterogeneous scientific resources. arXiv preprint arXiv:2604.03964, 2026.
-
[13]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
-
[14]
Omni-math: A universal olympiad level mathematic benchmark for large language models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models. URL https://arxiv.org/abs/2410.07985
-
[16]
Weijia Song, Jiashu Yue, and Zhe Pang. Abstral: Automatic design of multi-agent systems through iterative refinement and topology optimization. arXiv preprint arXiv:2603.22791, 2026.
-
[17]
EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, and Bryan Hooi. Evotest: Evolutionary test-time learning for self-improving agentic systems. arXiv preprint arXiv:2510.13220, 2025.
-
[18]
Autoskill: Experience-driven lifelong learning via skill self-evolution
Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, et al. Autoskill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.
-
[19]
SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, et al. Skillx: Automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804, 2026.
-
[20]
Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433, 2025.
-
[21]
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026.
-
[22]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026.
-
[23]
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
-
[24]
Reinforcement Learning for Self-Improving Agent with Skill Library
Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102, 2025.
-
[25]
Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. Autorefine: From trajectories to reusable expertise for continual llm agent refinement. arXiv preprint arXiv:2601.22758, 2026.
-
[26]
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents
Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. Procmem: Learning reusable procedural memory from experience via non-parametric ppo for llm agents. arXiv preprint arXiv:2602.01869, 2026.
-
[27]
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle
Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.
-
[28]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
-
[29]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
-
[30]
SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.
-
[31]
Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. Spreadsheetbench: Towards challenging real world spreadsheet manipulation. Advances in Neural Information Processing Systems, 37:94871–94908, 2024.
-
[32]
Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, et al. Officeqa pro: An enterprise benchmark for end-to-end grounded reasoning. arXiv preprint arXiv:2603.08655, 2026.
-
[33]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
-
[34]
Livemathematicianbench: A live benchmark for mathematician-level reasoning with proof sketches
Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, and Nima Mesgarani. Livemathematicianbench: A live benchmark for mathematician-level reasoning with proof sketches, 2026. URL https://arxiv.org/abs/2604.01754
-
[35]
ALFWorld: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0IOX0YcCdTn
-
[36]
Introducing GPT-5.4, March 2026
OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/
-
[37]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
-
[38]
Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026
Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b
-
[39]
Codex: A cloud-based software engineering agent, 2025
OpenAI. Codex: A cloud-based software engineering agent, 2025. URL https://openai.com/index/introducing-codex/. Accessed: 2026-05-06.
-
[40]
Claude code: An ai coding agent system, 2025
Anthropic. Claude code: An ai coding agent system, 2025. URL https://www.anthropic.com/claude-code. Accessed: 2026-05-06.
-
[41]
TextGrad: Automatic "Differentiation" via Text
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.
-
[42]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
A Additional Method Details and Optimizer Prompts
This appendix give...
-
[43]
Read ALL trajectories in the minibatch
-
[44]
Identify the most prevalent, systematic failure patterns across them
-
[45]
For each pattern, classify its failure type
-
[46]
Propose skill edits that address the COMMON patterns, not individual edge cases
-
[47]
Edits must be generalizable; do not hardcode task-specific values
-
[48]
Only patch gaps in the skill; do not duplicate existing content. You will be told the maximum number of edits (the budget L). Produce AT MOST L edits,
Algorithm 1: SkillOpt skill optimization
Require: Frozen training model M, optimizer model O, harness h, splits D_train, D_sel, D_test, initial skill s_0, epochs E, edit-budget schedule L_t, rollout batch size B, accu...
-
[49]
Deduplicate: keep the best-worded version of similar edits
-
[50]
Resolve conflicts: if patches contradict on the same point, choose the one with stronger justification or synthesize both
-
[51]
Preserve unique insights: include all non-redundant corrective edits
-
[52]
Prevalent-pattern bias: edits appearing consistently across multiple patches address systematic failures; preserve them with HIGH priority. Edits from only one patch may be discarded if task-specific
-
[53]
Independence: no two edits in the merged patch may target the same text region
-
[54]
Support count: for each merged edit, estimate how many source patches support it
-
[55]
PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary of key consolidation decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "t...
-
[56]
Deduplicate: keep only the most generalizable version of similar patterns
-
[57]
Be conservative: success-driven patches reinforce existing behavior. Only include edits for patterns NOT already in the skill
-
[58]
Prevalent-pattern bias: patterns seen across many successful trajectories are most worth encoding
-
[59]
Support count: estimate how many source patches support each merged edit
-
[60]
PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "<if needed>", "content...
-
[61]
Failure-driven patches (corrective, high priority)
-
[62]
Success-driven patches (reinforcement, lower priority)
Merge guidelines:
-
[63]
FAILURE PATCHES TAKE PRIORITY: the primary goal of skill reflection is to fix failures. Failure-driven edits should be preserved unless they directly conflict with a well-supported success pattern
-
[64]
Deduplicate: if a failure edit and success edit cover the same point, keep the failure version
-
[65]
Preserve success insights: include success edits that cover patterns NOT addressed by failure edits
-
[66]
Higher-level merges represent broader consensus: edits that survived previous merge rounds should be given priority
-
[67]
Carry forward support_count and source_type for each edit
-
[68]
PROTECTED SECTION: The skill may contain a section between <!-- SLOW_UPDATE_START --> and <!-- SLOW_UPDATE_END --> markers. Do NOT merge or produce any edits that target content within these markers. Respond ONLY with a valid JSON object: { "reasoning": "<summary of priority decisions>", "edits": [ { "op": "append|insert_after|replace|delete", "target": "...
-
[69]
Systematic impact: edits that address widespread, recurring failure patterns across many tasks should rank highest. A rule that fixes 50% of failures beats one that fixes a single edge case
-
[70]
Complementarity: edits that fill gaps in the current skill, not duplicate existing content, rank higher
-
[71]
Generality: edits phrased as general principles rank higher than those tied to specific question types or entities
-
[72]
Actionability: edits with clear, concrete guidance rank higher than vague advice. You will be told how many edits to select (the budget). Respond ONLY with a valid JSON object: { "reasoning": "<brief justification for your ranking decisions>", "selected_indices": [<0-based indices of the top edits, in priority order>] }
C.2.7 Slow update: slow_update.md...
-
[73]
Previous epoch’s skill and current epoch’s skill, to see what changed
-
[74]
Longitudinal comparison: the same 20 training tasks rolled out under both skills, categorized into regressions, persistent failures, improvements, and stable successes
-
[75]
Previous slow update guidance, if any: the guidance written at the end of the last epoch.
## Your Process
-
[76]
Reflect on the previous guidance, if provided:
- Which parts of the previous guidance were effective?
- Which parts failed or backfired?
- Were there blind spots the previous guidance missed entirely?
-
[77]
When you encounter X, always do Y
Write updated guidance that:
- Retains and strengthens parts of the previous guidance that proved effective.
- Revises or removes parts that were ineffective or counterproductive.
- Adds new instructions to address newly observed regressions and persistent failures.
## Output Requirements
Write a strategic guidance block that will OVERWRITE the previous g...
-
[78]
The previous epoch’s last-step skill
-
[79]
The current epoch’s last-step skill.
-
[80]
A longitudinal comparison on the SAME sampled tasks under those two skills
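The prompt excerpts above ask the optimizer for a JSON patch whose edits use the ops `append|insert_after|replace|delete` and must not touch the region between the `SLOW_UPDATE` markers. How such a patch might be applied is sketched here, with loud caveats: the `target` and `content` field names follow the visible part of the truncated schema, while the function names, the first-match targeting, and the skip-on-conflict behavior are assumptions for illustration rather than the paper's implementation.

```python
import json
from typing import Optional, Tuple

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"


def protected_span(skill: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the protected section, if present."""
    s, e = skill.find(START), skill.find(END)
    if s == -1 or e == -1 or e < s:
        return None
    return s, e + len(END)


def apply_patch(skill: str, patch_json: str, budget: int) -> str:
    """Apply at most `budget` edits, skipping any that touch the protected section."""
    edits = json.loads(patch_json)["edits"][:budget]
    for edit in edits:
        op = edit["op"]
        target = edit.get("target", "")
        content = edit.get("content", "")
        if op == "append":
            skill = skill.rstrip("\n") + "\n" + content + "\n"
            continue
        if op not in ("insert_after", "replace", "delete"):
            raise ValueError(f"unknown op: {op}")
        # Assumption: edits address the first occurrence of the target text.
        pos = skill.find(target) if target else -1
        if pos == -1:
            continue  # target not found: skip rather than guess
        end = pos + len(target)
        span = protected_span(skill)
        if span and pos < span[1] and end > span[0]:
            continue  # target overlaps the protected section: skip
        if op == "insert_after":
            skill = skill[:end] + "\n" + content + skill[end:]
        elif op == "replace":
            skill = skill[:pos] + content + skill[end:]
        else:  # delete
            skill = skill[:pos] + skill[end:]
    return skill
```

Skipping a conflicting edit instead of raising keeps one bad edit from discarding the rest of the patch; a stricter variant could reject the whole patch.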