CODESKILL: Learning Self-Evolving Skills for Coding Agents

Xiaoxia Liu; Xiaoyu Zhang; Yang Liu; Yanzhou Li; Yiran Zhang

arxiv: 2605.25430 · v1 · pith:4FC5LAZUnew · submitted 2026-05-25 · 💻 cs.AI

CODESKILL: Learning Self-Evolving Skills for Coding Agents

Yanzhou Li , Yiran Zhang , Xiaoyu Zhang , Xiaoxia Liu , Yang Liu This is my paper

Pith reviewed 2026-06-29 22:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords coding agentsskill extractionreinforcement learningself-evolutionLLM agentsprocedural skillssoftware engineeringskill bank maintenance

0 comments

The pith

CODESKILL trains a reinforcement learning policy to extract and maintain reusable skills from coding agent trajectories, raising task success rates while keeping the skill bank stable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that coding agents can self-evolve by distilling their trajectories into procedural skills through a learned management policy instead of fixed prompts or heuristics. This policy is optimized with reinforcement learning that rewards both the internal quality of each skill and its measurable effect on downstream task execution by a frozen agent. A sympathetic reader would care because the approach promises agents that accumulate compact, reusable experience across iterations without the skill collection growing out of control or depending on brittle manual rules. Experiments on three benchmarks demonstrate concrete gains in pass rates alongside stable bank size.

Core claim

CODESKILL reformulates skill extraction and skill-bank maintenance as a learnable management policy. The policy is trained with reinforcement learning using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. It extracts multi-granularity procedural skills from trajectories, evolves them with new experience, and maintains a compact bank, producing average pass-rate improvements of 9.69 over the no-skill baseline and 4.01 over the strongest prompt-based or memory baseline on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 while holding bank size stable during iterative construction.

What carries the argument

A reinforcement learning management policy that decides when and how to extract, evolve, and retain multi-granularity procedural skills from agent trajectories using hybrid quality-plus-execution rewards.

If this is right

Downstream agents using the maintained skill bank complete more tasks on software-engineering benchmarks.
The skill bank remains compact and stable in size across repeated construction cycles.
The learned policy outperforms both no-skill setups and strong fixed-prompt or memory baselines by measurable margins.
Multi-granularity skills support tasks of varying complexity without manual abstraction rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same policy-learning approach could be tested on agent domains outside coding, such as web interaction or scientific workflows.
Hybrid rewards may reduce reliance on purely human-curated skill labels by leveraging execution signals.
If the policy transfers across task distributions, deployed agents could keep improving their skill banks without periodic full retraining.

Load-bearing premise

The hybrid reward from skill-quality rubrics and downstream execution success is sufficient to train a policy whose extracted skills genuinely improve agent performance on new tasks rather than merely fitting the training distribution.

What would settle it

Retraining the management policy on fresh trajectories and finding that the resulting skill bank either grows substantially in size or produces downstream pass rates no higher than the no-skill baseline on the same benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.25430 by Xiaoxia Liu, Xiaoyu Zhang, Yang Liu, Yanzhou Li, Yiran Zhang.

**Figure 1.** Figure 1: Overview of the CODESKILL pipeline. vates adaptive skill management rather than relying on a fixed skill management strategy. 3 CODESKILL 3.1 Problem Formalization We consider a frozen downstream coding policy π that solves a software-engineering task x ∈ X by interacting with a repository or terminal environment. A rollout produces a trajectory τ = (o1, a1, . . . , oT , aT , y), where ot is an observatio… view at source ↗

**Figure 2.** Figure 2: Training pipeline of CODESKILL. procedural knowledge over time, each newly extracted or evolved candidate skill is further passed to a maintenance stage. Similar skills are retrieved from the current bank and provided to Mθ together with the candidate. Based on this context, Mθ outputs a maintenance operation that either adds the candidate, merges it with an existing skill, or drops it. The add operation … view at source ↗

**Figure 3.** Figure 3: Analysis of skill-bank maintenance and RL training dynamics. The left panel shows cumulative add, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Example task-level skill from the maintained skill bank. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Example event-driven skill from the maintained skill bank. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt template for task-level skill extraction. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt template for event-driven skill extraction. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt Template for skill evolution [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt template for skill-bank maintenance. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Rubric template for task-level skill quality judgment. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Rubric template for event-driven skill quality judgment. [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Rubric template for skill evolution quality judgment. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Rubric template for judging merge decisions in skill-bank maintenance. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Rubric template for behavior alignment judgment. [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

read the original abstract

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CODESKILL reframes skill-bank maintenance as an RL policy and reports pass-rate gains, but the abstract gives almost no experimental detail so the numbers are hard to trust yet.

read the letter

The core move is treating skill extraction and bank updates as a single trainable policy instead of prompt engineering or hand-coded rules. That is a clean shift from the fixed methods they contrast against, and the hybrid reward (rubric plus execution signal from a frozen agent) is a reasonable way to supervise it. They also keep the bank size stable across iterations, which is a practical plus.

The reported lifts—9.69 over no-skill and 4.01 over the best baseline—sound useful on EnvBench, SWE-Bench Verified, and Terminal-Bench 2. But the abstract supplies no controls, no ablation on the reward weights, no statistical tests, and no check that the skills still help when the downstream agent changes. Without those, it is easy to imagine the policy learning to produce rubric-pleasing text that happens to work on the training distribution rather than genuinely reusable procedures.

The stress-test worry about distribution fitting is therefore live; the paper would need to show that the gains survive agent swaps or reward-component removal. Right now the evidence is too thin to know.

This is for people building self-improving coding agents who want a concrete RL alternative to prompt-based memory. A serious referee should see it because the formulation is worth testing properly, even if the current experiments need substantial strengthening. I would send it out for review but flag the missing controls as the first thing to fix.

Referee Report

3 major / 2 minor

Summary. The paper proposes CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy trained via reinforcement learning. It extracts multi-granularity procedural skills from coding-agent trajectories, evolves them with new experience, and maintains a compact skill bank. Training uses a hybrid reward combining dense rubric-based skill-quality feedback with sparse execution feedback from a frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 report average pass-rate gains of 9.69 over a no-skill baseline and 4.01 over the strongest prompt-based or memory baseline, while keeping the skill bank size stable.

Significance. If the central claim holds after proper controls, the work would provide a concrete, learnable alternative to heuristic skill management in coding agents, with potential for broader self-evolution pipelines. The hybrid-reward formulation and reported stability of the skill bank during iterative construction are the most distinctive elements.

major comments (3)

[§4] §4 (Experiments): The reported pass-rate improvements (9.69 / 4.01) are presented without any description of statistical significance tests, variance across runs, or explicit controls for baseline implementation details (e.g., prompt templates, memory size, or temperature settings). This makes it impossible to determine whether the gains are robust or attributable to the RL-trained policy.
[§3.2] §3.2 (Hybrid Reward): No ablation is reported that isolates the contribution of the dense rubric component versus the sparse execution feedback. Without such a test, it remains possible that the management policy learns to produce skills that score well on the rubric but do not causally improve downstream agent performance on held-out tasks.
[§4.3] §4.3 (Cross-benchmark evaluation): The paper does not test whether skills learned with one downstream agent transfer when the frozen agent is replaced by a different model or architecture. This directly bears on the claim that the hybrid reward produces generally useful procedural knowledge rather than distribution-specific artifacts.

minor comments (2)

[Abstract, §1] The abstract and §1 should explicitly state the number of independent runs and any hyper-parameter search procedure used for the hybrid reward weights.
[Figure 3] Figure 3 (skill-bank size over iterations) would benefit from error bars or multiple runs to substantiate the stability claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor. We address each major comment below and describe the planned revisions.

read point-by-point responses

Referee: §4 (Experiments): The reported pass-rate improvements (9.69 / 4.01) are presented without any description of statistical significance tests, variance across runs, or explicit controls for baseline implementation details (e.g., prompt templates, memory size, or temperature settings). This makes it impossible to determine whether the gains are robust or attributable to the RL-trained policy.

Authors: We agree that the current manuscript lacks these details. In the revision we will report means and standard deviations over multiple runs with different seeds, include paired statistical significance tests, and document all baseline hyperparameters, prompt templates, memory sizes, and temperature settings for full reproducibility. revision: yes
Referee: §3.2 (Hybrid Reward): No ablation is reported that isolates the contribution of the dense rubric component versus the sparse execution feedback. Without such a test, it remains possible that the management policy learns to produce skills that score well on the rubric but do not causally improve downstream agent performance on held-out tasks.

Authors: We acknowledge the absence of this ablation. The revised manuscript will add an ablation comparing policies trained with rubric-only, execution-only, and hybrid rewards, measuring their effects on downstream pass rates to isolate each component's contribution. revision: yes
Referee: §4.3 (Cross-benchmark evaluation): The paper does not test whether skills learned with one downstream agent transfer when the frozen agent is replaced by a different model or architecture. This directly bears on the claim that the hybrid reward produces generally useful procedural knowledge rather than distribution-specific artifacts.

Authors: Our experiments hold the downstream agent fixed to isolate the skill-management policy while testing across three distinct task distributions. Full cross-agent transfer experiments would require new model pairings and substantial additional compute. We will add a limitations paragraph discussing this scope choice and the implications for generality. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

full rationale

The paper describes an RL-trained management policy for skill extraction and maintenance, evaluated via pass rates on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 against no-skill and prompt/memory baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce the reported gains (9.69 / 4.01) to the training inputs by construction. The hybrid reward and downstream execution feedback are distinct from the final benchmark metric, and the central claim remains independent of any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the hybrid reward produces transferable skills.

pith-pipeline@v0.9.1-grok · 5738 in / 1172 out tokens · 35440 ms · 2026-06-29T22:01:02.702987+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 4 canonical work pages · 3 internal anchors

[1]

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

Agent skills: A data-driven analysis of claude skills for extending large language model functional- ity.Preprint, arXiv:2602.08004. Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. 2026. SkillClaw: Let skills evolve collectively with agentic evolver.Preprint, arXiv:2604.08377. Mike A. Merrill, Alexa...

work page arXiv 2026
[2]

InAdvances in Neural Infor- mation Processing Systems, volume 36, pages 8634– 8652

Reflexion: Language agents with verbal re- inforcement learning. InAdvances in Neural Infor- mation Processing Systems, volume 36, pages 8634– 8652. SWE-agent Team. 2025. mini-SWE-agent: A 100-line software engineering agent. https://github.com/ SWE-agent/mini-swe-agent. Software repository. Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Yanjun Sha...

2025
[3]

Dynamic Dual-Granularity Skill Bank for Agentic RL

AGENT KB: A hierarchical memory frame- work for cross-domain agentic problem solving. In ICML 2025 Workshop on Collaborative and Feder- ated Agentic Workflows. Oral. Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao. 2026. Dynamic dual-granularity skill bank for agentic RL.Preprint, arXiv:2603.28716. Guanz...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

SWE-smith: Scaling Data for Software Engineering Agents

SWE-smith: Scaling data for software engi- neering agents.arXiv preprint arXiv:2504.21798. Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yux- uan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. 2026. AutoSkill: Experience-driven lifelong learning via skill self- evolution.Preprint, arXiv:2603.01145. Shunyu Yao, Jeffrey Zhao, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

SWE Context Bench: A Benchmark for Context Learning in Coding

SWE Context Bench: A benchmark for con- text learning in coding.Preprint, arXiv:2602.08316. A Training Details This appendix provides additional details on the data construction and training configuration of CODESKILL. We organize the discussion by train- ing stage. Supervised fine-tuning first teaches the model the action schema and basic skill- manageme...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

It must be broader than a single local event

A bootstrap skill is a task-level reusable pattern supported by multiple trajectories. It must be broader than a single local event
[7]

Keep when_to_apply high-level and write transferable, actionablerules

Generate only if the pattern is reusable across similar tasks. Keep when_to_apply high-level and write transferable, actionablerules
[8]

Do not invent unsupported steps, checks, or guidance

Ground every part of the skill in repeated evidence from the trajectories and outcomes. Do not invent unsupported steps, checks, or guidance. If failed trajectories reveal reusable cautions, include them as cautionary rules
[9]

Skip if the evidence is weak, contradictory, accidental, too local, or collapses into an event-level reaction instead of a task-level pattern
[10]

action":

Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals. Output Schema generate { "action": "generate", "skill": { "title": "short reusable skill name", "granularity": "general", "when_to_apply": "high-level task situation where this skill shoul...
[11]

It must focus on one important event inside the trajectory and stay narrower than a whole-task workflow

An event-driven skill is a local trigger-response pattern. It must focus on one important event inside the trajectory and stay narrower than a whole-task workflow
[12]

Keep when_to_apply transferable and write local actionablerules

Generate only if the event yields a reusable lesson beyond this exact task. Keep when_to_apply transferable and write local actionablerules
[13]

Do not invent an event that is not clearly present

Ground the trigger and guidance in the trajectory and result. Do not invent an event that is not clearly present. Failure- derived cautions are valid if they are clearly supported
[14]

If multiple candidate events exist, choose the single most reusable one
[15]

Skip if there is no strong reusable local event, or if the lesson is too task-specific or workflow-level
[16]

action":

Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals. Output Schema generate { "action": "generate", "skill": { "title": "short reusable skill name", "granularity": "event-driven", "when_to_apply": "high-level local signal or situation where ...
[17]

If multiple skills are aligned and all are plausible revision targets, choose the single one that is most worth revising based on the strength and reusability of the evidence

Alignment and need for revision.Choose evolve only when one existing skill is clearly aligned with the current trajectory context and trigger pattern, and the current trajectory result shows that this aligned skill should be revised. If multiple skills are aligned and all are plausible revision targets, choose the single one that is most worth revising ba...
[18]

rules should contain the revised reusable core of the skill as actionable guidance

Reusability and rule writing.Revise the skill only if the trajectory reveals a reusable missing case, missing check, better decision rule, improved ordering, or reusable caution. rules should contain the revised reusable core of the skill as actionable guidance. 3.Applicability.when_to_applyshould remain at a high transferable level rather than concrete task text
[19]

Do not add unsupported refinements

Grounding.The revision must be directly supported by the provided trajectory and result. Do not add unsupported refinements. 5.Identity preservation.Keep the same capability identity. Do not drift into a different skill
[20]

When to skip.Choose skip if the evidence is weak, contradictory, too local, or if the trajectory is better described as a brand-new skill rather than a revision
[21]

action":

Do not include task-specific details.Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals that only make sense for one instance. Output Schema evolve { "action": "evolve", "target_skill_id": "id of the single skill you chose to revise", "reas...
[22]

The merged skill should have clearer applicability and cleaner rules with less duplication

Merge.Choose merge when the candidate and one retrieved skill share the same capability identity and can be combined into one stronger reusable skill. The merged skill should have clearer applicability and cleaner rules with less duplication
[23]

Drop.Choose drop when the candidate is already covered by stronger retrieved skills, redundant, weakly evidenced, unsafe, too local, or too task-specific
[24]

action":

Do not include task-specific details.No resulting skill may include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals that only make sense for one instance. Output Schema add { "action": "add", "reason": "short reason" } drop { "action": "drop", "reason"...

[1] [1]

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu

Agent skills: A data-driven analysis of claude skills for extending large language model functional- ity.Preprint, arXiv:2602.08004. Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. 2026. SkillClaw: Let skills evolve collectively with agentic evolver.Preprint, arXiv:2604.08377. Mike A. Merrill, Alexa...

work page arXiv 2026

[2] [2]

InAdvances in Neural Infor- mation Processing Systems, volume 36, pages 8634– 8652

Reflexion: Language agents with verbal re- inforcement learning. InAdvances in Neural Infor- mation Processing Systems, volume 36, pages 8634– 8652. SWE-agent Team. 2025. mini-SWE-agent: A 100-line software engineering agent. https://github.com/ SWE-agent/mini-swe-agent. Software repository. Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Yanjun Sha...

2025

[3] [3]

Dynamic Dual-Granularity Skill Bank for Agentic RL

AGENT KB: A hierarchical memory frame- work for cross-domain agentic problem solving. In ICML 2025 Workshop on Collaborative and Feder- ated Agentic Workflows. Oral. Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao. 2026. Dynamic dual-granularity skill bank for agentic RL.Preprint, arXiv:2603.28716. Guanz...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

SWE-smith: Scaling Data for Software Engineering Agents

SWE-smith: Scaling data for software engi- neering agents.arXiv preprint arXiv:2504.21798. Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yux- uan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. 2026. AutoSkill: Experience-driven lifelong learning via skill self- evolution.Preprint, arXiv:2603.01145. Shunyu Yao, Jeffrey Zhao, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

SWE Context Bench: A Benchmark for Context Learning in Coding

SWE Context Bench: A benchmark for con- text learning in coding.Preprint, arXiv:2602.08316. A Training Details This appendix provides additional details on the data construction and training configuration of CODESKILL. We organize the discussion by train- ing stage. Supervised fine-tuning first teaches the model the action schema and basic skill- manageme...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

It must be broader than a single local event

A bootstrap skill is a task-level reusable pattern supported by multiple trajectories. It must be broader than a single local event

[7] [7]

Keep when_to_apply high-level and write transferable, actionablerules

Generate only if the pattern is reusable across similar tasks. Keep when_to_apply high-level and write transferable, actionablerules

[8] [8]

Do not invent unsupported steps, checks, or guidance

Ground every part of the skill in repeated evidence from the trajectories and outcomes. Do not invent unsupported steps, checks, or guidance. If failed trajectories reveal reusable cautions, include them as cautionary rules

[9] [9]

Skip if the evidence is weak, contradictory, accidental, too local, or collapses into an event-level reaction instead of a task-level pattern

[10] [10]

action":

Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals. Output Schema generate { "action": "generate", "skill": { "title": "short reusable skill name", "granularity": "general", "when_to_apply": "high-level task situation where this skill shoul...

[11] [11]

It must focus on one important event inside the trajectory and stay narrower than a whole-task workflow

An event-driven skill is a local trigger-response pattern. It must focus on one important event inside the trajectory and stay narrower than a whole-task workflow

[12] [12]

Keep when_to_apply transferable and write local actionablerules

Generate only if the event yields a reusable lesson beyond this exact task. Keep when_to_apply transferable and write local actionablerules

[13] [13]

Do not invent an event that is not clearly present

Ground the trigger and guidance in the trajectory and result. Do not invent an event that is not clearly present. Failure- derived cautions are valid if they are clearly supported

[14] [14]

If multiple candidate events exist, choose the single most reusable one

[15] [15]

Skip if there is no strong reusable local event, or if the lesson is too task-specific or workflow-level

[16] [16]

action":

Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals. Output Schema generate { "action": "generate", "skill": { "title": "short reusable skill name", "granularity": "event-driven", "when_to_apply": "high-level local signal or situation where ...

[17] [17]

If multiple skills are aligned and all are plausible revision targets, choose the single one that is most worth revising based on the strength and reusability of the evidence

Alignment and need for revision.Choose evolve only when one existing skill is clearly aligned with the current trajectory context and trigger pattern, and the current trajectory result shows that this aligned skill should be revised. If multiple skills are aligned and all are plausible revision targets, choose the single one that is most worth revising ba...

[18] [18]

rules should contain the revised reusable core of the skill as actionable guidance

Reusability and rule writing.Revise the skill only if the trajectory reveals a reusable missing case, missing check, better decision rule, improved ordering, or reusable caution. rules should contain the revised reusable core of the skill as actionable guidance. 3.Applicability.when_to_applyshould remain at a high transferable level rather than concrete task text

[19] [19]

Do not add unsupported refinements

Grounding.The revision must be directly supported by the provided trajectory and result. Do not add unsupported refinements. 5.Identity preservation.Keep the same capability identity. Do not drift into a different skill

[20] [20]

When to skip.Choose skip if the evidence is weak, contradictory, too local, or if the trajectory is better described as a brand-new skill rather than a revision

[21] [21]

action":

Do not include task-specific details.Do not include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals that only make sense for one instance. Output Schema evolve { "action": "evolve", "target_skill_id": "id of the single skill you chose to revise", "reas...

[22] [22]

The merged skill should have clearer applicability and cleaner rules with less duplication

Merge.Choose merge when the candidate and one retrieved skill share the same capability identity and can be combined into one stronger reusable skill. The merged skill should have clearer applicability and cleaner rules with less duplication

[23] [23]

Drop.Choose drop when the candidate is already covered by stronger retrieved skills, redundant, weakly evidenced, unsafe, too local, or too task-specific

[24] [24]

action":

Do not include task-specific details.No resulting skill may include repository names, issue descriptions, exact task goals, variable names, function names, class names, module names, exact file paths, or one-off literals that only make sense for one instance. Output Schema add { "action": "add", "reason": "short reason" } drop { "action": "drop", "reason"...