SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Alina Geiger; Dominik Sobania; Federica Sarro; Jie M. Zhang; Jingzhi Gong; Lukas Twist; Ruizhen Gu; Shuo Han; Yazhuo Cao; Zhiwei Fei

arxiv: 2604.09297 · v2 · pith:JFWDTT5Nnew · submitted 2026-04-10 · 💻 cs.SE · cs.AI

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, Jie M. Zhang

Pith reviewed 2026-05-21 09:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: agent skills · multi-objective optimization · software engineering · skill bundles · Pareto selection · evolutionary search · coding agents · inference cost

The pith

Treating skill bundles for coding agents as multi-objective search problems yields configurations with higher pass rates and lower costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current practice treats agent skills for software engineering tasks as fixed, or tunes them on success rate alone, which can raise token costs or introduce misleading guidance. The paper instead frames skill bundles as objects to improve through an evolutionary process that proposes edits and keeps the candidates with the best trade-offs between pass rate and inference cost. On the 16 SkillsBench SE tasks, this search produces bundles that take the top pass-rate rank on 11 of the 12 tasks with a non-zero pass rate, cut costs by up to 31.7 percent relative to static bundles, and lift pass rate by up to 21 percentage points. A reader would care because it suggests that many agents run on skill setups that leave measurable performance and efficiency gains on the table. The analysis of which edits worked points to pruning and substitution, that is, removing or swapping skills, as the operations that most often refine a bundle.

Core claim

SkillMOO evolves skill bundles through LLM-proposed edits and NSGA-II Pareto selection on pass rate and inference cost. Evaluated across all 16 SkillsBench SE tasks, SkillMOO achieves the top pass rate rank on 11 of 12 non-zero-pass tasks while achieving cost reductions of up to 31.7% over static bundles, with pass rate gains up to 21 percentage points. Analysis of 38 skill edits shows that pruning and substitution dominate successful operations.
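
To make the selection criterion concrete, here is a minimal sketch of Pareto dominance over the two objectives named above. It is not the paper's implementation: the `Candidate` type, the bundle labels, and all numbers are hypothetical, and full NSGA-II additionally ranks successive fronts and uses crowding distance, which this sketch leaves out.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str         # hypothetical bundle label
    pass_rate: float  # fraction of trials passed (higher is better)
    cost: float       # inference cost in arbitrary units (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost <= b.cost
    strictly_better = a.pass_rate > b.pass_rate or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in population if not any(dominates(o, c) for o in population)]


# Made-up values, for illustration only.
population = [
    Candidate("static", 0.50, 1.00),
    Candidate("pruned", 0.60, 0.70),
    Candidate("bloated", 0.55, 1.40),
    Candidate("cheap", 0.40, 0.50),
]
front = pareto_front(population)
front_names = sorted(c.name for c in front)
print(front_names)
```

In this toy population, "pruned" dominates both "static" and "bloated", while "cheap" survives because nothing else is at least as cheap. The front therefore keeps a trade-off rather than a single winner.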

What carries the argument

SkillMOO, a framework that treats skill bundles as multi-objective search objects and evolves them via LLM-proposed edits followed by Pareto selection on pass rate versus inference cost.

If this is right

Skill bundles exist that improve both task success and computational cost relative to current static or single-objective designs.
Pruning unhelpful skills and substituting stronger ones are the main operations that produce better bundles.
Deploying agent skills without joint cost validation leaves superior configurations undiscovered.
A search-based process can replace manual or pass-rate-only skill engineering in agent systems.
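
The two edit categories named in the list above can be illustrated with a toy model in which a bundle is just an ordered tuple of skill names. This is an assumption made for illustration: in the paper the edits are proposed by an LLM, and the functions `prune` and `substitute` and every skill name below are hypothetical.

```python
Bundle = tuple[str, ...]


def prune(bundle: Bundle, skill: str) -> Bundle:
    """Return a copy of the bundle without the named skill."""
    return tuple(s for s in bundle if s != skill)


def substitute(bundle: Bundle, old: str, new: str) -> Bundle:
    """Return a copy of the bundle with one skill swapped for another, keeping its position."""
    return tuple(new if s == old else s for s in bundle)


# Hypothetical skill names, for illustration only.
baseline: Bundle = ("git-basics", "verbose-style-guide", "pytest-howto")
pruned = prune(baseline, "verbose-style-guide")
swapped = substitute(baseline, "pytest-howto", "pytest-quickref")
print(pruned)
print(swapped)
```

Both operations return new tuples, so each candidate can be evaluated on pass rate and cost without mutating the bundle it was derived from.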

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-objective search could be used to refine skills for agents in domains other than software engineering.
Periodic re-optimization of bundles might help agents adapt when new task types or cost constraints appear.
The pruning and substitution principles could be turned into reusable guidelines for human skill designers.

Load-bearing premise

The 16 benchmark tasks together with pass rate and inference cost are sufficient stand-ins for real-world software engineering agent performance and expense.

What would settle it

Apply the evolved skill bundles to a fresh collection of real developer coding tasks drawn from outside the benchmark and measure whether the reported pass-rate gains and cost reductions still hold.

Figures

Figures reproduced from arXiv: 2604.09297 by Alina Geiger, Dominik Sobania, Federica Sarro, Jie M. Zhang, Jingzhi Gong, Lukas Twist, Ruizhen Gu, Shuo Han, Yazhuo Cao, Zhiwei Fei.

**Figure 1.** SkillMOO workflow: solver-optimizer loop with evolving skill bundles. This motivates search-based, data-driven skill optimization, and thereby we propose SkillMOO, a multi-objective optimization (MOO) framework that automatically evolves task-specific skill bundles for SE tasks using LLM-proposed edits and NSGA-II survivor selection on pass rate and cost: a task solver agent evaluates candidate skill bund…

Original abstract

Agent skills are increasingly used to configure coding agents for software engineering (SE) tasks, yet current practice treats them as static, hand-crafted assets, or evolved on pass rate alone. This is insufficient: a skill can improve task success while substantially raising token cost, or introducing misleading guidance. We argue that SE agent skill bundles can be treated as multi-objective search objects and present SkillMOO, a framework that evolves skill bundles through LLM-proposed edits and NSGA-II Pareto selection on pass rate and inference cost. Evaluated across all 16 SkillsBench SE tasks, SkillMOO achieves the top pass rate rank on 11 of 12 non-zero-pass tasks while achieving cost reductions of up to 31.7% over static bundles, with pass rate gains up to 21 percentage points. Analysis of 38 skill edits shows that pruning and substitution dominate successful operations, offering actionable principles for skill bundle design. Thereby, the current practice of deploying skills without cost-aware validation leaves better skill configurations unexplored, motivating a new class of cost-aware, search-based skill engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SkillMOO combines LLM-proposed edits with NSGA-II to optimize skill bundles on both pass rate and token cost, producing clear gains on SkillsBench, but the same 16 tasks serve for both search and final measurement.

The main point is that this paper gives a concrete recipe for evolving skill bundles for coding agents: an LLM suggests edits, NSGA-II keeps the Pareto front on pass rate versus inference cost, and the result beats static bundles on the benchmark. The edit analysis is the part that feels most useful, showing that pruning and substitution drive most of the wins rather than additions or rewrites. That kind of breakdown can actually guide manual tuning even if someone skips the full optimizer. The reported numbers are straightforward to understand: top pass-rate rank on 11 of 12 tasks with non-zero success and cost drops up to 31.7 percent. The method itself is simple enough that a team could re-implement the loop without much trouble.

The soft spot is exactly the one the stress-test flagged. Optimization and evaluation both run on the full set of 16 SkillsBench tasks with no held-out portion or separate generalization suite mentioned. Any improvement is therefore measured on the distribution the search already saw, which leaves open how much is real transfer versus fitting to these particular problems. The abstract also gives no variance numbers or statistical tests, so the size of the gains is hard to judge from the summary alone. If the full paper adds multiple runs or external tasks, that would change the picture.

This work is aimed at people already building or tuning SE agents who care about token budgets. A practitioner could borrow the edit heuristics; a researcher could use the Pareto setup as a baseline for further search methods. It deserves a serious referee because the core loop is reproducible on a public benchmark and the practical angle is clear. I would send it to review and expect the generalization question to be the main point of discussion.

Referee Report

2 major / 2 minor

Summary. The paper introduces SkillMOO, a framework that evolves SE agent skill bundles via LLM-proposed edits followed by NSGA-II Pareto optimization on two objectives: task pass rate and inference cost. Evaluated on all 16 SkillsBench tasks, the method is reported to achieve the highest pass-rate rank on 11 of 12 non-zero tasks, cost reductions up to 31.7% relative to static bundles, and pass-rate gains up to 21 percentage points; an analysis of 38 edits is used to extract design principles such as the dominance of pruning and substitution operations.

Significance. If the empirical claims are shown to generalize beyond the optimization distribution, the work would establish a practical, search-based alternative to hand-crafted or single-objective skill bundles and supply concrete, actionable heuristics for cost-aware skill engineering. The explicit multi-objective framing and the edit-operation taxonomy are the primary contributions that could influence both agent design practice and future benchmark construction.

major comments (2)

[§4] §4 (Evaluation) and abstract: optimization and final reporting both use the identical 16 SkillsBench tasks with no held-out set, cross-validation, or separate generalization suite mentioned. Because the selection criterion (pass rate + cost) is exactly the evaluation metric, the reported top ranks and cost reductions are guaranteed to be measured on the optimization distribution; this directly undermines the claim that SkillMOO discovers transferable skill principles rather than in-sample artifacts.
[Abstract, §4] Abstract and §4: the central quantitative claims (top rank on 11/12 tasks, 31.7% cost reduction, 21 pp pass-rate gain) are stated without any report of statistical tests, standard deviation across runs, number of independent trials, or controls for prompt sensitivity and LLM stochasticity. These omissions make it impossible to assess whether the observed Pareto improvements are reliable or merely noise.
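
The held-out protocol that the first major comment asks for could look roughly like the sketch below. It assumes that what is searched on one subset of tasks is then re-measured on an untouched subset; because the Figure 1 caption describes bundles as task-specific, the analogous split may instead need to be over trials within a task. The function name, task identifiers, holdout fraction, and seed are all hypothetical.

```python
import random


def split_tasks(task_ids, holdout_fraction=0.25, seed=0):
    """Shuffle task ids reproducibly and reserve a fraction that the search never sees."""
    rng = random.Random(seed)
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    heldout = sorted(shuffled[:n_holdout])
    search = sorted(shuffled[n_holdout:])
    return search, heldout


# Placeholder ids standing in for the 16 benchmark tasks.
task_ids = [f"task-{i:02d}" for i in range(1, 17)]
search_tasks, heldout_tasks = split_tasks(task_ids)
print(len(search_tasks), "search tasks;", len(heldout_tasks), "held-out tasks")
```

Selection would use only `search_tasks`; pass rate and cost on `heldout_tasks` would then be reported once, after the search has finished.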

minor comments (2)

[§3] The description of the NSGA-II implementation (population size, number of generations, mutation/crossover rates) is not detailed enough to allow reproduction; a table or pseudocode block would help.
[Figures 2-3] Figure captions and axis labels for the Pareto fronts should explicitly state the number of runs and whether shaded regions represent standard error or min/max.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging the validity of the concerns raised and proposing targeted revisions to improve the manuscript.

Referee: [§4] §4 (Evaluation) and abstract: optimization and final reporting both use the identical 16 SkillsBench tasks with no held-out set, cross-validation, or separate generalization suite mentioned. Because the selection criterion (pass rate + cost) is exactly the evaluation metric, the reported top ranks and cost reductions are guaranteed to be measured on the optimization distribution; this directly undermines the claim that SkillMOO discovers transferable skill principles rather than in-sample artifacts.

Authors: We agree that optimization and final evaluation occur on the identical 16 SkillsBench tasks, so the reported ranks and cost reductions are measured on the optimization distribution. This limits strong claims of broad transferability. The core contribution remains the multi-objective framework and the empirical taxonomy of 38 edits (showing dominance of pruning and substitution). We will revise the abstract and §4 to explicitly note that results are in-sample, moderate language around 'transferable skill principles' to 'observed patterns from successful edits on these tasks,' and add a limitations paragraph plus future-work discussion on held-out evaluation or cross-validation. revision: partial
Referee: [Abstract, §4] Abstract and §4: the central quantitative claims (top rank on 11/12 tasks, 31.7% cost reduction, 21 pp pass-rate gain) are stated without any report of statistical tests, standard deviation across runs, number of independent trials, or controls for prompt sensitivity and LLM stochasticity. These omissions make it impossible to assess whether the observed Pareto improvements are reliable or merely noise.

Authors: We acknowledge that the current manuscript presents results from single optimization runs without reporting variance, number of trials, or statistical controls. In the revised version we will conduct additional independent runs (minimum three per task) and report means with standard deviations for pass rate and cost. We will also add a short discussion of controls for LLM stochasticity (fixed temperature, seed settings, and prompt-variation averaging) and include basic statistical comparisons (e.g., paired tests or confidence intervals) for the key improvements. revision: yes
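
The reporting promised in this response (means with standard deviations over at least three runs) reduces to a few lines of arithmetic. The sketch below uses made-up numbers, not results from the paper, and the helper names are hypothetical; a paired test or confidence interval would be built on the same per-run differences.

```python
from statistics import mean, stdev

# Hypothetical pass rates from three independent runs on one task.
static_runs = [0.50, 0.55, 0.45]
evolved_runs = [0.70, 0.65, 0.75]


def summarize(runs):
    """Mean and sample standard deviation across independent runs."""
    return mean(runs), stdev(runs)


def paired_gain_pp(baseline, candidate):
    """Mean and sample SD of the per-run gain, in percentage points."""
    diffs = [100 * (c - b) for b, c in zip(baseline, candidate)]
    return mean(diffs), stdev(diffs)


def cost_reduction_pct(baseline_cost, candidate_cost):
    """Relative cost reduction of the candidate versus the baseline, in percent."""
    return 100 * (baseline_cost - candidate_cost) / baseline_cost


static_mean, static_sd = summarize(static_runs)
gain_mean, gain_sd = paired_gain_pp(static_runs, evolved_runs)
print(f"static pass rate: {static_mean:.2f} +/- {static_sd:.2f}")
print(f"gain: {gain_mean:.1f} pp +/- {gain_sd:.1f} pp")
print(f"cost reduction: {cost_reduction_pct(2.0, 1.5):.1f}%")
```

Note the distinction the sketch keeps explicit: pass-rate gains are absolute differences in percentage points, while cost reduction is relative to the baseline cost.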

Circularity Check

0 steps flagged

No circularity: empirical optimization results measured directly on benchmark

The paper presents SkillMOO as an evolutionary framework that applies LLM edits and NSGA-II to optimize skill bundles explicitly on pass-rate and inference-cost objectives computed over the 16 SkillsBench tasks, then reports the achieved ranks and reductions on the identical task set. This constitutes a direct empirical measurement of an optimizer's output rather than any derivation in which a claimed prediction or first-principles result is definitionally equivalent to its inputs. No equations reduce outputs to fitted parameters by construction, no load-bearing self-citations justify uniqueness, and no ansatz or renaming is smuggled in. The lack of a held-out set is a generalization limitation but does not create circularity in the reported derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the standard assumptions of multi-objective evolutionary algorithms and the capability of LLMs to generate useful edits; no new free parameters, axioms beyond domain-standard ones, or invented entities are introduced in the abstract.

axioms (1)

domain assumption: NSGA-II Pareto selection on pass rate and inference cost yields practically useful skill bundles
Invoked by the choice of optimization method and evaluation criteria.

pith-pipeline@v0.9.0 · 5746 in / 1372 out tokens · 53847 ms · 2026-05-21T09:49:38.411243+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

SkillMOO optimizes candidate skill bundles (B) with a bi-objective formulation: min f(b) = [-pass(b), cost(b)] ... NSGA-II survivor selection
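
For readers who want the quoted formulation spelled out, the block below restates it together with the standard Pareto-dominance relation it implies. Only the objective vector comes from the quoted passage; the symbols \prec and B^{*} and the dominance definition are textbook conventions added here, not notation confirmed by the paper.

```latex
% Bi-objective problem as quoted: maximize pass rate (negated) and minimize cost.
\[
\min_{b \in B} \; f(b) = \bigl[\, -\mathrm{pass}(b),\ \mathrm{cost}(b) \,\bigr]
\]

% Standard Pareto dominance for this problem: b dominates b'.
\[
b \prec b' \iff
\mathrm{pass}(b) \ge \mathrm{pass}(b') \ \wedge\
\mathrm{cost}(b) \le \mathrm{cost}(b') \ \wedge\
\bigl( \mathrm{pass}(b) > \mathrm{pass}(b') \ \vee\ \mathrm{cost}(b) < \mathrm{cost}(b') \bigr)
\]

% Pareto-optimal set: bundles that no other bundle dominates.
\[
B^{*} = \{\, b \in B : \not\exists\, b' \in B \ \text{such that}\ b' \prec b \,\}
\]
```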

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
cs.CL · 2026-05 · unverdicted · novelty 5.0

SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Alzubi, S., et al.: Evoskill: Automated skill discovery for multi-agent systems (2026), https://arxiv.org/abs/2603.02766

[2] Anthropic: Introducing agent skills. https://claude.com/blog/skills, published October 16, 2025

[3] Han, T., Zhang, Y., Song, W., Fang, C., Chen, Z., Sun, Y., Hu, L.: Swe-skills-bench: Do agent skills actually help in real-world software engineering? (2026), https://arxiv.org/abs/2603.15401

[4] Li, X., Chen, W., Liu, Y., et al.: Skillsbench: Benchmarking how well agent skills work across diverse tasks (2026), https://arxiv.org/abs/2602.12670

[5] Liu, J., Wang, K., Chen, Y., Peng, X., Chen, Z., Zhang, L., Lou, Y.: Large language model-based agents for software engineering: A survey. ACM Transactions on Software Engineering and Methodology (2024)

[6] Scott, A.J., Knott, M.: A cluster analysis method for grouping means in the analysis of variance. Biometrics pp. 507–512 (1974)

[7] Shihipar, T.: Lessons from building claude code: How we use skills. https://x.com/trq212/status/2033949937936085378, published March 17, 2025

[8] Ye, H., He, X., Arak, V., Dong, H., Song, G.: Meta context engineering via agentic skill evolution (2026), https://arxiv.org/abs/2601.21557

[9] Zeng, A., Lv, X., Hou, Z., Du, Z., et al.: Glm-5: from vibe coding to agentic engineering (2026), https://arxiv.org/abs/2602.15763

[10] Zhang, H., Fan, S., Zou, H.P., Chen, Y., Wang, Z., Zhou, J., Li, C., Huang, W.C., Yao, Y., Zheng, K., Liu, X., Li, X., Yu, P.S.: Evoskills: Self-evolving agent skills via co-evolutionary verification (2026), https://arxiv.org/abs/2604.01687

[11] Zheng, Y., Zhang, Z., Ma, C., Yu, Y., Zhu, J., Wu, Y., Xu, T., Dong, B., Zhu, H., Huang, R., Yu, G.: Skillrouter: Skill routing for llm agents at scale (2026), https://arxiv.org/abs/2603.22455
