Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Chengwei Qin; Hehai Lin; Qi Yang

arxiv: 2606.18837 · v2 · pith:K44Q3IYRnew · submitted 2026-06-17 · 💻 cs.MA · cs.AI · cs.LG

Hehai Lin, Qi Yang, Chengwei Qin

Pith reviewed 2026-06-26 18:44 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.LG

keywords multi-agent systems · meta-skill · large language models · automatic system generation · experience retention · trajectory rollout · contrastive analysis

The pith

Skill-MAS evolves a reusable Meta-Skill for multi-agent LLM systems by distilling strategy principles from task trajectories without parametric updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Skill-MAS as a third path for automatic multi-agent system generation that retains experience separately from model training. It treats high-level orchestration as an evolvable Meta-Skill refined in a closed loop of sampling multiple trajectories per task and then applying selective reflection with hierarchical contrastive analysis on priority tasks. This setup is meant to combine the capability of frontier LLMs with accumulated generalizable strategies. A sympathetic reader would care because it addresses the repeated-search waste of inference-time methods and the capability ceiling of training-time methods.

Core claim

Skill-MAS conceptualizes the high-level orchestration capability as an evolvable Meta-Skill and refines architectural knowledge through a closed optimization loop of Multi-Trajectory Rollout, which samples a behavioral distribution for each task, and Selective Reflection, which adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles.

What carries the argument

The Meta-Skill as high-level orchestration capability, refined via the closed loop of Multi-Trajectory Rollout and Selective Reflection with hierarchical contrastive analysis.
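The closed loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the Meta-Skill is treated as plain text, and `rollout`, `select`, and `reflect` are hypothetical callables standing in for the LLM-driven steps (trajectory sampling, priority-task selection, and contrastive distillation), none of which the abstract specifies in detail.

```python
from typing import Callable, Dict, List

# Hypothetical alias: per-task list of scores, one per sampled trajectory.
Scores = Dict[str, List[float]]


def evolve_meta_skill(
    meta_skill: str,
    tasks: List[str],
    rollout: Callable[[str, str], float],    # (meta_skill, task) -> score
    select: Callable[[Scores], List[str]],   # scores -> priority task ids
    reflect: Callable[[str, Scores], str],   # (meta_skill, evidence) -> new skill
    k: int = 3,
    rounds: int = 2,
) -> str:
    """Refine a text Meta-Skill without touching any model parameters."""
    for _ in range(rounds):
        # Multi-Trajectory Rollout: sample k trajectories per task
        # under the current Meta-Skill.
        scores = {t: [rollout(meta_skill, t) for _ in range(k)] for t in tasks}
        # Selective Reflection: keep only the priority tasks, then distill
        # their evidence into an updated Meta-Skill.
        priority = select(scores)
        meta_skill = reflect(meta_skill, {t: scores[t] for t in priority})
    return meta_skill
```

Because the only state carried between rounds is the Meta-Skill text, the same loop could in principle wrap any frozen LLM, which is the decoupling the paper argues for.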

If this is right

Automatic MAS generation can achieve performance gains on complex benchmarks while using frontier LLMs without gradient updates.
The method maintains a favorable cost-performance trade-off by avoiding repeated identical searches and large-scale training.
Evolved Meta-Skills exhibit robustness and strong transferability across unseen tasks and different LLMs.
Experience retention is decoupled from parametric updates, allowing scaling to large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support continual adaptation of agent orchestration rules across entirely new domains without retraining base models.
If the distillation step generalizes, similar rollout-plus-reflection loops might apply to other LLM orchestration problems such as tool-use planning.
Transferability across LLMs suggests the Meta-Skill captures structural patterns that are somewhat model-agnostic.

Load-bearing premise

Hierarchical contrastive analysis on selectively chosen tasks can reliably distill generalizable strategy-level principles rather than task-specific patterns or noise.

What would settle it

The claim would be undermined if the evolved Meta-Skill produced no performance gain over its initial version, or lost that gain, when applied to unseen benchmarks or switched to a different LLM.
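One hedged way to make this check concrete is to compare the evolved Meta-Skill against the initial one on every held-out (LLM, benchmark) setting and list the settings where no gain appears. The function names, the `min_gain` threshold, and the setting labels below are assumptions for illustration; any scores passed in would have to come from real evaluation runs.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (llm_name, benchmark_name) for a held-out setting.
Setting = Tuple[str, str]


def transfer_gains(
    init_scores: Dict[Setting, float],
    evolved_scores: Dict[Setting, float],
) -> Dict[Setting, float]:
    """Gain of the evolved Meta-Skill over the initial one, per setting."""
    return {s: evolved_scores[s] - init_scores[s] for s in init_scores}


def settings_without_gain(
    gains: Dict[Setting, float], min_gain: float = 0.0
) -> List[Setting]:
    """Settings where the evolved Meta-Skill fails to beat the baseline."""
    return [s for s, g in gains.items() if g <= min_gain]
```

A non-empty result from `settings_without_gain` on genuinely unseen settings would count against the transferability claim; an empty one would be consistent with it but would not by itself rule out benchmark-specific tuning.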

Figures

Figures reproduced from arXiv: 2606.18837 by Chengwei Qin, Hehai Lin, Qi Yang.

**Figure 2.** The evolutionary loop of Skill-MAS. The Meta-Skill ...

**Figure 3.** Left: Skill transferability heatmap across LLMs (DS: DeepSeek-V4-Flash, GPT: GPT-5.4-Nano) and tasks (BCP: BrowseComp-Plus, VITA: VitaBench). Right: Performance scaling across increasing multi-trajectory rollout numbers (K = 3, 5, 7). ... scores and the performance gains (∆) over “Skill-MAS-init”, while ...

**Figure 4.** Meta-Skill Evolution on BrowseComp-Plus.

**Figure 5.** Illustration of the initial Meta-Skill used for Skill-MAS-init and Skill-MAS evolution.

**Figure 6.** Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 1/3).

**Figure 7.** Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 2/3).

**Figure 8.** Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 3/3).

**Figure 9.** Illustration of the optimized Meta-Skill for HLE-MATH (DeepSeek-V4-Flash, Part 1/2).

**Figure 10.** Illustration of the optimized Meta-Skill for HLE-MATH (DeepSeek-V4-Flash, Part 2/2).

**Figure 11.** Illustration of the optimized Meta-Skill for BrowseComp-Plus (DeepSeek-V4-Flash, Part 1/2).

**Figure 12.** Illustration of the optimized Meta-Skill for BrowseComp-Plus (DeepSeek-V4-Flash, Part 2/2).

**Figure 13.** Illustration of the optimized Meta-Skill for VitaBench (DeepSeek-V4-Flash, Part 1/2).

**Figure 14.** Illustration of the optimized Meta-Skill for VitaBench (DeepSeek-V4-Flash, Part 2/2).

**Figure 15.** LLM-as-a-judge prompts used in DeepResearchBench.

**Figure 16.** LLM-as-a-judge prompts used in VitaBench.

**Figure 17.** MAS build contract used in the three-stage Skill-MAS construction pipeline (Part 1/2).

**Figure 18.** MAS build contract used in the three-stage Skill-MAS construction pipeline (Part 2/2).

**Figure 19.** Within-task reflection prompt in Skill-MAS evolution.

**Figure 20.** Cross-task reflection prompt in Skill-MAS evolution.

**Figure 21.** Skill optimization prompt for Skill-MAS evolution.

Original abstract

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Skill-MAS sketches a third route for LLM multi-agent orchestration by evolving a meta-skill outside gradients, but the abstract leaves the contrastive distillation step too underspecified to judge whether it produces transferable rules or benchmark artifacts.

The paper's main move is to treat high-level MAS orchestration as an evolvable meta-skill that sits on top of a frozen frontier LLM. Experience gets retained through a closed loop of multi-trajectory rollout followed by selective reflection, rather than through parameter updates or repeated search. That framing is the clearest new element: it explicitly decouples retention from training while still claiming to improve over pure inference-time methods.

The work does a reasonable job stating the capability-experience dilemma and then running the same setup across four benchmarks and four different LLMs, with some checks for transfer to unseen tasks. Those choices at least show an attempt to test generality rather than single-benchmark tuning.

The soft spot is the selective reflection step itself. The abstract says hierarchical contrastive analysis distills "generalizable, strategy-level principles," but gives no account of how contrastive pairs are formed, what the hierarchy consists of, or how priority tasks are chosen so that the output is not just amplified patterns from the four chosen benchmarks. Without that, the robustness and cross-LLM transfer claims rest on an unexamined assumption. No numbers, baselines, or ablation details appear either, so the "remarkable performance gains" and cost trade-off cannot be assessed from what is here.

This is for people already working on automatic MAS construction who want to see whether experience can be captured at the strategy level without retraining. A reader who needs a concrete mechanism or reproducible results will have to wait for the full methods section.

I would send it to review. The conceptual split is worth testing even if the current write-up is thin on the mechanics.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Skill-MAS, a third path for LLM-based automatic multi-agent system generation that evolves a Meta-Skill to retain orchestration experience without parametric updates. It employs a closed loop consisting of (1) Multi-Trajectory Rollout to sample behavioral distributions under the current Meta-Skill and (2) Selective Reflection that adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable strategy-level principles. Experiments across four complex benchmarks and four LLMs are claimed to demonstrate remarkable performance gains, a favorable cost-performance trade-off, robustness, and strong transferability to unseen tasks and different LLMs.

Significance. If the empirical claims hold, the work offers a conceptually appealing bridge between inference-time MAS (which cannot retain experience) and training-time MAS (which are limited by model scale). The decoupling of experience retention from gradient updates via an evolvable Meta-Skill could enable scalable, high-capability automatic MAS. The closed optimization loop and emphasis on hierarchical contrastive distillation represent a novel framing, though the significance hinges on whether the distilled principles are demonstrably general rather than benchmark-specific.

major comments (2)

[§3.2] §3.2 (Selective Reflection): the description of hierarchical contrastive analysis does not specify how contrastive pairs are constructed, how hierarchy levels are defined, or the precise selection criteria for priority tasks. Without these details it is impossible to evaluate whether the procedure reliably extracts transferable orchestration strategies or instead amplifies task idiosyncrasies from the four benchmarks; this mechanism is load-bearing for the robustness and cross-task/cross-LLM transferability claims.
[Experiments] Experiments section (transferability results): the reported strong transferability to unseen tasks is presented without explicit controls that isolate the contribution of the evolved Meta-Skill from possible memorization of benchmark patterns. A direct comparison against a baseline that applies task-specific heuristics distilled from the same rollouts would be required to substantiate that the output constitutes generalizable strategy-level principles rather than benchmark-tuned heuristics.

minor comments (2)

[Abstract] The abstract states performance gains and cost trade-offs but does not name the four benchmarks or the four LLMs; adding these identifiers would improve reproducibility.
[§3] Notation for the Meta-Skill and the contrastive loss (if any) should be introduced consistently in §3 and reused in the experimental tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas for improved clarity and rigor. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the presentation of the Selective Reflection mechanism and the transferability analysis.

Referee: [§3.2] §3.2 (Selective Reflection): the description of hierarchical contrastive analysis does not specify how contrastive pairs are constructed, how hierarchy levels are defined, or the precise selection criteria for priority tasks. Without these details it is impossible to evaluate whether the procedure reliably extracts transferable orchestration strategies or instead amplifies task idiosyncrasies from the four benchmarks; this mechanism is load-bearing for the robustness and cross-task/cross-LLM transferability claims.

Authors: We agree that the current description in §3.2 is high-level and would benefit from explicit specifications to allow readers to assess the mechanism's ability to produce generalizable principles. In the revised manuscript we will expand this section to detail: contrastive pairs are formed from trajectories sampled under the same Meta-Skill that differ substantially in end-to-end task success; hierarchy levels are organized as task-specific orchestration patterns, agent-role coordination rules, and system-wide workflow abstractions; and priority tasks are chosen by ranking tasks according to performance variance across the multi-trajectory rollout combined with a diversity score that favors tasks exposing systemic rather than idiosyncratic failures. These additions will directly address concerns about benchmark idiosyncrasies versus transferable strategy-level principles.

revision: yes

Referee: [Experiments] Experiments section (transferability results): the reported strong transferability to unseen tasks is presented without explicit controls that isolate the contribution of the evolved Meta-Skill from possible memorization of benchmark patterns. A direct comparison against a baseline that applies task-specific heuristics distilled from the same rollouts would be required to substantiate that the output constitutes generalizable strategy-level principles rather than benchmark-tuned heuristics.

Authors: We acknowledge that the transferability results, while showing gains on unseen tasks and across LLMs, would be more convincing with an explicit control isolating the Meta-Skill from potential benchmark-specific memorization. In the revision we will add a new baseline experiment that distills task-specific heuristics directly from the identical multi-trajectory rollouts (without the selective reflection and hierarchical contrastive steps) and compares its transfer performance against the full Skill-MAS Meta-Skill. This comparison will provide evidence that the evolved Meta-Skill captures generalizable orchestration strategies beyond task-tuned heuristics.

revision: yes
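The pairing and prioritization rules named in the first response can be sketched as follows. This is only an illustration of the stated idea: the `min_gap` threshold, the linear blend with weight `alpha`, and the per-task `diversity` scores are assumptions, since neither the abstract nor the simulated rebuttal gives concrete formulas.

```python
from itertools import combinations
from statistics import pvariance
from typing import Dict, List, Tuple


def contrastive_pairs(
    scores: List[float], min_gap: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (high_idx, low_idx) pairs of trajectories from one task
    whose end-to-end scores differ by at least min_gap."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if abs(scores[i] - scores[j]) >= min_gap:
            high, low = (i, j) if scores[i] > scores[j] else (j, i)
            pairs.append((high, low))
    return pairs


def rank_priority(
    task_scores: Dict[str, List[float]],
    diversity: Dict[str, float],
    alpha: float = 0.5,
) -> List[str]:
    """Rank tasks by a blend of rollout score variance and a diversity score
    (highest priority first). The blend itself is an assumption."""
    def priority(task: str) -> float:
        return alpha * pvariance(task_scores[task]) + (1 - alpha) * diversity[task]

    return sorted(task_scores, key=priority, reverse=True)
```

A task whose rollouts all score the same yields no contrastive pairs and zero variance, so under this sketch it would be deprioritized unless its diversity score is high.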

Circularity Check

0 steps flagged

No circularity in derivation chain; method is empirical and self-contained

The paper describes Skill-MAS as an iterative loop of Multi-Trajectory Rollout followed by Selective Reflection via hierarchical contrastive analysis to evolve a Meta-Skill. No equations, fitted parameters, predictions, or first-principles derivations are presented that could reduce to inputs by construction. Claims of robustness and transferability rest on external benchmark experiments across four tasks and four LLMs, not on any self-referential fitting or self-citation chain. No self-definitional steps, ansatz smuggling, or renaming of known results appear. The derivation is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level concept of Meta-Skill itself.

invented entities (1)

Meta-Skill (no independent evidence)
purpose: High-level orchestration capability treated as evolvable without model updates
Introduced in the abstract as the central new object that is refined through rollout and reflection.

2023

[1] [1]

Yu Li, Rui Miao, Zhengling Qi, and Tian Lan

A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9. Yu Li, Rui Miao, Zhengling Qi, and Tian Lan. 2026. Arise: Agent reasoning with intrinsic skill evolution in hierarchical reinforcement learning. arXiv preprint arXiv:2603.16060. Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yan...

arXiv 2026

[2] [2]

Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, and Yong Yu

Trace2skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158. Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, and Yong Yu. 2026. Skillmas: Skill co-evolution with llm-based multi-agent system. arXiv preprint arXiv:2605.09341. Long Phan, Alice Gatti, Ziwe...

Pith/arXiv arXiv 2026

[3] [3]

Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, and Yufei Guo

Skill-r1: Agent skill evolution via reinforcement learning. arXiv preprint arXiv:2605.09359. Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, and Yufei Guo. 2025a. Mas 2: Self-generative, self-configuring, self-rectifying multi-agent systems. arXiv preprint arXiv:2509.24323. Qian Wang, Tianyu Wang, Zhenheng ...

Pith/arXiv arXiv 2025

[4] [4]

Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, and Dinesh Manocha

Furina: A fully customizable role-playing benchmark via scalable multi-agent collaboration pipeline. arXiv preprint arXiv:2510.06800. Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, and Dinesh Manocha. 2026. Co-evolving llm decision and skill bank agents for long-horizon tasks. arXiv preprint arXiv:2604...

Pith/arXiv arXiv 2026

[5] [5]

generate-once-and-deploy

Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234. Feng Xiong, Zengbin Wang, Yong Wang, Xuecai Hu, Jinghan He, Liang Lin, Yuan Liu, and Xiangxiang Chu. 2026. Ace-skill: Bootstrapping multimodal agents with prioritized and clustered evolution. arXiv preprint arXiv:2605.08887. Fengli Xu, Qianyue Ha...

Pith/arXiv arXiv 2026

[6] [6]

- Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task

Task Decomposition Module (The "What") Core Objective: Analyze the user query and break it down into a logical blueprint. - Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task. - Sub-task Breakdown: Decompose the high-level request into a set of discrete, manageable, and logically cohe...

[7] [7]

- Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task

Agent Engineering Module (The "Who") Core Objective: Design specialized sub-agents tailored for the sub-tasks defined in Stage 1. - Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task. - Instruction Design: Draft precise system prompts/instructions. Define the agent’s specific goals, behavioral boun...

[8] [8]

Workflow & Orchestration Module (The "How") Core Objective: Wire the distinct agents from Stage 2 into a functional, executable Multi-Agent System (MAS). - Architectural Topology: You can design the optimal MAS architecture (e.g., Sequential Pipeline, Router-based, Hierarchical, or Blackboard) based on Stage 1’s logical dependencies. For those complex b...

[9] [9]

- Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task

Task Decomposition Module (The "What") Core Objective: Analyze the user query and break it down into a logical blueprint. - Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task. - Sub-task Breakdown: Decompose the high-level request into a set of discrete, manageable, and logically cohe...

[10] [10]

This node frames the entire problem space and ensures all downstream agents operate under a shared understanding

Context-Scoping Root Node: A dedicated root sub-task that defines scope, key concepts, metrics, terminology, and evaluation criteria before any analytical work begins. This node frames the entire problem space and ensures all downstream agents operate under a shared understanding

[11] [11]

These must be designed to run in parallel from the context-scoping root, with no intermediate sequential dependencies among them

Parallel Analytical Branches: One sub-task per distinct analytical component (capped at four branches). These must be designed to run in parallel from the context-scoping root, with no intermediate sequential dependencies among them

[12] [12]

Capability to analyze methods for

Dedicated Synthesis Terminal Node: A final sub-task that receives the outputs of all parallel branches and integrates them into the requested cohesive output (e.g., report, article, synthesis). The synthesis node must be the only terminal node. - Hard Constraint: Strict sequential chaining of analytical components is disallowed for such tasks. If the quer...
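The three structural rules above (a context-scoping root, at most four parallel branches, and a single synthesis terminal node, with no sequential chaining between branches) can be checked mechanically on a candidate task graph. The following is a minimal sketch under stated assumptions, not the paper's implementation: the edge-list representation and the name `validate_fanout_topology` are hypothetical and chosen only for illustration.

```python
def validate_fanout_topology(edges, root, terminal, max_branches=4):
    """Return a list of violations of the root -> parallel branches -> synthesis shape.

    `edges` is a list of (source, target) sub-task names. This representation
    is an assumption made for the sketch; the source does not specify one.
    """
    nodes = {name for edge in edges for name in edge}
    if root not in nodes or terminal not in nodes:
        return ["root or terminal node missing from edges"]

    children = {name: set() for name in nodes}
    parents = {name: set() for name in nodes}
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)

    problems = []
    branches = nodes - {root, terminal}
    if not 1 <= len(branches) <= max_branches:
        problems.append(f"expected 1..{max_branches} branches, got {len(branches)}")
    for branch in sorted(branches):
        # A branch may depend only on the root, so branches never chain.
        if parents[branch] != {root}:
            problems.append(f"{branch}: must depend only on the root")
        # A branch hands its output straight to the synthesis node.
        if children[branch] != {terminal}:
            problems.append(f"{branch}: must feed only the synthesis node")
    # The synthesis node must be the only node without outgoing edges.
    if sorted(n for n in nodes if not children[n]) != [terminal]:
        problems.append("synthesis node must be the only terminal node")
    return problems


parallel = [("scope", "a"), ("scope", "b"), ("a", "synth"), ("b", "synth")]
chained = [("scope", "a"), ("a", "b"), ("b", "synth")]
print(validate_fanout_topology(parallel, "scope", "synth"))
print(validate_fanout_topology(chained, "scope", "synth"))
```

The parallel graph passes with no violations, while the chained graph is rejected because branch `a` feeds `b` instead of the synthesis node and `b` depends on `a` instead of the root.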

[13] [13]

Soft pass: coverage of <missing terms> and token count <X%> below threshold. Please expand in synthesis

Agent Engineering Module (The "Who") Core Objective: Design specialized sub-agents tailored for the sub-tasks defined in Stage 1. - Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task. - Instruction Design: Draft precise system prompts/instructions. Define the agent’s specific goals, behavioral boun...

[14] [14]

The following required dimension appears to have insufficient coverage: <dimension>. You must include a dedicated section addressing it

Workflow & Orchestration Module (The "How") Core Objective: Wire the distinct agents from Stage 2 into a functional, executable Multi-Agent System (MAS). - Architectural Topology: Design the optimal MAS architecture (e.g., Sequential Pipeline, Router-based, Hierarchical, or Blackboard) based on Stage 1’s logical dependencies. For complex sub-tasks, embed ...

[15] [15]

- Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task

Task Decomposition Module (The "What") Core Objective: Analyze the user query and break it down into a logical blueprint. - Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task. - Sub-task Breakdown: Decompose the high-level request into a set of discrete, manageable, and logically cohe...

[16] [16]

verification report

Agent Engineering Module (The "Who") Core Objective: Design specialized sub-agents tailored for the sub-tasks defined in Stage 1. - Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task. - Instruction Design: Draft precise system prompts/instructions. Define the agent’s specific goals, behavioral boun...

[17] [17]

Workflow & Orchestration Module (The "How") Core Objective: Wire the distinct agents from Stage 2 into a functional, executable Multi-Agent System (MAS). - Architectural Topology: You can design the optimal MAS architecture (e.g., Sequential Pipeline, Router-based, Hierarchical, or Blackboard) based on Stage 1’s logical dependencies. For those complex b...

[18] [18]

- Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task

Task Decomposition Module (The "What") Core Objective: Analyze the user query and break it down into a logical blueprint. - Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task. - Sub-task Breakdown: Decompose the high-level request into a set of discrete, manageable, and logically cohe...

[19] [19]

Best guess: [answer] (unverified constraints: [list])

Agent Engineering Module (The "Who") Core Objective: Design specialized sub-agents tailored for the sub-tasks defined in Stage 1. - Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task. - Weighted Constraint Satisfaction Protocol with Partial-Evidence Fallback: Every agent that evaluates or synthesiz...

[20] [20]

Workflow & Orchestration Module (The "How") Core Objective: Wire the distinct agents from Stage 2 into a functional, executable Multi-Agent System (MAS). - Architectural Topology: You can design the optimal MAS architecture (e.g., Sequential Pipeline, Router-based, Hierarchical, or Blackboard) based on Stage 1’s logical dependencies. For complex but imp...

[21] [21]

- Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task

Task Decomposition Module (The "What") Core Objective: Analyze the user query and break it down into a logical blueprint. - Intent & Scope Analysis: Understand the macro objective, identify core requirements, and define the boundaries of the task. - Sub-task Breakdown: Decompose the high-level request into a set of discrete, manageable, and logically cohe...

[22] [22]

Selector

Agent Engineering Module (The "Who") Core Objective: Design specialized sub-agents tailored for the sub-tasks defined in Stage 1. - Role Profiling: Assign a unique identity and specialized role to each sub-agent based on its target sub-task. - Instruction Design: Draft precise system prompts/instructions. Define the agent’s specific goals, behavioral boun...

[23] [23]

Reality Check

Workflow & Orchestration Module (The "How") Core Objective: Wire the distinct agents from Stage 2 into a functional, executable Multi-Agent System (MAS). - Architectural Topology: Design the optimal MAS architecture (e.g., Sequential Pipeline, Router-based, Hierarchical, or Blackboard) based on Stage 1’s logical dependencies. For complex but important sub...

[24] [24]

Analyze Each Criterion: Consider how each article fulfills the requirements of each criterion

[25] [25]

Comparative Evaluation: Analyze how the two articles perform on each criterion, referencing the content and criterion explanation

[26] [26]

Standard 1

Score Separately: Based on your comparative analysis, score each article on each criterion (0-10 points). Scoring Rules For each criterion, score both articles on a scale of 0-10 (continuous values). The score should reflect the quality of performance on that criterion: - 0-2 points: Very poor performance. Almost completely fails to meet the criterion req...

[27] [27]

trajectory fails due to poor reasoning

Be specific: Avoid vague statements like "trajectory fails due to poor reasoning". Instead: "trajectory fails at step 5 because it incorrectly assumes X when the constraint requires Y"

[28] [28]

Reference specific steps, actions, or outputs

Use evidence: Ground every claim in concrete observations. Reference specific steps, actions, or outputs

[29] [29]

The DIFFERENCE is where the insight lies

Think contrastively: Always compare high vs low trajectories. The DIFFERENCE is where the insight lies

[30] [30]

task is too hard

Focus on actionability: Every diagnosis should lead to a concrete, implementable fix. Avoid unfixable issues like "task is too hard"

[31] [31]

Quantify when possible: Use numbers (frequencies, percentages, counts) to support claims about patterns

[32] [32]

Start with { and end with }

Output pure JSON: No markdown code blocks, no extra text. Start with { and end with }. Begin your analysis now.

Figure 19: Within-task reflection prompt in Skill-MAS evolution.

Skill-MAS Evolution (Cross-Task Reflection) System Prompt

You are the diagnosis agent for Skill_MAS Step 2 (Trajectory Reflection Synthesis). Your task is to synthesize cross- s...
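The "Output pure JSON" rule that closes the within-task reflection prompt (Figure 19) is easy to enforce on the receiving side. Below is a minimal sketch of such a check; the function name `parse_pure_json` is hypothetical, and tolerating surrounding whitespace is an assumption, since the prompt only states that the reply must start with { and end with }.

```python
import json


def parse_pure_json(reply):
    """Parse a model reply that must be a bare JSON object.

    Raises ValueError if the reply is wrapped in a code fence, carries extra
    text, or is not valid JSON.
    """
    text = reply.strip()  # assumption: leading/trailing whitespace is tolerated
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("reply must start with { and end with }")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc


print(parse_pure_json('{"task_id": "t1", "num_trajectories": 4}'))
```

A reply wrapped in a markdown code fence fails the first check, so the caller can retry or discard it before the reflection output is passed on.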

[33] [33]

High cross-trajectory volatility: Large score variance across rollouts indicates unstable/inconsistent policy behavior

[34] [34]

struggles with multi-step reasoning

High intrinsic difficulty: Low average scores suggest systematic capability gaps === INPUT DATA === Phase 1 already analyzed each task’s rollouts. Below, each block contains: (1) the original problem / instruction text, and (2) the COMPLETE Phase-1 structured JSON for that task — every field in the Phase-1 schema (task_id, num_trajectories, score_statisti...
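The two selection signals named here, score variance across rollouts and low average score, can be computed directly from per-task rollout scores. The sketch below is illustrative only: the thresholds `low_mean` and `high_var`, the score scale of 0 to 1, and the function name `select_priority_tasks` are assumptions, not values or names taken from the paper.

```python
from statistics import mean, pvariance


def select_priority_tasks(scores_by_task, low_mean=0.5, high_var=0.05):
    """Flag tasks whose rollouts are unstable or consistently weak.

    `scores_by_task` maps a task id to the scores of its sampled trajectories.
    Both thresholds are hypothetical defaults chosen for this example.
    """
    selected = {}
    for task_id, scores in scores_by_task.items():
        if not scores:
            continue  # nothing to analyze for this task
        avg = mean(scores)
        var = pvariance(scores)
        reasons = []
        if var >= high_var:
            reasons.append("high_volatility")  # large variance across rollouts
        if avg <= low_mean:
            reasons.append("high_difficulty")  # low average score
        if reasons:
            selected[task_id] = {"mean": avg, "variance": var, "reasons": reasons}
    return selected


rollouts = {
    "task_a": [1.0, 0.0, 1.0, 1.0],  # mostly solved, but unstable
    "task_b": [0.1, 0.2, 0.1, 0.2],  # stable, but consistently weak
    "task_c": [0.9, 1.0, 0.9, 1.0],  # stable and strong
}
priority = select_priority_tasks(rollouts)
for task_id, info in priority.items():
    print(task_id, info["reasons"])
```

In this toy input, `task_a` is selected for volatility, `task_b` for difficulty, and `task_c` is left out of the reflection set.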

[35] [35]

Be specific: Tie weaknesses/strengths to task_ids and concrete themes from the summaries when possible

[36] [36]

Use evidence: Ground claims in the Phase-1 structured outputs and task text — do not invent unseen trajectory detail

[37] [37]

Think globally: Patterns across samples drive prioritization

[38] [38]

Focus on actionability: prioritized_fixes must be implementable in Step 3

[39] [39]

Quantify when possible: Use counts where summaries allow

[40] [40]

if text contains ’and’, split it

Output pure JSON: No markdown code blocks, no extra text. Start with { and end with }. Begin your synthesis now.

Figure 20: Cross-task reflection prompt in Skill-MAS evolution.

Skill-MAS Evolution (Skill Optimization) System Prompt

You are an expert author and optimizer for Skill-MAS three-stage SKILL.md files. Your task is to improve the current SKILL....

[41] [41]

Evidence-Driven Abstraction: Every change must resolve a flaw found in Step2, but the solution MUST be abstracted into a universal systems-engineering principle

[42] [42]

Make dependencies clear

Meaningful Depth: Do not just add adjectives. Add new sub-bullet points that introduce a concrete conceptual framework (e.g., instead of "Make dependencies clear", use "Build a Directed Acyclic Graph (DAG) mapping of logic state transitions")

[43] [43]

Do not pile on multiple unrelated changes in a single pass

Incremental evolution (hard limit): In this round, introduce at most one substantive conceptual upgrade per SKILL stage section (1, 2, 3 — each stage at most one focused improvement). Do not pile on multiple unrelated changes in a single pass

[44] [44]

Format Requirements: - MUST start directly with the YAML frontmatter (---)

Output Format: Produce ONLY the complete updated SKILL.md.
Format Requirements:
- MUST start directly with the YAML frontmatter (---).
- MUST preserve the exact same YAML keys and the exactly three-stage markdown structure (1, 2, 3).
- NO markdown code fences around the entire output.
- NO preamble, NO explanations, NO summary of changes.
Output raw SKILL...
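Some of these format requirements can be verified before an updated SKILL.md is accepted. The sketch below checks only the rules that are fully stated above: the output starts directly with a `---` frontmatter block, keeps the same top-level YAML keys as the current file, and is not wrapped in a code fence. The helper names are hypothetical, the key parsing is a deliberately simple line scan rather than a real YAML parser, and the three-stage heading check is omitted because the heading format is not shown in the source.

```python
def frontmatter_keys(text):
    """Return the top-level keys of a leading '---' ... '---' block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # output does not start with frontmatter
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Only unindented "key: value" lines count as top-level keys.
        if line and not line[0].isspace() and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return None  # frontmatter block was never closed


def check_skill_output(new_text, old_text):
    """Return a list of format violations for a candidate SKILL.md."""
    problems = []
    new_keys = frontmatter_keys(new_text)
    if new_keys is None:
        problems.append("output must start directly with YAML frontmatter")
    elif new_keys != frontmatter_keys(old_text):
        problems.append("YAML keys differ from the current SKILL.md")
    if new_text.lstrip().startswith("```"):
        problems.append("output must not be wrapped in a code fence")
    return problems


old = "---\nname: demo\ndescription: first draft\n---\nbody\n"
good = "---\nname: demo\ndescription: revised\n---\nbody\n"
bad = "Here is the update:\n---\nname: demo\n---\nbody\n"
print(check_skill_output(good, old))
print(check_skill_output(bad, old))
```

The first candidate passes with no violations; the second is rejected because a preamble precedes the frontmatter.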

2023