arxiv: 2602.06511 · v3 · pith:67D7U6JOnew · submitted 2026-02-06 · 💻 cs.LG

Evolutionary Generation of Multi-Agent Systems

Pith reviewed 2026-05-21 13:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-agent systems · evolutionary algorithms · large language models · automatic design · configuration optimization · agent collaboration · task performance · execution feedback

The pith

Evolutionary search over system configurations generates multi-agent teams that outperform both hand-designed setups and prior automatic methods on reasoning, coding, and tool-use tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLM-based multi-agent systems (MAS) can be generated more effectively by performing evolutionary operations directly in a configuration space than by generating code or committing to fixed architectural templates. EvoMAS starts from a pool of candidate configurations, uses execution traces to guide mutation and crossover, and maintains an experience memory while refining the pool over iterations. The paper reports that this process produces agent systems with higher task performance, higher executability, and better runtime robustness than human-designed alternatives or earlier generation techniques. A reader would care because manual MAS design is labor-intensive and brittle, so a method that reliably evolves working configurations could make complex agent applications easier to build and deploy across domains.

Core claim

EvoMAS formulates MAS generation as structured configuration generation and performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. On benchmarks covering reasoning, software engineering, and tool-use, the resulting systems show consistent gains in performance, executability, and robustness over human-designed MAS and prior automatic generation methods.
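
The loop described above (select parents from a pool, mutate and cross over using trace feedback, then refine the pool and an experience memory) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `Candidate`, `evolve`, the toy operators, and the rule of taking the two best-scoring candidates as parents are all assumptions made for the example. In the paper, selection and the operators are driven by a meta-model rather than by fixed functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]  # hypothetical stand-in for a structured MAS configuration


@dataclass
class Candidate:
    config: Config
    score: float = 0.0
    trace: str = ""


def evolve(
    pool: List[Candidate],
    execute: Callable[[Config], Tuple[float, str]],
    mutate: Callable[[Config, str], Config],
    crossover: Callable[[Config, Config, str], Config],
    generations: int = 3,
    pool_limit: int = 8,
) -> Tuple[List[Candidate], List[Dict[str, Any]]]:
    """Refine a candidate pool and an experience memory over several generations."""
    if len(pool) < 2 or pool_limit < 2:
        raise ValueError("crossover needs at least two candidates in the pool")
    pool = list(pool)
    memory: List[Dict[str, Any]] = []
    for cand in pool:  # score the initial pool once
        cand.score, cand.trace = execute(cand.config)
    for generation in range(generations):
        # Assumption: the two best-scoring candidates act as parents.
        first, second = sorted(pool, key=lambda c: c.score, reverse=True)[:2]
        children = [
            mutate(first.config, first.trace),
            crossover(first.config, second.config, first.trace),
        ]
        for config in children:
            score, trace = execute(config)
            pool.append(Candidate(config, score, trace))
            memory.append({"generation": generation, "config": config,
                           "score": score, "trace": trace})
        # Keep only the best candidates for the next generation.
        pool = sorted(pool, key=lambda c: c.score, reverse=True)[:pool_limit]
    return pool, memory


# Toy problem (purely illustrative): the "right" team size is 3 agents.
def toy_execute(config: Config) -> Tuple[float, str]:
    n = config["n_agents"]
    trace = "too_few" if n < 3 else "too_many" if n > 3 else "ok"
    return -abs(n - 3), trace


def toy_mutate(config: Config, trace: str) -> Config:
    step = {"too_few": 1, "too_many": -1, "ok": 0}[trace]
    return {"n_agents": config["n_agents"] + step}


def toy_crossover(a: Config, b: Config, trace: str) -> Config:
    return {"n_agents": (a["n_agents"] + b["n_agents"]) // 2}


start = [Candidate({"n_agents": 1}), Candidate({"n_agents": 6})]
final_pool, memory = evolve(start, toy_execute, toy_mutate, toy_crossover)
print(final_pool[0].config, final_pool[0].score)
```

The trace string is the only channel through which the toy mutation learns which direction to move, which is the role execution feedback plays in the description above.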

What carries the argument

Evolutionary generation in configuration space, where mutation and crossover operators are conditioned on feedback from execution traces to iteratively improve candidate multi-agent system setups.
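
To make "configuration space" concrete, here is a small sketch of a configuration with the components named in the Figure 1 caption (agent roles, model assignments, prompts, communication topology), together with two operators. The crossover inherits the whole topology from one parent, as the Figure 12 caption describes, and the mutation changes exactly one component type. All class, function, and model names are hypothetical, and plain functions stand in for the meta-model that makes these choices in the paper.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class AgentSpec:
    role: str
    model: str
    prompt: str


@dataclass(frozen=True)
class MASConfig:
    agents: Dict[str, AgentSpec]            # keyed by agent name
    topology: Tuple[Tuple[str, str], ...]   # (sender, receiver) edges


def mutate_one_component(config: MASConfig, component: str, value,
                         agent: Optional[str] = None) -> MASConfig:
    """Return a copy of config with exactly one component type changed."""
    if component == "topology":
        return replace(config, topology=tuple(value))
    if component in ("prompt", "model"):
        if agent not in config.agents:
            raise KeyError(f"unknown agent: {agent!r}")
        agents = dict(config.agents)
        agents[agent] = replace(agents[agent], **{component: value})
        return replace(config, agents=agents)
    raise ValueError(f"unsupported component: {component!r}")


def crossover(topology_parent: MASConfig, other: MASConfig,
              take_from_other: Iterable[str]) -> MASConfig:
    """Inherit the full topology from one parent; recombine agents per position."""
    chosen = set(take_from_other)
    agents = {}
    for name, spec in topology_parent.agents.items():
        donor = other.agents.get(name) if name in chosen else None
        agents[name] = donor if donor is not None else spec
    return MASConfig(agents=agents, topology=topology_parent.topology)


# Illustrative parents; "model-a" and "model-b" are placeholder model names.
parent_a = MASConfig(
    agents={"solver": AgentSpec("solver", "model-a", "Solve step by step."),
            "checker": AgentSpec("checker", "model-a", "Verify the answer.")},
    topology=(("solver", "checker"),),
)
parent_b = MASConfig(
    agents={"solver": AgentSpec("solver", "model-b", "Write code, then run it."),
            "checker": AgentSpec("checker", "model-b", "Check edge cases.")},
    topology=(("checker", "solver"), ("solver", "checker")),
)
child = crossover(parent_a, parent_b, take_from_other=["solver"])
mutant = mutate_one_component(child, "prompt", "Verify and explain.", agent="checker")
```

Because offspring are always assembled from well-formed fields, every result remains a valid configuration, which is the intuition behind the executability advantage the paper claims over direct code generation.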

If this is right

  • Generated configurations reach higher accuracy on reasoning tasks than previous evolutionary agent methods.
  • Systems achieve strong results on verified software engineering benchmarks while maintaining high executability.
  • The approach works across reasoning, planning, and tool-augmented tasks without relying on rigid architectural templates.
  • Produced MAS exhibit fewer runtime errors compared to methods that generate code directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the configuration-based evolution scales reliably, it could lower the expertise barrier for deploying multi-agent systems in new application areas.
  • Similar evolutionary search over structured configurations might apply to designing other composite AI systems such as tool chains or workflow planners.
  • The experience memory mechanism could support ongoing adaptation if tasks or environments change after initial generation.

Load-bearing premise

Feedback from execution traces supplies reliable and generalizable signals for directing mutations and crossovers without causing overfitting to specific benchmarks or introducing undetected failure modes.

What would settle it

Measure whether performance and robustness gains persist when EvoMAS-generated systems are tested on entirely new task suites or environments that were never used during evolution or evaluation.
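
One way to operationalize this check is to compare the gain over a baseline on tasks that were seen during evolution with the gain on tasks that were not. The sketch below assumes per-task scores are available as dictionaries; `gains_by_split` and the example task identifiers are hypothetical, and the scores are synthetic placeholders rather than results from the paper.

```python
from statistics import mean
from typing import Dict, Set, Tuple


def gains_by_split(evolved: Dict[str, float], baseline: Dict[str, float],
                   seen_tasks: Set[str]) -> Tuple[float, float]:
    """Return (mean gain on seen tasks, mean gain on unseen tasks)."""
    shared = evolved.keys() & baseline.keys()
    seen = [evolved[t] - baseline[t] for t in shared if t in seen_tasks]
    unseen = [evolved[t] - baseline[t] for t in shared if t not in seen_tasks]
    if not seen or not unseen:
        raise ValueError("both splits need at least one task")
    return mean(seen), mean(unseen)


# Synthetic placeholder scores (1.0 = solved, 0.0 = failed).
evolved_scores = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 1.0}
baseline_scores = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 1.0}
seen_gain, unseen_gain = gains_by_split(evolved_scores, baseline_scores, {"a", "b"})
```

A seen-task gain that is clearly larger than the unseen-task gain would point toward benchmark-specific tuning; similar gains on both splits would support the generalization claim.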

Figures

Figures reproduced from arXiv: 2602.06511 by Matthew Trager, Shuo Yang, Stefano Soatto, Wei Xia, Yi Zhang, Yuntong Hu, Yuting Zhang.

Figure 1: Overview of EvoMAS. Given a task, the MAS generator produces structured configurations specifying agent roles, model assignments, prompts, and communication topology. The MAS executor instantiates agents accordingly and executes the task. A verifier evaluates outputs to compute reward signals, which guide evolutionary optimization through mutation and selection over multiple generations. EvoMAS uses an LLM…
Figure 2: Trade-off between execution rate and task performance for MAS generation methods. Each point represents a method, with execution rate (%) on the x-axis and task performance (%) on the y-axis. EvoMAS achieves both high execution reliability and superior task performance across all benchmarks. 48.9%, exceeding the best per-LLM result (44.5% with Claude-3.5-Sonnet) by 4.4 points. On SWE-Bench-Verified, EvoMAS…
Figure 3: Results on state-of-the-art LLM (Claude-4.5-Sonnet). We compare Direct LLM Call, Single Agent, Majority Vote, and EvoMAS using Claude-4.5-Sonnet as both the MAS generator and agent backbone. EvoMAS demonstrates strong performance with the latest frontier model, achieving particularly notable results on SWE-Bench-Verified. Additional results with Claude-3.5-Sonnet and Claude-4-Sonnet are provided in Appendi…
Figure 4: Scaling ability on BBEH-Mini. Each line starts at its natural operating cost (circled). EvoMAS (⋆) outperforms all baselines across budgets and continues to improve with additional compute, while other methods plateau or degrade. ceeding the best predefined MAS results: MetaGPT reaches 40.9% (Claude-3.5-Sonnet), 44.3% (Qwen3-235B), and 43.8% (Qwen3-480B), demonstrating that evolutionary optimization disco…
Figure 5: Results with earlier Claude Sonnet models. We compare Direct LLM Call, Single Agent, Majority Vote, and EvoMAS using (a) Claude-3.5-Sonnet and (b) Claude-4-Sonnet as both the MAS generator and agent backbone. EvoMAS consistently outperforms baselines across both model generations. Performance scales with model capability, with Claude-4-Sonnet achieving substantially higher absolute performance than Claude-…
Figure 6: Evolution behavior analysis. (a)–(b) MAS population growth during evolution on SWE-Bench-Verified and BBEH-Mini, showing how new configurations join the population pool as queries are processed. (c) Performance variation on BBEH-Mini when shuffling the query order with different random seeds, displayed as deviations from the default-order baseline (49.1%). As shown in …
Figure 7: Performance comparison across different backbone model configurations. (a) Direct LLM Call: raw model performance without agent scaffolding. (b) Single Agent: performance with tool-augmented single agent. (c) EvoMAS: our method with self-selective model assignment, where the evolutionary process dynamically assigns optimal models to each agent role. Self-Selective consistently outperforms single-backbone v…
Figure 8: Distribution of mutation operations across configuration components for BBEH tasks. As BBEH does not provide external tools, mutations are applied only to prompt editing, communication topology, and model selection.
Figure 9: Schematic YAML template for EvoMAS configurations. Concrete configurations instantiate this template with task-specific agents, models, and coordination patterns. B.2.1. BBEH & WORKBENCH For reasoning and tool-use benchmarks (BBEH, WorkBench), we implement the runtime based on HuggingFace’s smolagents framework. The agent type field specifies the agent implementation. For WorkBench tasks, agents are config…
Figure 10: Meta-model prompt for parent selection and initialization. The prompt instructs the model to analyze task requirements and select configurations based on similarity, capabilities, and diversity. For initialization, selected configurations are adapted to the target task before evolution begins. Mutation Prompt. The mutation prompt constrains the meta-model to focus on exactly one component type per mutatio…
Figure 12: Meta-model prompt for crossover operations. The prompt ensures structural integrity by requiring topology inheritance from a single parent. D.2. Task Execution Prompts For baselines (Direct LLM Call, Single Agent) and MAS agent roles, we use simple, minimal prompts that present the task directly without elaborate scaffolding. Direct LLM Call and Single Agent. For Direct LLM Call and Single Agent baselines…
Original abstract

Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces EvoMAS, which formulates the design of LLM-based multi-agent systems as evolutionary search over a structured configuration space. Configurations are initialized from a pool, then iteratively refined via feedback-conditioned mutation and crossover operations that draw on execution traces; an experience memory is maintained across iterations. The approach is evaluated on BBEH (reasoning), SWE-Bench (software engineering), and WorkBench (tool use), reporting consistent gains over both hand-designed MAS baselines and prior automatic generation methods such as EvoAgent (+10.5 points on BBEH, +7.1 on WorkBench) and a 79.1% score on SWE-Bench-Verified with Claude-4.5-Sonnet. The generated systems are also claimed to exhibit higher executability and runtime robustness.

Significance. If the performance and robustness claims are substantiated, the work would offer a practical route to reducing the manual effort currently required to engineer reliable multi-agent architectures. The public release of code supports reproducibility and could enable follow-up studies on configuration-space search for agentic systems.

major comments (3)
  1. [Experiments / Evaluation] The evaluation section provides point estimates of improvement but does not report statistical significance, variance across runs, or exact baseline re-implementations (including prompt templates and decoding parameters). Without these, it is difficult to determine whether the reported +10.5 and +7.1 point gains are robust or sensitive to implementation details.
  2. [Method / Evolutionary Generation] The method description indicates that mutation and crossover are conditioned on execution traces collected from the evaluation benchmarks. No explicit held-out validation buffer or cross-benchmark transfer experiment is described; this leaves open the possibility that the evolutionary process exploits benchmark-specific reward patterns rather than learning generally transferable MAS configurations.
  3. [Experiments / Ablations] Ablation studies isolating the contribution of the experience memory versus the trace-conditioned operators are not presented. Consequently, it remains unclear whether the performance edge over EvoAgent stems primarily from the evolutionary mechanism or from other unablated factors such as pool size or iteration count.
minor comments (2)
  1. [Method] Notation for configuration components (e.g., agent roles, communication topology) should be introduced with a compact table or diagram early in the method section to aid readability.
  2. [Abstract] The abstract states that EvoMAS, with 79.1% on SWE-Bench-Verified, matches the top of the leaderboard; a brief comparison table against the current top entries would strengthen this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the changes planned for the revised manuscript.

  1. Referee: [Experiments / Evaluation] The evaluation section provides point estimates of improvement but does not report statistical significance, variance across runs, or exact baseline re-implementations (including prompt templates and decoding parameters). Without these, it is difficult to determine whether the reported +10.5 and +7.1 point gains are robust or sensitive to implementation details.

    Authors: We agree that statistical rigor and reproducibility details are essential. In the revision we will report results from five independent runs with different random seeds, including mean and standard deviation for all main metrics. We will also add paired statistical significance tests for the key gains over EvoAgent and human-designed baselines. A new appendix will document the exact prompt templates, decoding parameters (temperature, top-p, max tokens), and re-implementation choices for every baseline to allow exact reproduction. revision: yes

  2. Referee: [Method / Evolutionary Generation] The method description indicates that mutation and crossover are conditioned on execution traces collected from the evaluation benchmarks. No explicit held-out validation buffer or cross-benchmark transfer experiment is described; this leaves open the possibility that the evolutionary process exploits benchmark-specific reward patterns rather than learning generally transferable MAS configurations.

    Authors: The evolutionary operators do use execution traces obtained while running on the target benchmarks. Although the three benchmarks cover distinct domains, we acknowledge that the absence of explicit transfer experiments leaves open the possibility of benchmark-specific tuning. In the revision we will add cross-benchmark transfer results: configurations evolved on BBEH will be evaluated on WorkBench (and vice versa), and we will report the resulting performance to demonstrate that the discovered structures generalize beyond the evolution benchmark. revision: yes

  3. Referee: [Experiments / Ablations] Ablation studies isolating the contribution of the experience memory versus the trace-conditioned operators are not presented. Consequently, it remains unclear whether the performance edge over EvoAgent stems primarily from the evolutionary mechanism or from other unablated factors such as pool size or iteration count.

    Authors: We agree that component-level ablations are needed. The revised manuscript will include new ablation tables that (i) disable the experience memory while keeping all other settings fixed and (ii) replace trace-conditioned mutation and crossover with their unconditioned counterparts, again controlling for pool size and iteration count. These results will be reported on BBEH and WorkBench to clarify the source of the observed gains relative to EvoAgent. revision: yes
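
The reporting promised in the first response (mean and standard deviation over seeded runs, plus a paired significance test) could be computed along the following lines. This is a sketch using only the Python standard library. The exact sign-flip permutation test is one possible choice of paired test and is an assumption here, since the rebuttal does not name a specific test. The scores are synthetic placeholders, not results from the paper.

```python
import itertools
import statistics
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)


def paired_sign_flip_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation p-value for paired scores."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Synthetic placeholder scores for five seeds (same seeds for both methods).
method_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline_scores = [0.5, 1.0, 2.0, 2.5, 3.0]
mean_score, std_score = summarize(method_scores)
p_value = paired_sign_flip_p(method_scores, baseline_scores)
```

With five paired runs there are only 32 sign assignments, so the smallest two-sided p-value this test can return is 2/32 = 0.0625. Reaching a 0.05 threshold with this particular test would therefore need at least six seeds, or pairing at the level of individual tasks rather than whole runs.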

Circularity Check

0 steps flagged

No significant circularity: empirical gains on external benchmarks with no self-referential derivations or fitted predictions

The paper presents EvoMAS as an evolutionary search procedure over MAS configuration space, using execution-trace feedback for mutation and crossover to refine candidates and an experience memory. Performance improvements (+10.5 on BBEH, +7.1 on WorkBench, 79.1% on SWE-Bench-Verified) are reported as measured outcomes on independent, externally defined benchmarks rather than quantities derived from the method's own parameters or self-citations. No equations appear in the provided description, and the central claims do not reduce by construction to inputs (e.g., no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled via self-citation). The derivation chain is therefore self-contained against external evaluation standards.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Method rests on standard evolutionary operators and benchmark evaluations; no new physical entities or unproven mathematical axioms introduced beyond typical hyperparameter choices in evolutionary search.

free parameters (2)
  • mutation and crossover rates
    Evolutionary algorithms require tunable probabilities for applying mutation and crossover operators to configurations.
  • pool size and iteration count
    Size of initial configuration pool and number of evolutionary generations are design choices that affect search behavior.
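
The two ledger entries can be made explicit as a small settings object, which helps when controlling for them in ablations. The sketch below is illustrative only: `SearchBudget`, its field names, and the example values are assumptions, and the paper's actual settings are not stated in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchBudget:
    mutation_rate: float   # probability of applying mutation to a parent
    crossover_rate: float  # probability of applying crossover to a parent pair
    pool_size: int         # number of configurations in the initial pool
    iterations: int        # number of evolutionary generations

    def __post_init__(self) -> None:
        for name in ("mutation_rate", "crossover_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        if self.pool_size < 2:
            raise ValueError("pool_size must be at least 2 for crossover")
        if self.iterations < 1:
            raise ValueError("iterations must be at least 1")

    def max_evaluations(self, children_per_iteration: int) -> int:
        """Upper bound on executions: initial pool plus all offspring."""
        return self.pool_size + self.iterations * children_per_iteration


budget = SearchBudget(mutation_rate=0.5, crossover_rate=0.5, pool_size=4, iterations=3)
```

Holding `pool_size` and `iterations` fixed across compared methods bounds the number of executions, so differences in accuracy are less likely to come from unequal compute.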

pith-pipeline@v0.9.0 · 5798 in / 1092 out tokens · 47121 ms · 2026-05-21T13:25:11.603586+00:00 · methodology
