Evolutionary Generation of Multi-Agent Systems
Pith reviewed 2026-05-21 13:25 UTC · model grok-4.3
The pith
Evolutionary search over system configurations generates multi-agent teams that outperform both hand-designed setups and prior automatic methods on reasoning, coding, and tool-use tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoMAS formulates MAS generation as structured configuration generation and performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. On benchmarks covering reasoning, software engineering, and tool-use, the resulting systems show consistent gains in performance, executability, and robustness over human-designed MAS and prior automatic generation methods.
What carries the argument
Evolutionary generation in configuration space, where mutation and crossover operators are conditioned on feedback from execution traces to iteratively improve candidate multi-agent system setups.
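The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: in EvoMAS the operators are driven by an LLM meta-model, whereas here `toy_evaluate`, `toy_mutate`, and `toy_crossover` are hypothetical stand-ins, and the `n_agents`/`verify` configuration fields, the two-parent selection rule, and the `keep` cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    config: dict       # structured MAS configuration (hypothetical schema)
    score: float = 0.0
    trace: str = ""    # summary of the last execution trace


def evolve(pool, evaluate, mutate, crossover, iterations=3, keep=4):
    """Refine a candidate pool and an experience memory over several rounds."""
    memory = []  # experience memory: (config, score, trace) records
    pool = list(pool)
    for cand in pool:
        cand.score, cand.trace = evaluate(cand.config)
    for _ in range(iterations):
        pool.sort(key=lambda c: c.score, reverse=True)
        first, second = pool[0], pool[1]
        children = [
            # Mutation is conditioned on the parent's own trace.
            Candidate(mutate(first.config, first.trace, memory)),
            # Crossover combines the two best candidates.
            Candidate(crossover(first.config, second.config, memory)),
        ]
        for child in children:
            child.score, child.trace = evaluate(child.config)
            memory.append((child.config, child.score, child.trace))
        # Elitist update: keep only the best `keep` candidates.
        pool = sorted(pool + children, key=lambda c: c.score, reverse=True)[:keep]
    return pool[0], memory


# Toy stand-ins (assumptions): the "task" is solved best by 3 agents plus a verifier.
def toy_evaluate(config):
    gap = config["n_agents"] - 3
    score = -abs(gap) + (1.0 if config["verify"] else 0.0)
    trace = "too few agents" if gap < 0 else "too many agents" if gap > 0 else "ok"
    return score, trace


def toy_mutate(config, trace, memory):
    child = dict(config)
    if "few" in trace:
        child["n_agents"] += 1
    elif "many" in trace:
        child["n_agents"] -= 1
    else:
        child["verify"] = True
    return child


def toy_crossover(a, b, memory):
    return {"n_agents": a["n_agents"], "verify": a["verify"] or b["verify"]}


initial = [
    Candidate({"n_agents": 1, "verify": False}),
    Candidate({"n_agents": 5, "verify": True}),
]
best, memory = evolve(initial, toy_evaluate, toy_mutate, toy_crossover)
```

Because the pool update is elitist, the best score in this sketch can never decrease between rounds; whether the real system uses the same retention rule is not stated in the material above.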
If this is right
- Generated configurations reach higher accuracy on reasoning tasks than previous evolutionary agent methods.
- Systems achieve strong results on verified software engineering benchmarks while maintaining high executability.
- The approach works across reasoning, planning, and tool-augmented tasks without relying on rigid architectural templates.
- Produced MAS exhibit fewer runtime errors compared to methods that generate code directly.
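One plausible reason a configuration-based approach yields fewer runtime errors is that a structured configuration can be checked before anything runs. The sketch below illustrates that idea only; the schema (an `agents` list with `name`, `prompt`, `model_id`, `tools`, and `reports_to` fields) is an assumption for this example and is not taken from the EvoMAS code.

```python
def validate_config(config):
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    agents = config.get("agents", [])
    names = [agent.get("name") for agent in agents]
    if len(set(names)) != len(names):
        problems.append("duplicate agent names")
    known = set(names)
    parent = {}
    for agent in agents:
        for key in ("name", "prompt", "model_id"):
            if not agent.get(key):
                problems.append(f"agent {agent.get('name')!r} is missing {key}")
        boss = agent.get("reports_to")
        if boss is not None and boss not in known:
            problems.append(f"agent {agent.get('name')!r} reports to unknown agent {boss!r}")
        parent[agent.get("name")] = boss
    roots = [name for name, boss in parent.items() if boss is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root agent, found {len(roots)}")
    # Walk up each reporting chain to detect cycles.
    for start in parent:
        seen = set()
        node = start
        while node is not None and node in parent:
            if node in seen:
                problems.append(f"cycle in reports_to chain starting at {start!r}")
                break
            seen.add(node)
            node = parent[node]
    return problems


# Hypothetical configurations; the model identifiers are placeholders.
good = {
    "agents": [
        {"name": "planner", "prompt": "Plan the task.", "model_id": "model-a",
         "tools": [], "reports_to": None},
        {"name": "solver", "prompt": "Solve each step.", "model_id": "model-b",
         "tools": ["python"], "reports_to": "planner"},
    ]
}
bad = {
    "agents": [
        {"name": "planner", "prompt": "Plan the task.", "model_id": "model-a",
         "tools": [], "reports_to": None},
        {"name": "solver", "prompt": "Solve each step.", "model_id": "model-b",
         "tools": ["python"], "reports_to": "ghost"},
    ]
}
good_problems = validate_config(good)
bad_problems = validate_config(bad)
```

A generated program offers no comparable pre-flight check short of executing it, which is the contrast the bullet above draws.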
Where Pith is reading between the lines
- If the configuration-based evolution scales reliably, it could lower the expertise barrier for deploying multi-agent systems in new application areas.
- Similar evolutionary search over structured configurations might apply to designing other composite AI systems such as tool chains or workflow planners.
- The experience memory mechanism could support ongoing adaptation if tasks or environments change after initial generation.
Load-bearing premise
Feedback from execution traces supplies reliable and generalizable signals for directing mutations and crossovers without causing overfitting to specific benchmarks or introducing undetected failure modes.
What would settle it
Measure whether performance and robustness gains persist when EvoMAS-generated systems are tested on entirely new task suites or environments that were never used during evolution or evaluation.
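A small harness along these lines could organize such a check. It is a sketch under stated assumptions: `evolve_on` and `evaluate_on` are hypothetical callables standing in for a full evolution run and a benchmark evaluation, the toy versions below return invented scores, and the benchmark names are used only as labels. Scoring a configuration on a benchmark it was not evolved on is a weaker proxy for the fully unseen task suites described above.

```python
def transfer_report(benchmarks, evolve_on, evaluate_on):
    """Evolve one configuration per source benchmark, then score it on every benchmark."""
    scores = {}
    for source in benchmarks:
        config = evolve_on(source)
        for target in benchmarks:
            scores[(source, target)] = evaluate_on(config, target)
    gaps = {}
    for source in benchmarks:
        for target in benchmarks:
            if source != target:
                # In-domain score on the target minus the transferred score.
                gaps[(source, target)] = scores[(target, target)] - scores[(source, target)]
    return scores, gaps


# Toy stand-ins (assumptions): a config scores 1.0 in-domain and 0.5 elsewhere.
def toy_evolve_on(benchmark):
    return {"tuned_for": benchmark}


def toy_evaluate_on(config, benchmark):
    return 1.0 if config["tuned_for"] == benchmark else 0.5


scores, gaps = transfer_report(["BBEH", "WorkBench"], toy_evolve_on, toy_evaluate_on)
```

A gap near zero would suggest the evolved structure transfers; a large gap would point toward benchmark-specific tuning.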
Original abstract
Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EvoMAS, which formulates the design of LLM-based multi-agent systems as evolutionary search over a structured configuration space. Configurations are initialized from a pool, then iteratively refined via feedback-conditioned mutation and crossover operations that draw on execution traces; an experience memory is maintained across iterations. The approach is evaluated on BBEH (reasoning), SWE-Bench (software engineering), and WorkBench (tool use), reporting consistent gains over both hand-designed MAS baselines and prior automatic generation methods such as EvoAgent (+10.5 points on BBEH, +7.1 on WorkBench) and a 79.1% score on SWE-Bench-Verified with Claude-4.5-Sonnet. The generated systems are also claimed to exhibit higher executability and runtime robustness.
Significance. If the performance and robustness claims are substantiated, the work would offer a practical route to reducing the manual effort currently required to engineer reliable multi-agent architectures. The public release of code supports reproducibility and could enable follow-up studies on configuration-space search for agentic systems.
major comments (3)
- [Experiments / Evaluation] The evaluation section provides point estimates of improvement but does not report statistical significance, variance across runs, or exact baseline re-implementations (including prompt templates and decoding parameters). Without these, it is difficult to determine whether the reported +10.5 and +7.1 point gains are robust or sensitive to implementation details.
- [Method / Evolutionary Generation] The method description indicates that mutation and crossover are conditioned on execution traces collected from the evaluation benchmarks. No explicit held-out validation buffer or cross-benchmark transfer experiment is described; this leaves open the possibility that the evolutionary process exploits benchmark-specific reward patterns rather than learning generally transferable MAS configurations.
- [Experiments / Ablations] Ablation studies isolating the contribution of the experience memory versus the trace-conditioned operators are not presented. Consequently, it remains unclear whether the performance edge over EvoAgent stems primarily from the evolutionary mechanism or from other unablated factors such as pool size or iteration count.
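The first major comment asks for variance across runs and a significance test. A minimal sketch of that reporting, using only the Python standard library, might look as follows. The score lists are invented placeholders for illustration and are not results from the paper; the function name `paired_summary` is likewise hypothetical.

```python
import math
import statistics


def paired_summary(system_scores, baseline_scores):
    """Summarize per-seed scores and compute a paired t statistic."""
    if len(system_scores) != len(baseline_scores) or len(system_scores) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences are constant; the t statistic is undefined")
    return {
        "system_mean": statistics.mean(system_scores),
        "system_std": statistics.stdev(system_scores),
        "baseline_mean": statistics.mean(baseline_scores),
        "baseline_std": statistics.stdev(baseline_scores),
        "mean_diff": statistics.mean(diffs),
        "t_statistic": statistics.mean(diffs) / (sd / math.sqrt(len(diffs))),
        "degrees_of_freedom": len(diffs) - 1,
    }


# Placeholder scores for five seeds (invented, NOT taken from the paper).
system = [3.0, 5.0, 4.0, 6.0, 7.0]
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = paired_summary(system, baseline)
```

The pairing assumes that run i of both systems shares the same seed and task subset. Converting the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; a routine such as `scipy.stats.ttest_rel` is one option.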
minor comments (2)
- [Method] Notation for configuration components (e.g., agent roles, communication topology) should be introduced with a compact table or diagram early in the method section to aid readability.
- [Abstract] The abstract states that EvoMAS matches the top of the SWE-Bench-Verified leaderboard; a brief comparison table against the current top entries would strengthen this claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the changes planned for the revised manuscript.
- Referee: [Experiments / Evaluation] The evaluation section provides point estimates of improvement but does not report statistical significance, variance across runs, or exact baseline re-implementations (including prompt templates and decoding parameters). Without these, it is difficult to determine whether the reported +10.5 and +7.1 point gains are robust or sensitive to implementation details.
Authors: We agree that statistical rigor and reproducibility details are essential. In the revision we will report results from five independent runs with different random seeds, including mean and standard deviation for all main metrics. We will also add paired statistical significance tests for the key gains over EvoAgent and human-designed baselines. A new appendix will document the exact prompt templates, decoding parameters (temperature, top-p, max tokens), and re-implementation choices for every baseline to allow exact reproduction. revision: yes
- Referee: [Method / Evolutionary Generation] The method description indicates that mutation and crossover are conditioned on execution traces collected from the evaluation benchmarks. No explicit held-out validation buffer or cross-benchmark transfer experiment is described; this leaves open the possibility that the evolutionary process exploits benchmark-specific reward patterns rather than learning generally transferable MAS configurations.
Authors: The evolutionary operators do use execution traces obtained while running on the target benchmarks. Although the three benchmarks cover distinct domains, we acknowledge that the absence of explicit transfer experiments leaves open the possibility of benchmark-specific tuning. In the revision we will add cross-benchmark transfer results: configurations evolved on BBEH will be evaluated on WorkBench (and vice versa), and we will report the resulting performance to demonstrate that the discovered structures generalize beyond the evolution benchmark. revision: yes
- Referee: [Experiments / Ablations] Ablation studies isolating the contribution of the experience memory versus the trace-conditioned operators are not presented. Consequently, it remains unclear whether the performance edge over EvoAgent stems primarily from the evolutionary mechanism or from other unablated factors such as pool size or iteration count.
Authors: We agree that component-level ablations are needed. The revised manuscript will include new ablation tables that (i) disable the experience memory while keeping all other settings fixed and (ii) replace trace-conditioned mutation and crossover with their unconditioned counterparts, again controlling for pool size and iteration count. These results will be reported on BBEH and WorkBench to clarify the source of the observed gains relative to EvoAgent. revision: yes
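The ablation design promised in the last response can be written down as a small grid in which each variant changes exactly one switch while pool size and iteration count stay fixed. This is only a sketch of the experimental layout: the `RunSettings` fields and the default values for `pool_size` and `iterations` are placeholders, not settings reported by the authors.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunSettings:
    use_memory: bool = True          # experience memory on or off
    trace_conditioned: bool = True   # trace-conditioned vs. unconditioned operators
    pool_size: int = 8               # placeholder value
    iterations: int = 5              # placeholder value


def ablation_variants(base):
    """Return named settings that each differ from `base` in a single switch."""
    return {
        "full": base,
        "no_experience_memory": replace(base, use_memory=False),
        "unconditioned_operators": replace(base, trace_conditioned=False),
    }


def changed_fields(base, variant):
    """List the field names whose values differ between two settings."""
    base_dict, variant_dict = asdict(base), asdict(variant)
    return [key for key in base_dict if base_dict[key] != variant_dict[key]]


base = RunSettings()
variants = ablation_variants(base)
```

Checking `changed_fields` for every variant before launching runs is a cheap guard against accidentally confounding an ablation with a change in search budget.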
Circularity Check
No significant circularity: empirical gains on external benchmarks with no self-referential derivations or fitted predictions
The paper presents EvoMAS as an evolutionary search procedure over MAS configuration space, using execution-trace feedback for mutation and crossover to refine candidates and an experience memory. Performance improvements (+10.5 on BBEH, +7.1 on WorkBench, 79.1% on SWE-Bench-Verified) are reported as measured outcomes on independent, externally defined benchmarks rather than quantities derived from the method's own parameters or self-citations. No equations appear in the provided description, and the central claims do not reduce by construction to inputs (e.g., no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is smuggled via self-citation). The derivation chain is therefore self-contained against external evaluation standards.
Axiom & Free-Parameter Ledger
free parameters (2)
- mutation and crossover rates
- pool size and iteration count