TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
Pith reviewed 2026-05-12 04:49 UTC · model grok-4.3
The pith
Jointly adapting agent capabilities rapidly and communication topology slowly during test time improves LLM multi-agent system performance on complex tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agents' birth-death operations on the MAS, including edge edits, agent addition, and agent removal.
What carries the argument
Online graph adaptation framework with a fast capability loop that refreshes agent expertise from trajectory feedback and a slow meta-LLM topology loop that executes birth-death operations and edge edits to reach a task-conditioned stable equilibrium.
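The fast-slow machinery can be made concrete with a toy sketch. This is an illustrative reconstruction, not the released TacoMAS code: `Agent`, `fast_capability_step`, and the threshold-based pruning rule are stand-ins for the paper's trajectory-level feedback and meta-LLM-driven birth-death decisions.

```python
import itertools

class Agent:
    _ids = itertools.count()

    def __init__(self, role):
        self.id = next(Agent._ids)
        self.role = role
        self.expertise = 0.5  # scalar stand-in for role-specific capability

def fast_capability_step(agents, feedback, lr=0.5):
    """Fast loop: runs every step, nudging each agent's expertise toward
    its trajectory-level feedback (exponential moving average)."""
    for a in agents:
        a.expertise += lr * (feedback.get(a.role, 0.0) - a.expertise)

def slow_topology_step(agents, edges, floor=0.2):
    """Slow loop: runs only every K steps. A meta-controller (a meta-LLM
    in the paper; a fixed threshold rule here) performs birth-death and
    edge edits: agents whose expertise stays below `floor` are removed,
    along with any edges touching them."""
    dead = {a.id for a in agents if a.expertise < floor}
    agents[:] = [a for a in agents if a.id not in dead]
    edges[:] = [(u, v) for (u, v) in edges if u not in dead and v not in dead]

def run(agents, edges, feedback_stream, slow_every=5):
    for step, feedback in enumerate(feedback_stream):
        fast_capability_step(agents, feedback)    # every step
        if step % slow_every == slow_every - 1:   # far less often
            slow_topology_step(agents, edges)
    return agents, edges

agents = [Agent("planner"), Agent("coder"), Agent("critic")]
edges = [(0, 1), (1, 2), (0, 2)]
# Feedback consistently rewards planner/coder and penalizes critic.
stream = [{"planner": 1.0, "coder": 0.8, "critic": 0.0}] * 10
agents, edges = run(agents, edges, stream)
print([a.role for a in agents])  # the slow loop prunes the critic
```

The point of the sketch is the rate separation: expertise changes at every step, while structural edits fire only at a coarser cadence, so a briefly underperforming agent is not pruned on the basis of a single noisy step.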
If this is right
- Multi-agent systems become able to respond to emerging subtasks without a pre-fixed structure.
- Coordination remains stable because topology evolves on a slower schedule than capabilities.
- The overall system converges to a task-conditioned stable equilibrium rather than oscillating.
- Average performance rises 13.3% above the strongest of nearly twenty prior multi-agent baselines.
- Joint dual-axis adaptation at inference time outperforms methods limited to one axis or static topology.
Where Pith is reading between the lines
- The fast-slow separation may generalize to other adaptive AI systems in which functional updates must not destabilize underlying structure.
- Longer inference traces could be used to measure whether slow topology changes accumulate into more efficient agent teams over repeated tasks.
- The framework implies that many current multi-agent designs could be improved by relaxing fixed topologies in favor of test-time structural edits.
- Similar rate-separated loops might be tested in non-LLM agent populations or in domains where coordination costs are higher.
Load-bearing premise
The meta-LLM can reliably execute agent birth, death, and edge-edit operations without causing instability or needing per-task tuning.
What would settle it
An experiment in which single-rate adaptation of both capabilities and topology matches or exceeds TacoMAS accuracy on the same four benchmarks would falsify the necessity of distinct time scales.
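The proposed ablation can be previewed with a deterministic toy, entirely a construction of this note rather than the paper's LLM-based setup: specialist agent "A" needs a few practice steps before its reward exceeds that of fixed-reward generalist "B". In this toy, a single-rate schedule switches away from A before it can specialize, while a slower topology loop lets A win.

```python
# Illustrative single-rate vs. two-rate comparison (hypothetical toy,
# not the paper's benchmarks): capability updates every step; the
# topology (routing) decision updates every `topology_every` steps.

def run(total_steps=10, topology_every=1, lr=0.3):
    c_a = 0.0        # A's expertise; improves only while A is selected
    reward_b = 0.6   # B's reward is fixed (no learning)
    active = "A"
    for step in range(total_steps):
        if active == "A":                     # fast capability update
            c_a += lr * (1.0 - c_a)
        if (step + 1) % topology_every == 0:  # topology update
            active = "A" if c_a > reward_b else "B"
    return active, (c_a if active == "A" else reward_b)

single_rate = run(topology_every=1)  # topology reacts every step
two_rate = run(topology_every=5)     # topology reacts 5x more slowly
print(single_rate, two_rate)
```

If real single-rate adaptation matched two-rate accuracy on the four benchmarks, the analogous gap here would vanish, which is exactly what the falsification test above probes.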
read the original abstract
Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or capability during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agents' birth-death operations on MAS, including edge edit, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The codes are released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TacoMAS, a test-time co-evolution framework for LLM-based multi-agent systems. It claims that effective inference-time adaptation requires jointly evolving agent capabilities (via a fast loop using trajectory-level feedback) and communication topology (via a slow meta-LLM-driven loop performing birth-death and edge-edit operations), with the two operating on different time scales to reach a task-conditioned stable equilibrium. The work reports an average 13.3% improvement over nearly 20 multi-agent baselines across four benchmarks and releases code.
Significance. If the fast-slow separation proves stable and general without per-task tuning, the framework could advance dynamic MAS design by formalizing online graph adaptation with differentiated time scales. The public code release is a clear strength for reproducibility.
major comments (2)
- The abstract states that the fast-slow design 'drives MAS evolution toward a task-conditioned stable equilibrium,' yet the provided description contains no convergence analysis, Lyapunov-style argument, or explicit stability condition for the meta-LLM topology loop when the fast capability loop alters agent behaviors; without such support the attribution of gains to the time-scale separation rather than empirical calibration remains unverified.
- The skeptic's concern is load-bearing: the slow meta-LLM topology loop's birth-death and edge-edit decisions are described as 'meta-LLM-driven' without an ablation or sensitivity analysis showing that these operations remain stable across task shifts or rapid capability changes; if prompt engineering or temperature settings must be adjusted per benchmark, the claimed generality of the fast-slow distinction collapses.
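For context, the standard formal home for the kind of convergence claim the referee requests is two-timescale stochastic approximation. A generic sketch, with symbols chosen here rather than taken from the paper (capabilities $x_n$ fast, topology $y_n$ slow):

```latex
x_{n+1} = x_n + a_n\, h(x_n, y_n), \qquad
y_{n+1} = y_n + b_n\, g(x_n, y_n), \qquad
\frac{b_n}{a_n} \to 0,
```

with step sizes satisfying $\sum_n a_n = \sum_n b_n = \infty$ and $\sum_n (a_n^2 + b_n^2) < \infty$. Under the separation $b_n/a_n \to 0$, the slow iterate $y_n$ effectively sees the fast iterate $x_n$ as already equilibrated, which is the shape of argument a formal stability claim for the fast-slow design would need.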
minor comments (2)
- The abstract mentions four benchmarks but does not name them; adding the names would help readers immediately contextualize the 13.3% claim.
- Ensure that the exact prompt templates and decision criteria for the slow topology loop are included in the main text or a clearly referenced appendix so that the meta-LLM operations can be reproduced without reverse-engineering.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to improve the manuscript. We address each major comment below with point-by-point responses, clarifying our contributions while committing to revisions where appropriate.
read point-by-point responses
-
Referee: The abstract states that the fast-slow design 'drives MAS evolution toward a task-conditioned stable equilibrium,' yet the provided description contains no convergence analysis, Lyapunov-style argument, or explicit stability condition for the meta-LLM topology loop when the fast capability loop alters agent behaviors; without such support the attribution of gains to the time-scale separation rather than empirical calibration remains unverified.
Authors: We appreciate this observation regarding the need for stronger theoretical grounding. The manuscript motivates the fast-slow separation through both empirical results (consistent performance gains and observed stabilization of topologies across four benchmarks) and a theoretical argument in the introduction and method sections that rapid capability updates handle subtasks while slower topology changes preserve coordination. However, we acknowledge that no formal convergence proof, Lyapunov analysis, or explicit stability condition is provided. In the revision, we will expand the discussion section to include a more detailed qualitative analysis of stability under time-scale separation, along with additional plots of topology evolution trajectories demonstrating convergence behavior. This will better substantiate the attribution of gains to the design. revision: partial
-
Referee: The skeptic's concern is load-bearing: the slow meta-LLM topology loop's birth-death and edge-edit decisions are described as 'meta-LLM-driven' without an ablation or sensitivity analysis showing that these operations remain stable across task shifts or rapid capability changes; if prompt engineering or temperature settings must be adjusted per benchmark, the claimed generality of the fast-slow distinction collapses.
Authors: We agree that robustness to meta-LLM variations is essential to support the generality claim. The manuscript already contains ablations (detailed in the experiments section) that isolate the fast capability loop from the slow topology loop and quantify their joint contribution to the 13.3% average improvement. To directly address sensitivity, the revised version will add a new subsection with experiments varying meta-LLM prompt phrasing and temperature settings across all benchmarks. These will show that birth-death and edge-edit decisions remain effective without benchmark-specific retuning, thereby reinforcing that the fast-slow distinction does not rely on per-task calibration. revision: yes
Circularity Check
No circularity: the claims rest on empirical measurements and a stated theoretical separation, not on a reduction to the framework's own inputs.
full rationale
The paper introduces TacoMAS as an online graph adaptation framework with an explicit fast capability-update loop and slow meta-LLM topology loop, then reports measured performance gains (13.3% average) on four benchmarks against external baselines. The theoretical statement that the fast-slow design reaches a task-conditioned stable equilibrium is asserted but not derived from any equation or self-referential definition inside the paper; no parameter is fitted to a subset of results and then renamed as a prediction, and no self-citation chain is invoked to justify uniqueness or stability. All load-bearing claims therefore remain externally falsifiable via the released code and benchmark numbers rather than being equivalent to the framework's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Trajectory-level feedback from task execution is a reliable and sufficient signal for updating agent capabilities at test time.
- domain assumption A meta-LLM can perform agent birth-death and edge-edit operations that improve coordination without destabilizing the system.