pith. sign in

arxiv: 2605.29790 · v1 · pith:KYNE77WYnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

Pith reviewed 2026-06-29 00:03 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords multi-agent systemsself-evolutionLLM-based agentscollaborative evolutionlong-horizon tasksexperience-driven MAS
0
0 comments X

The pith

Meta-Team lets multi-agent LLM systems improve through collaborative self-evolution from their execution experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLM-based multi-agent systems can evolve effectively by using collaborative self-evolution. It preserves each agent's execution context and coordinates post-task communication to let agents exchange distributed evidence from their intertwined execution chains. This evidence is then used for multi-scale self-evolution that creates reusable improvements in individual behaviors, coordination between agents, and overall team organization. A sympathetic reader would care because current MAS designs struggle with real-world failures that are hard to fix upfront, and this method turns those failures into systematic gains. Experiments show it beats single agents, fixed MAS designs, and earlier evolution techniques on six long-horizon benchmarks.

Core claim

Meta-Team is an experience-driven MAS evolution framework based on collaborative self-evolution that preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution and conducting multi-scale self-evolution to transform execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization.

What carries the argument

The collaborative self-evolution process that relies on context preservation and post-task communication coordination to turn intricate MAS execution experience into multi-scale improvements.

If this is right

  • Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods across six long-horizon agent benchmarks.
  • Further analyses show that Meta-Team enables more reliable and scalable MAS self-evolution.
  • Execution experience that interleaves multiple agents' chains and messages can be transformed into reusable improvements at multiple scales.
  • The design addresses the difficulty of identifying what should be improved in prolonged and intricate MAS experience.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Over repeated tasks, the team organization improvements could lead to emergent division of labor among agents.
  • The method might extend to non-LLM agent systems if they can generate similar execution traces.
  • Reducing reliance on hand-crafted MAS designs could accelerate deployment in dynamic environments.

Load-bearing premise

That preserving each agent's execution context and coordinating post-task communication will allow agents to exchange distributed evidence in a way that produces effective multi-scale improvements.

What would settle it

A controlled experiment on the six benchmarks where Meta-Team shows no performance gain over baselines, or where disabling the post-task communication coordination removes all observed benefits.

Figures

Figures reproduced from arXiv: 2605.29790 by Can Wang, Hande Dong, Hong Wang, huanshuo dong, Jiawei Chen, Qiang Lin, Tianfu Wang, Xiankun Lin, Zhezheng Hao, Ziyan Liu.

Figure 1
Figure 1. Figure 1: Three different schemes of experience-driven MAS self-evolution. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Meta-Team. Bottom-Up Self-Evolution. Collaborative self-evolution also enables agents to surface system-level improvements. Since agents observe different parts of the execution, they can identify team-level issues such as missing roles, redundant responsibilities, unclear handoffs, or ineffective coordination rules. Following the principle of team reflexivity [56, 68], the team can aggregate t… view at source ↗
Figure 3
Figure 3. Figure 3: Experiments on scaling context and out-of-distribution evaluation set. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Experiments with constrained budget. for MAS improvement. L2 and L3 show benchmark-dependent effects: L2 contributes more on Ansible, where structured software tasks rely on precise coordination, while L3 contributes more on ResearchRubrics, where open-ended evaluation benefits from better team organization. Together, these ablation studies demonstrate that Meta-Team’s gains come from both collaborative ex… view at source ↗
read the original abstract

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Meta-Team, an experience-driven evolution framework for LLM-based multi-agent systems (MAS). It preserves each agent's execution context and coordinates post-task communication so that agents can exchange distributed evidence from intricate interleaved execution chains. This design supports multi-scale self-evolution that produces reusable improvements to individual agent behaviors, inter-agent coordination, and team-level organization. The framework is evaluated on six long-horizon agent benchmarks and is reported to outperform single-agent baselines, hand-crafted MAS, and prior MAS evolution methods.

Significance. If the reported gains are robust and attributable to the collaborative mechanism, the work would offer a practical route to scalable, experience-driven improvement of MAS without requiring extensive manual redesign. The multi-scale framing (behavior, coordination, organization) directly targets a recognized difficulty in long-horizon MAS and could influence subsequent research on self-improving agent teams.

major comments (3)
  1. [Section 3] Section 3 (Meta-Team framework): The central claim that preserving execution context plus coordinated post-task communication enables agents to extract reusable improvements from interleaved chains is load-bearing, yet the manuscript provides no concrete algorithm, pseudocode, or worked example showing how agents identify and distill distributed evidence into multi-scale updates rather than simply producing additional prompts. Without this, attribution of benchmark gains to the collaborative mechanism remains unverified.
  2. [Section 4] Section 4 (Experiments): The abstract states consistent outperformance across six benchmarks, but the manuscript does not report ablation variants that disable post-task communication while keeping other components fixed. Such an ablation is required to test whether the identification problem is actually mitigated by the proposed coordination step.
  3. [Section 4.3] Section 4.3 (Analysis): The claim of more reliable and scalable self-evolution is supported only by aggregate success rates; no per-benchmark breakdown of failure modes before versus after evolution, nor any measure of how often post-task messages actually surface actionable improvements, is provided.
minor comments (2)
  1. Notation for the three evolution scales (agent, coordination, organization) is introduced without a compact summary table; adding one would improve readability.
  2. The related-work section cites prior MAS evolution methods but does not explicitly contrast the post-task communication design with the communication protocols used in those baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of Meta-Team. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and evidence.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Meta-Team framework): The central claim that preserving execution context plus coordinated post-task communication enables agents to extract reusable improvements from interleaved chains is load-bearing, yet the manuscript provides no concrete algorithm, pseudocode, or worked example showing how agents identify and distill distributed evidence into multi-scale updates rather than simply producing additional prompts. Without this, attribution of benchmark gains to the collaborative mechanism remains unverified.

    Authors: We agree that Section 3 would benefit from greater concreteness. While the section describes the preservation of execution context and the coordination of post-task communication at a conceptual level, it does not include pseudocode or a worked example of the distillation process. We will add both an algorithm box outlining the multi-scale update procedure and a detailed worked example from one benchmark to illustrate how distributed evidence is identified and transformed into reusable improvements at each scale. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): The abstract states consistent outperformance across six benchmarks, but the manuscript does not report ablation variants that disable post-task communication while keeping other components fixed. Such an ablation is required to test whether the identification problem is actually mitigated by the proposed coordination step.

    Authors: We concur that an ablation isolating the post-task communication component is necessary to attribute gains specifically to the collaborative mechanism. We will add this ablation (disabling coordinated post-task messages while retaining context preservation and the evolution pipeline) and report the resulting performance on the six benchmarks in the revised experiments section. revision: yes

  3. Referee: [Section 4.3] Section 4.3 (Analysis): The claim of more reliable and scalable self-evolution is supported only by aggregate success rates; no per-benchmark breakdown of failure modes before versus after evolution, nor any measure of how often post-task messages actually surface actionable improvements, is provided.

    Authors: We acknowledge the limitation in the current analysis. We will expand Section 4.3 to include (i) per-benchmark tables or figures showing failure-mode distributions before and after evolution and (ii) statistics on the frequency and actionability of improvements derived from post-task messages. These additions will provide more granular support for the reliability and scalability claims. revision: yes

Circularity Check

0 steps flagged

No circularity: framework design with no equations or fitted predictions

full rationale

The paper describes a proposed MAS evolution framework (Meta-Team) that preserves execution context and uses post-task communication for multi-scale improvements. No equations, parameters, or quantitative derivations appear in the provided text. The central claims rest on the design choices and empirical benchmark results rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The absence of mathematical structure means none of the enumerated circularity patterns can be exhibited by direct quote and reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no mathematical model, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5761 in / 970 out tokens · 25837 ms · 2026-06-29T00:03:36.987779+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders

    cs.IR 2026-06 unverdicted novelty 7.0

    Transformer recommenders amplify popularity bias via spectral collapse when scaled; SPRINT constrains attention column-sums and feed-forward spectral norms to improve fairness and scaling behavior.

Reference graph

Works this paper leans on

90 extracted references · 42 canonical work pages · cited by 1 Pith paper · 23 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  2. [2]

    Why does the effective context length of llms fall short? InICLR, 2025

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? InICLR, 2025

  3. [3]

    How we built our multi-agent research system, 2025

    Anthropic. How we built our multi-agent research system, 2025. https://www.anthropic. com/engineering/built-multi-agent-research-system

  4. [4]

    Introducing claude opus 4.6

    Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026

  5. [5]

    Orchestrate teams of Claude Code sessions

    Anthropic. Orchestrate teams of Claude Code sessions. https://code.claude.com/docs/ en/agent-teams, 2026

  6. [6]

    Claude Code Overview

    Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026

  7. [7]

    Introducing claude sonnet 4.6

    Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, 2026

  8. [8]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  9. [9]

    Where did it all go wrong? a hierarchical look into multi-agent error attribution

    Adi Banerjee, Anirudh Nair, and Tarik Borogovac. Where did it all go wrong? a hierarchical look into multi-agent error attribution. InNeurIPS Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025

  10. [10]

    Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

  11. [11]

    BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

    Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, et al. Beyondswe: Can current code agent survive beyond single-repo bug fixing?arXiv preprint arXiv:2603.03194, 2026

  12. [12]

    Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

    Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, and Shirui Pan. Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

  13. [13]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  14. [14]

    Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

    Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, and Qing Wang. Seeing the whole elephant: A benchmark for failure attribution in llm-based multi-agent systems.arXiv preprint arXiv:2604.22708, 2026

  15. [15]

    Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

    Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, et al. Scoring, reasoning, and selecting the best! ensembling large language models via a peer-review process.arXiv preprint arXiv:2512.23213, 2025

  16. [16]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  17. [17]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025. 10

  18. [18]

    Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024

  19. [19]

    Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

    Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

  20. [20]

    A new era of intelligence with gemini 3

    Google. A new era of intelligence with gemini 3. https://blog.google/products/ gemini/gemini-3/, 2025

  21. [21]

    Large language model based multi-agents: A survey of progress and challenges

    T Guo, X Chen, Y Wang, R Chang, S Pei, NV Chawla, O Wiest, and X Zhang. Large language model based multi-agents: A survey of progress and challenges. In33rd International Joint Conference on Artificial Intelligence (IJCAI 2024). IJCAI; Cornell arxiv, 2024

  22. [22]

    ReCreate: Reasoning and Creating Domain Agents Driven by Experience

    Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience.arXiv preprint arXiv:2601.11100, 2026

  23. [23]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

  24. [24]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  25. [25]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InICLR, 2024

  26. [26]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  27. [27]

    Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation

    Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, et al. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  28. [28]

    Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InICLR, 2025

  29. [29]

    EvoMAS: Evolutionary Generation of Multi-Agent Systems

    Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, and Stefano Soatto. Evolutionary generation of multi-agent systems.arXiv preprint arXiv:2602.06511, 2026

  30. [30]

    Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

    Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, and Chanyoung Park. Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

  31. [31]

    Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

    Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems.arXiv preprint arXiv:2509.14295, 2025

  32. [32]

    Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

    Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

  33. [33]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks.arXiv preprint arXiv:2604.11753, 2026. 11

  34. [34]

    Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

  35. [35]

    Towards Self-Improving Error Diagnosis in Multi-Agent Systems

    Jiazheng Li, Emine Yilmaz, Bei Chen, and Dieu-Thu Le. Towards self-improving error diagnosis in multi-agent systems.arXiv preprint arXiv:2604.17658, 2026

  36. [36]

    AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

    Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, et al. Agencybench: Benchmarking the frontiers of autonomous agents in 1m-token real-world contexts.arXiv preprint arXiv:2601.11044, 2026

  37. [37]

    Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation

    Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23142–23150, 2026

  38. [38]

    A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

    Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

  39. [39]

    Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

    Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, et al. Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

  40. [40]

    Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

    Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, and Yong Li. Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

  41. [41]

    Openmanus: An open-source framework for building general ai agents, 2025

    Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, Sirui Hong, Sheng Fan, Xiao Tang, Bang Liu, Yuyu Luo, and Chenglin Wu. Openmanus: An open-source framework for building general ai agents, 2025. URLhttps://doi.org/10.5281/zenodo.15186407

  42. [42]

    Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

    Bohan Lin, Kuo Yang, Zelin Tan, Yingchuan Lai, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yudong Zhang, et al. Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

  43. [43]

    Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

    Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

  44. [44]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

  45. [45]

    Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

    Guangyi Liu, Haojun Lin, Huan Zeng, Heng Wang, and Quanming Yao. Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

  46. [46]

    Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

  47. [47]

    Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

  48. [48]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  49. [49]

    Introducing gpt-5.4

    Open AI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4 , 2026. 12

  50. [50]

    Multi-Agent Teams Hold Experts Back

    Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back.arXiv preprint arXiv:2602.01011, 2026

  51. [51]

    Multi-agent coordination patterns: Five approaches and when to use them

    Cara Phillips. Multi-agent coordination patterns: Five approaches and when to use them. https://claude.com/blog/multi-agent-coordination-patterns , April 2026. An- thropic Blog, Accessed: 2026-05-05

  52. [52]

    Chatdev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

  53. [53]

    Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

    Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, et al. Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

  54. [54]

    Solving general arithmetic word problems

    Subhro Roy and Dan Roth. Solving general arithmetic word problems. InProceedings of the 2015 conference on empirical methods in natural language processing, pages 1743–1752, 2015

  55. [55]

    Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

    Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, et al. Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

  56. [56]

    Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

    Michaéla C Schippers, Michael A West, and Jeremy F Dawson. Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

  57. [57]

    United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

    HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, and Song Guo. United minds or isolated agents? exploring coordination of llms under cognitive load theory.arXiv preprint arXiv:2506.06843, 2025

  58. [58]

    Agentsquare: Automatic llm agent search in modular design space

    Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space. InICLR, 2025

  59. [59]

    Hendryx, Brad Kenstler, and Bing Liu

    Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685, 2025

  60. [60]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  61. [61]

    Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

    Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, and Chi Wang. Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

  62. [62]

    Coact-1: Computer-using agents with coding as actions

    Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions. InICLR, 2026

  63. [63]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  64. [64]

    Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

    Dat Tran and Douwe Kiela. Single-agent llms outperform multi-agent systems on multi-hop reasoning under equal thinking token budgets.arXiv preprint arXiv:2604.02460, 2026

  65. [65]

    Automated stateful specialization for adaptive agent systems

    Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, and Mayank Goel. Automated stateful specialization for adaptive agent systems. InICLR, 2026

  66. [66]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024. 13

  67. [67]

    Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

    Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

  68. [68]

    Reflexivity and work group effectiveness: A conceptual integration

    Michael West. Reflexivity and work group effectiveness: A conceptual integration. InThe handbook of work group psychology, pages 555–579. John Wiley & Sons, Ltd, 1996

  69. [69]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

  70. [70]

    Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

    Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

  71. [71]

    When does divide and conquer work for long context llm? a noise decomposition framework

    Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, and Ce Zhang. When does divide and conquer work for long context llm? a noise decomposition framework. InICLR, 2026

  72. [72]

    Don’t build multi-agents

    Walden Yan. Don’t build multi-agents. https://cognition.ai/blog/ dont-build-multi-agents, June 2025. Cognition AI Blog, Accessed: 2026-05-05

  73. [73]

    Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  74. [74]

    Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,

    Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,

  75. [75]

    URLhttps://arxiv.org/abs/2504.00587

  76. [76]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  77. [77]

    Masrouter: Learning to route llms for multi-agent systems

    Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

  78. [78]

    Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

    Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

  79. [79]

    Multi-agent architecture search via agentic supernet

    Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet. InInternational Conference on Machine Learning, pages 75834–75852. PMLR, 2025

  80. [80]

    Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

Showing first 80 references.