Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

Can Wang; Hande Dong; Hong Wang; huanshuo dong; Jiawei Chen; Qiang Lin; Tianfu Wang; Xiankun Lin; Zhezheng Hao; Ziyan Liu

arxiv: 2605.29790 · v1 · pith:KYNE77WYnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

Zhezheng Hao , Tianfu Wang , Huanshuo Dong , Ziyan Liu , Hong Wang , Xiankun Lin , Qiang Lin , Can Wang

show 2 more authors

Hande Dong Jiawei Chen

This is my paper

Pith reviewed 2026-06-29 00:03 UTC · model grok-4.3

classification 💻 cs.MA cs.AI

keywords multi-agent systemsself-evolutionLLM-based agentscollaborative evolutionlong-horizon tasksexperience-driven MAS

0 comments

The pith

Meta-Team lets multi-agent LLM systems improve through collaborative self-evolution from their execution experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLM-based multi-agent systems can evolve effectively by using collaborative self-evolution. It preserves each agent's execution context and coordinates post-task communication to let agents exchange distributed evidence from their intertwined execution chains. This evidence is then used for multi-scale self-evolution that creates reusable improvements in individual behaviors, coordination between agents, and overall team organization. A sympathetic reader would care because current MAS designs struggle with real-world failures that are hard to fix upfront, and this method turns those failures into systematic gains. Experiments show it beats single agents, fixed MAS designs, and earlier evolution techniques on six long-horizon benchmarks.

Core claim

Meta-Team is an experience-driven MAS evolution framework based on collaborative self-evolution that preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution and conducting multi-scale self-evolution to transform execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization.

What carries the argument

The collaborative self-evolution process that relies on context preservation and post-task communication coordination to turn intricate MAS execution experience into multi-scale improvements.

If this is right

Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods across six long-horizon agent benchmarks.
Further analyses show that Meta-Team enables more reliable and scalable MAS self-evolution.
Execution experience that interleaves multiple agents' chains and messages can be transformed into reusable improvements at multiple scales.
The design addresses the difficulty of identifying what should be improved in prolonged and intricate MAS experience.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Over repeated tasks, the team organization improvements could lead to emergent division of labor among agents.
The method might extend to non-LLM agent systems if they can generate similar execution traces.
Reducing reliance on hand-crafted MAS designs could accelerate deployment in dynamic environments.

Load-bearing premise

That preserving each agent's execution context and coordinating post-task communication will allow agents to exchange distributed evidence in a way that produces effective multi-scale improvements.

What would settle it

A controlled experiment on the six benchmarks where Meta-Team shows no performance gain over baselines, or where disabling the post-task communication coordination removes all observed benefits.

Figures

Figures reproduced from arXiv: 2605.29790 by Can Wang, Hande Dong, Hong Wang, huanshuo dong, Jiawei Chen, Qiang Lin, Tianfu Wang, Xiankun Lin, Zhezheng Hao, Ziyan Liu.

**Figure 2.** Figure 2: Overview of Meta-Team. Bottom-Up Self-Evolution. Collaborative self-evolution also enables agents to surface system-level improvements. Since agents observe different parts of the execution, they can identify team-level issues such as missing roles, redundant responsibilities, unclear handoffs, or ineffective coordination rules. Following the principle of team reflexivity [56, 68], the team can aggregate t… view at source ↗

**Figure 3.** Figure 3: Experiments on scaling context and out-of-distribution evaluation set. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Experiments with constrained budget. for MAS improvement. L2 and L3 show benchmark-dependent effects: L2 contributes more on Ansible, where structured software tasks rely on precise coordination, while L3 contributes more on ResearchRubrics, where open-ended evaluation benefits from better team organization. Together, these ablation studies demonstrate that Meta-Team’s gains come from both collaborative ex… view at source ↗

read the original abstract

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Meta-Team adds context preservation and post-task coordination to MAS self-evolution, but the mechanism for turning interleaved traces into concrete multi-scale improvements still needs clearer evidence.

read the letter

The core contribution is a design that keeps each agent's execution history intact and adds a coordinated post-task discussion step so agents can pool distributed signals before updating behaviors, coordination rules, and team structure. This is a reasonable extension of earlier single-agent or hand-crafted MAS evolution work, and the abstract's claim of consistent gains across six long-horizon benchmarks is the kind of result that would matter if the gains hold up under scrutiny.

What the paper does cleanly is name the real difficulty: MAS traces are long, interleaved, and hard to parse for reusable lessons. The proposed fix—preserve context plus structured post-task exchange—is a concrete attempt to make that parsing collaborative rather than left to a single LLM.

The soft spot is the one the stress-test flags. The abstract asserts that the communication step lets agents exchange evidence and produce effective updates at three scales, yet it does not show how agents actually isolate which parts of the interleaved chain caused a failure or how they convert that into a reusable change rather than another prompt. If the full paper only reports end-to-end benchmark wins without ablations on the communication step or examples of the extracted improvements, the attribution to the collaborative mechanism remains open. That is the load-bearing assumption.

The work is aimed at groups already building or evaluating LLM multi-agent systems for complex tasks and who want an experience-driven alternative to static prompting or hand-tuning. It is worth sending to referees because the problem is practical and the proposed combination is new enough to test, even if the current write-up leaves the extraction step underspecified. A serious review would focus on whether the post-task coordination demonstrably reduces the identification problem or simply adds another layer of prompting.

Referee Report

3 major / 2 minor

Summary. The paper proposes Meta-Team, an experience-driven evolution framework for LLM-based multi-agent systems (MAS). It preserves each agent's execution context and coordinates post-task communication so that agents can exchange distributed evidence from intricate interleaved execution chains. This design supports multi-scale self-evolution that produces reusable improvements to individual agent behaviors, inter-agent coordination, and team-level organization. The framework is evaluated on six long-horizon agent benchmarks and is reported to outperform single-agent baselines, hand-crafted MAS, and prior MAS evolution methods.

Significance. If the reported gains are robust and attributable to the collaborative mechanism, the work would offer a practical route to scalable, experience-driven improvement of MAS without requiring extensive manual redesign. The multi-scale framing (behavior, coordination, organization) directly targets a recognized difficulty in long-horizon MAS and could influence subsequent research on self-improving agent teams.

major comments (3)

[Section 3] Section 3 (Meta-Team framework): The central claim that preserving execution context plus coordinated post-task communication enables agents to extract reusable improvements from interleaved chains is load-bearing, yet the manuscript provides no concrete algorithm, pseudocode, or worked example showing how agents identify and distill distributed evidence into multi-scale updates rather than simply producing additional prompts. Without this, attribution of benchmark gains to the collaborative mechanism remains unverified.
[Section 4] Section 4 (Experiments): The abstract states consistent outperformance across six benchmarks, but the manuscript does not report ablation variants that disable post-task communication while keeping other components fixed. Such an ablation is required to test whether the identification problem is actually mitigated by the proposed coordination step.
[Section 4.3] Section 4.3 (Analysis): The claim of more reliable and scalable self-evolution is supported only by aggregate success rates; no per-benchmark breakdown of failure modes before versus after evolution, nor any measure of how often post-task messages actually surface actionable improvements, is provided.

minor comments (2)

Notation for the three evolution scales (agent, coordination, organization) is introduced without a compact summary table; adding one would improve readability.
The related-work section cites prior MAS evolution methods but does not explicitly contrast the post-task communication design with the communication protocols used in those baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of Meta-Team. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and evidence.

read point-by-point responses

Referee: [Section 3] Section 3 (Meta-Team framework): The central claim that preserving execution context plus coordinated post-task communication enables agents to extract reusable improvements from interleaved chains is load-bearing, yet the manuscript provides no concrete algorithm, pseudocode, or worked example showing how agents identify and distill distributed evidence into multi-scale updates rather than simply producing additional prompts. Without this, attribution of benchmark gains to the collaborative mechanism remains unverified.

Authors: We agree that Section 3 would benefit from greater concreteness. While the section describes the preservation of execution context and the coordination of post-task communication at a conceptual level, it does not include pseudocode or a worked example of the distillation process. We will add both an algorithm box outlining the multi-scale update procedure and a detailed worked example from one benchmark to illustrate how distributed evidence is identified and transformed into reusable improvements at each scale. revision: yes
Referee: [Section 4] Section 4 (Experiments): The abstract states consistent outperformance across six benchmarks, but the manuscript does not report ablation variants that disable post-task communication while keeping other components fixed. Such an ablation is required to test whether the identification problem is actually mitigated by the proposed coordination step.

Authors: We concur that an ablation isolating the post-task communication component is necessary to attribute gains specifically to the collaborative mechanism. We will add this ablation (disabling coordinated post-task messages while retaining context preservation and the evolution pipeline) and report the resulting performance on the six benchmarks in the revised experiments section. revision: yes
Referee: [Section 4.3] Section 4.3 (Analysis): The claim of more reliable and scalable self-evolution is supported only by aggregate success rates; no per-benchmark breakdown of failure modes before versus after evolution, nor any measure of how often post-task messages actually surface actionable improvements, is provided.

Authors: We acknowledge the limitation in the current analysis. We will expand Section 4.3 to include (i) per-benchmark tables or figures showing failure-mode distributions before and after evolution and (ii) statistics on the frequency and actionability of improvements derived from post-task messages. These additions will provide more granular support for the reliability and scalability claims. revision: yes

Circularity Check

0 steps flagged

No circularity: framework design with no equations or fitted predictions

full rationale

The paper describes a proposed MAS evolution framework (Meta-Team) that preserves execution context and uses post-task communication for multi-scale improvements. No equations, parameters, or quantitative derivations appear in the provided text. The central claims rest on the design choices and empirical benchmark results rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The absence of mathematical structure means none of the enumerated circularity patterns can be exhibited by direct quote and reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no mathematical model, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5761 in / 970 out tokens · 25837 ms · 2026-06-29T00:03:36.987779+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders
cs.IR 2026-06 unverdicted novelty 7.0

Transformer recommenders amplify popularity bias via spectral collapse when scaled; SPRINT constrains attention column-sums and feed-forward spectral norms to improve fairness and scaling behavior.

Reference graph

Works this paper leans on

90 extracted references · 42 canonical work pages · cited by 1 Pith paper · 23 internal anchors

[1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Why does the effective context length of llms fall short? InICLR, 2025

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? InICLR, 2025

2025
[3]

How we built our multi-agent research system, 2025

Anthropic. How we built our multi-agent research system, 2025. https://www.anthropic. com/engineering/built-multi-agent-research-system

2025
[4]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026

2026
[5]

Orchestrate teams of Claude Code sessions

Anthropic. Orchestrate teams of Claude Code sessions. https://code.claude.com/docs/ en/agent-teams, 2026

2026
[6]

Claude Code Overview

Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026

2026
[7]

Introducing claude sonnet 4.6

Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, 2026

2026
[8]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Where did it all go wrong? a hierarchical look into multi-agent error attribution

Adi Banerjee, Anirudh Nair, and Tarik Borogovac. Where did it all go wrong? a hierarchical look into multi-agent error attribution. InNeurIPS Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025

2025
[10]

Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

2025
[11]

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, et al. Beyondswe: Can current code agent survive beyond single-repo bug fixing?arXiv preprint arXiv:2603.03194, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, and Shirui Pan. Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

work page arXiv 2026
[13]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, and Qing Wang. Seeing the whole elephant: A benchmark for failure attribution in llm-based multi-agent systems.arXiv preprint arXiv:2604.22708, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, et al. Scoring, reasoning, and selecting the best! ensembling large language models via a peer-review process.arXiv preprint arXiv:2512.23213, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

work page arXiv 2025
[20]

A new era of intelligence with gemini 3

Google. A new era of intelligence with gemini 3. https://blog.google/products/ gemini/gemini-3/, 2025

2025
[21]

Large language model based multi-agents: A survey of progress and challenges

T Guo, X Chen, Y Wang, R Chang, S Pei, NV Chawla, O Wiest, and X Zhang. Large language model based multi-agents: A survey of progress and challenges. In33rd International Joint Conference on Artificial Intelligence (IJCAI 2024). IJCAI; Cornell arxiv, 2024

2024
[22]

ReCreate: Reasoning and Creating Domain Agents Driven by Experience

Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience.arXiv preprint arXiv:2601.11100, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

2021
[24]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021
[25]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InICLR, 2024

2024
[26]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation

Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, et al. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[28]

Automated design of agentic systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InICLR, 2025

2025
[29]

EvoMAS: Evolutionary Generation of Multi-Agent Systems

Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, and Stefano Soatto. Evolutionary generation of multi-agent systems.arXiv preprint arXiv:2602.06511, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, and Chanyoung Park. Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

work page arXiv 2026
[31]

Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems.arXiv preprint arXiv:2509.14295, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

work page arXiv 2026
[33]

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks.arXiv preprint arXiv:2604.11753, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

2023
[35]

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Jiazheng Li, Emine Yilmaz, Bei Chen, and Dieu-Thu Le. Towards self-improving error diagnosis in multi-agent systems.arXiv preprint arXiv:2604.17658, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, et al. Agencybench: Benchmarking the frontiers of autonomous agents in 1m-token real-world contexts.arXiv preprint arXiv:2601.11044, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23142–23150, 2026

2026
[38]

A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

2024
[39]

Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, et al. Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

work page arXiv 2025
[40]

Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, and Yong Li. Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

work page arXiv 2025
[41]

Openmanus: An open-source framework for building general ai agents, 2025

Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, Sirui Hong, Sheng Fan, Xiao Tang, Bang Liu, Yuyu Luo, and Chenglin Wu. Openmanus: An open-source framework for building general ai agents, 2025. URLhttps://doi.org/10.5281/zenodo.15186407

work page doi:10.5281/zenodo.15186407 2025
[42]

Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

Bohan Lin, Kuo Yang, Zelin Tan, Yingchuan Lai, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yudong Zhang, et al. Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

work page arXiv 2025
[43]

Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

work page arXiv 2025
[44]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

Guangyi Liu, Haojun Lin, Huan Zeng, Heng Wang, and Quanming Yao. Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

work page arXiv 2026
[46]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024
[47]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023
[48]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023
[49]

Introducing gpt-5.4

Open AI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4 , 2026. 12

2026
[50]

Multi-Agent Teams Hold Experts Back

Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back.arXiv preprint arXiv:2602.01011, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

Multi-agent coordination patterns: Five approaches and when to use them

Cara Phillips. Multi-agent coordination patterns: Five approaches and when to use them. https://claude.com/blog/multi-agent-coordination-patterns , April 2026. An- thropic Blog, Accessed: 2026-05-05

2026
[52]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

2024
[53]

Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, et al. Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

work page arXiv 2025
[54]

Solving general arithmetic word problems

Subhro Roy and Dan Roth. Solving general arithmetic word problems. InProceedings of the 2015 conference on empirical methods in natural language processing, pages 1743–1752, 2015

2015
[55]

Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, et al. Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

work page arXiv 2026
[56]

Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

Michaéla C Schippers, Michael A West, and Jeremy F Dawson. Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

2015
[57]

United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, and Song Guo. United minds or isolated agents? exploring coordination of llms under cognitive load theory.arXiv preprint arXiv:2506.06843, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Agentsquare: Automatic llm agent search in modular design space

Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space. InICLR, 2025

2025
[59]

Hendryx, Brad Kenstler, and Bing Liu

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685, 2025

work page arXiv 2025
[60]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[61]

Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, and Chi Wang. Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

work page arXiv 2024
[62]

Coact-1: Computer-using agents with coding as actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions. InICLR, 2026

2026
[63]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[64]

Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Dat Tran and Douwe Kiela. Single-agent llms outperform multi-agent systems on multi-hop reasoning under equal thinking token budgets.arXiv preprint arXiv:2604.02460, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

Automated stateful specialization for adaptive agent systems

Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, and Mayank Goel. Automated stateful specialization for adaptive agent systems. InICLR, 2026

2026
[66]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024. 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

work page arXiv 2025
[68]

Reflexivity and work group effectiveness: A conceptual integration

Michael West. Reflexivity and work group effectiveness: A conceptual integration. InThe handbook of work group psychology, pages 555–579. John Wiley & Sons, Ltd, 1996

1996
[69]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

2024
[70]

Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

2024
[71]

When does divide and conquer work for long context llm? a noise decomposition framework

Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, and Ce Zhang. When does divide and conquer work for long context llm? a noise decomposition framework. InICLR, 2026

2026
[72]

Don’t build multi-agents

Walden Yan. Don’t build multi-agents. https://cognition.ai/blog/ dont-build-multi-agents, June 2025. Cognition AI Blog, Accessed: 2026-05-05

2025
[73]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

2024
[74]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,
[75]

URLhttps://arxiv.org/abs/2504.00587

work page arXiv
[76]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

2023
[77]

Masrouter: Learning to route llms for multi-agent systems

Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

2025
[78]

Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

work page arXiv 2026
[79]

Multi-agent architecture search via agentic supernet

Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet. InInternational Conference on Machine Learning, pages 75834–75852. PMLR, 2025

2025
[80]

Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

work page arXiv 2025

Showing first 80 references.

[1] [1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Why does the effective context length of llms fall short? InICLR, 2025

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short? InICLR, 2025

2025

[3] [3]

How we built our multi-agent research system, 2025

Anthropic. How we built our multi-agent research system, 2025. https://www.anthropic. com/engineering/built-multi-agent-research-system

2025

[4] [4]

Introducing claude opus 4.6

Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6, 2026

2026

[5] [5]

Orchestrate teams of Claude Code sessions

Anthropic. Orchestrate teams of Claude Code sessions. https://code.claude.com/docs/ en/agent-teams, 2026

2026

[6] [6]

Claude Code Overview

Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026

2026

[7] [7]

Introducing claude sonnet 4.6

Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/ claude-sonnet-4-6, 2026

2026

[8] [8]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

Where did it all go wrong? a hierarchical look into multi-agent error attribution

Adi Banerjee, Anirudh Nair, and Tarik Borogovac. Where did it all go wrong? a hierarchical look into multi-agent error attribution. InNeurIPS Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025

2025

[10] [10]

Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail? InNeurIPS Datasets and Benchmarks Track, 2025

2025

[11] [11]

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, et al. Beyondswe: Can current code agent survive beyond single-repo bug fixing?arXiv preprint arXiv:2603.03194, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, and Shirui Pan. Goagent: Group-of-agents communication topology generation for llm-based multi-agent systems.arXiv preprint arXiv:2603.19677, 2026

work page arXiv 2026

[13] [13]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, and Qing Wang. Seeing the whole elephant: A benchmark for failure attribution in llm-based multi-agent systems.arXiv preprint arXiv:2604.22708, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, et al. Scoring, reasoning, and selecting the best! ensembling large language models via a peer-review process.arXiv preprint arXiv:2512.23213, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[17] [17]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

work page arXiv 2025

[20] [20]

A new era of intelligence with gemini 3

Google. A new era of intelligence with gemini 3. https://blog.google/products/ gemini/gemini-3/, 2025

2025

[21] [21]

Large language model based multi-agents: A survey of progress and challenges

T Guo, X Chen, Y Wang, R Chang, S Pei, NV Chawla, O Wiest, and X Zhang. Large language model based multi-agents: A survey of progress and challenges. In33rd International Joint Conference on Artificial Intelligence (IJCAI 2024). IJCAI; Cornell arxiv, 2024

2024

[22] [22]

ReCreate: Reasoning and Creating Domain Agents Driven by Experience

Zhezheng Hao, Hong Wang, Jian Luo, Jianqing Zhang, Yuyan Zhou, Qiang Lin, Can Wang, Hande Dong, and Jiawei Chen. Recreate: Reasoning and creating domain agents driven by experience.arXiv preprint arXiv:2601.11100, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

2021

[24] [24]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021

[25] [25]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InICLR, 2024

2024

[26] [26]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation

Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Ziyu Ye, Bowei Xia, Tao Sun, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, et al. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[28] [28]

Automated design of agentic systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InICLR, 2025

2025

[29] [29]

EvoMAS: Evolutionary Generation of Multi-Agent Systems

Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, and Stefano Soatto. Evolutionary generation of multi-agent systems.arXiv preprint arXiv:2602.06511, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [30]

Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

Yeonjun In, Mehrab Tanjim, Jayakumar Subramanian, Sungchul Kim, Uttaran Bhattacharya, Wonjoong Kim, Sangwu Park, Somdeb Sarkhel, and Chanyoung Park. Rethinking failure attribution in multi-agent systems: A multi-perspective benchmark and evaluation.arXiv preprint arXiv:2603.25001, 2026

work page arXiv 2026

[31] [31]

Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems.arXiv preprint arXiv:2509.14295, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, and Zhongxiang Dai. Workflow-r1: Group sub-sequence policy optimization for multi-turn workflow construction.arXiv preprint arXiv:2602.01202, 2026

work page arXiv 2026

[33] [33]

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks.arXiv preprint arXiv:2604.11753, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

2023

[35] [35]

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Jiazheng Li, Emine Yilmaz, Bei Chen, and Dieu-Thu Le. Towards self-improving error diagnosis in multi-agent systems.arXiv preprint arXiv:2604.17658, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, et al. Agencybench: Benchmarking the frontiers of autonomous agents in 1m-token real-world contexts.arXiv preprint arXiv:2601.11044, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[37] [37]

Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23142–23150, 2026

2026

[38] [38]

A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges.Vicinagearth, 1(1):9, 2024

2024

[39] [39]

Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, et al. Cross-task experiential learning on llm-based multi-agent collaboration.arXiv preprint arXiv:2505.23187, 2025

work page arXiv 2025

[40] [40]

Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, and Yong Li. Agentswift: Efficient llm agent design via value-guided hierarchical search.arXiv preprint arXiv:2506.06017, 2025

work page arXiv 2025

[41] [41]

Openmanus: An open-source framework for building general ai agents, 2025

Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, Sirui Hong, Sheng Fan, Xiao Tang, Bang Liu, Yuyu Luo, and Chenglin Wu. Openmanus: An open-source framework for building general ai agents, 2025. URLhttps://doi.org/10.5281/zenodo.15186407

work page doi:10.5281/zenodo.15186407 2025

[42] [42]

Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

Bohan Lin, Kuo Yang, Zelin Tan, Yingchuan Lai, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yudong Zhang, et al. Agentask: Multi-agent systems need to ask.arXiv preprint arXiv:2510.07593, 2025

work page arXiv 2025

[43] [43]

Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, et al. Se-agent: Self-evolution trajectory op- timization in multi-step reasoning with llm-based agents.arXiv preprint arXiv:2508.02085, 2025

work page arXiv 2025

[44] [44]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

Guangyi Liu, Haojun Lin, Huan Zeng, Heng Wang, and Quanming Yao. Mas-on-the- fly: Dynamic adaptation of llm-based multi-agent systems at test time.arXiv preprint arXiv:2602.13671, 2026

work page arXiv 2026

[46] [46]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024

[47] [47]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023

[48] [48]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

2023

[49] [49]

Introducing gpt-5.4

Open AI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4 , 2026. 12

2026

[50] [50]

Multi-Agent Teams Hold Experts Back

Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back.arXiv preprint arXiv:2602.01011, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[51] [51]

Multi-agent coordination patterns: Five approaches and when to use them

Cara Phillips. Multi-agent coordination patterns: Five approaches and when to use them. https://claude.com/blog/multi-agent-coordination-patterns , April 2026. An- thropic Blog, Accessed: 2026-05-05

2026

[52] [52]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

2024

[53] [53]

Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, et al. Locobench: A benchmark for long-context large language models in complex software engineering.arXiv preprint arXiv:2509.09614, 2025

work page arXiv 2025

[54] [54]

Solving general arithmetic word problems

Subhro Roy and Dan Roth. Solving general arithmetic word problems. InProceedings of the 2015 conference on empirical methods in natural language processing, pages 1743–1752, 2015

2015

[55] [55]

Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, et al. Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

work page arXiv 2026

[56] [56]

Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

Michaéla C Schippers, Michael A West, and Jeremy F Dawson. Team reflexivity and innovation: The moderating role of team context.Journal of Management, 41(3):769–788, 2015

2015

[57] [57]

United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, and Song Guo. United minds or isolated agents? exploring coordination of llms under cognitive load theory.arXiv preprint arXiv:2506.06843, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Agentsquare: Automatic llm agent search in modular design space

Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space. InICLR, 2025

2025

[59] [59]

Hendryx, Brad Kenstler, and Bing Liu

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, et al. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.07685, 2025

work page arXiv 2025

[60] [60]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023

[61] [61]

Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, and Chi Wang. Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

work page arXiv 2024

[62] [62]

Coact-1: Computer-using agents with coding as actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, et al. Coact-1: Computer-using agents with coding as actions. InICLR, 2026

2026

[63] [63]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[64] [64]

Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

Dat Tran and Douwe Kiela. Single-agent llms outperform multi-agent systems on multi-hop reasoning under equal thinking token budgets.arXiv preprint arXiv:2604.02460, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[65] [65]

Automated stateful specialization for adaptive agent systems

Myan Vu, Harrish Ayyanar, PANG JIANG, Anwiketh Reddy, and Mayank Goel. Automated stateful specialization for adaptive agent systems. InICLR, 2026

2026

[66] [66]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024. 13

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [67]

Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, and Bryon Aragam. Scoreflow: Mastering llm agent workflows via score-based preference optimization.arXiv preprint arXiv:2502.04306, 2025

work page arXiv 2025

[68] [68]

Reflexivity and work group effectiveness: A conceptual integration

Michael West. Reflexivity and work group effectiveness: A conceptual integration. InThe handbook of work group psychology, pages 555–579. John Wiley & Sons, Ltd, 1996

1996

[69] [69]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. InFirst conference on language modeling, 2024

2024

[70] [70]

Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with ease? efficient ordering-aware automated selection of exemplars.Advances in Neural Information Processing Systems, 37: 122706–122740, 2024

2024

[71] [71]

When does divide and conquer work for long context llm? a noise decomposition framework

Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, and Ce Zhang. When does divide and conquer work for long context llm? a noise decomposition framework. InICLR, 2026

2026

[72] [72]

Don’t build multi-agents

Walden Yan. Don’t build multi-agents. https://cognition.ai/blog/ dont-build-multi-agents, June 2025. Cognition AI Blog, Accessed: 2026-05-05

2025

[73] [73]

Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

2024

[74] [74]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems,

[75] [75]

URLhttps://arxiv.org/abs/2504.00587

work page arXiv

[76] [76]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

2023

[77] [77]

Masrouter: Learning to route llms for multi-agent systems

Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, and Yiyan Qi. Masrouter: Learning to route llms for multi-agent systems. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15549–15572, 2025

2025

[78] [78]

Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

Weihao Zeng, Yuzhen Huang, and Junxian He. Loca-bench: Benchmarking language agents under controllable and extreme context growth.arXiv preprint arXiv:2602.07962, 2026

work page arXiv 2026

[79] [79]

Multi-agent architecture search via agentic supernet

Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet. InInternational Conference on Machine Learning, pages 75834–75852. PMLR, 2025

2025

[80] [80]

Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

work page arXiv 2025