pith. machine review for the scientific record.

arxiv: 2605.08769 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent systems · LLM agents · workflow adaptation · execution-time learning · policy gradients · long-horizon tasks · task state construction

The pith

EvoMAS learns to adapt multi-agent LLM workflows stage by stage during execution instead of fixing them upfront.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static, one-shot workflow designs fail for long-horizon tasks because subgoals and information needs change across stages. EvoMAS instead treats workflow construction as an online sequential decision process: at each step it builds an explicit task state via a Planner-Evaluator-Updater loop and lets a learned Workflow Adapter pick a fresh layered combination of agents from a fixed pool. The adapter is trained primarily with policy gradients on whether the whole task eventually succeeds, with separate analysis of process rewards under very sparse conditions. Experiments on GAIA, HLE and DeepResearcher show gains over both single-agent baselines and prior automated multi-agent methods, and the two components—state construction and learned adaptation—provide complementary improvements.

Core claim

EvoMAS formulates multi-agent workflow construction as a meta-level sequential decision problem along a single task trajectory. At each execution stage it constructs an explicit task state through the Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow drawn from a fixed pool of candidate agents; the adapter is trained via policy gradients whose main signal is sparse terminal task success.
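Read as pseudocode, the claimed formulation alternates state construction and state-conditioned adaptation along one trajectory. A minimal sketch of that loop follows; every name here (TaskState, the planner/evaluator/updater callables, the adapter) is an illustrative stand-in, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    """Explicit execution-time task state: subtask plan with status tags."""
    subtasks: list = field(default_factory=list)  # (description, status) pairs

def run_episode(task, planner, evaluator, updater, adapter, pool, max_stages=10):
    """Alternate task-state construction and state-conditioned workflow adaptation."""
    state = planner(task)                    # Planner: initial explicit state
    outcome = None
    for _ in range(max_stages):
        workflow = adapter(state, pool)      # learned Workflow Adapter picks a workflow
        outcome = workflow(task, state)      # base-level multi-agent execution
        verdict = evaluator(outcome, state)  # Evaluator: stage-level assessment
        state = updater(state, verdict)      # Updater: revise subtask statuses
        if all(status == "DONE" for _, status in state.subtasks):
            break
    return outcome, state

# Toy instantiation: two subtasks; each stage completes the active one.
def planner(task):
    return TaskState(subtasks=[("search", "TODO"), ("answer", "TODO")])

def adapter(state, pool):
    return lambda task, st: next(d for d, s in st.subtasks if s == "TODO")

def evaluator(outcome, state):
    return outcome  # verdict names the subtask just finished

def updater(state, verdict):
    state.subtasks = [(d, "DONE" if d == verdict else s) for d, s in state.subtasks]
    return state

outcome, final_state = run_episode("toy task", planner, evaluator, updater, adapter, pool=[])
```

The point of the sketch is only the control flow: the workflow is re-chosen at every stage from the evolving state, rather than fixed once before execution.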

What carries the argument

The Workflow Adapter: a policy trained on terminal success that maps the current explicit task state to a stage-specific selection and layering of agents from a fixed candidate pool.
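One plausible reading of "layered selection of agents from a fixed pool" (consistent with Figure 5's layer-wise selection probabilities) is a per-layer categorical policy over the pool. The sketch below is an assumption-laden illustration: the linear scorer, the featurized task state, and one-agent-per-layer sampling are all simplifications, not the paper's architecture.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class WorkflowAdapter:
    """Toy state-conditioned policy: one distribution over the agent pool per layer."""

    def __init__(self, pool_size, n_layers, state_dim, seed=0):
        rng = random.Random(seed)
        # One linear scorer per layer: score[agent] = w_layer[agent] . state_features
        self.w = [[[rng.gauss(0.0, 0.1) for _ in range(state_dim)]
                   for _ in range(pool_size)] for _ in range(n_layers)]

    def layer_probs(self, state_features):
        probs = []
        for layer_w in self.w:
            scores = [sum(wi * xi for wi, xi in zip(wa, state_features))
                      for wa in layer_w]
            probs.append(softmax(scores))
        return probs  # one categorical distribution per layer

    def sample_workflow(self, state_features, rng):
        # Sample one agent per layer; the real system may select agent subsets.
        return [rng.choices(range(len(p)), weights=p)[0]
                for p in self.layer_probs(state_features)]
```

Because the probabilities are a function of the state features, a changed task state yields changed layer-wise selections, which is the adaptation behavior the paper reports qualitatively.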

If this is right

  • Explicit task-state construction and learned workflow adaptation deliver complementary performance gains on the tested benchmarks.
  • Process rewards provide the largest benefit precisely when terminal success signals are extremely sparse.
  • The system changes its agent coordination patterns as the task state evolves, rather than reusing one structure throughout.
  • The approach outperforms both single-agent baselines and existing automated multi-agent workflow design methods on GAIA, HLE, and DeepResearcher.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the fixed agent pool is rich enough, this style of online adaptation could reduce the need for hand-engineering separate workflows for each new domain.
  • The same meta-decision framing might apply to other sequential agent systems whose optimal structure shifts unpredictably midway through a task.
  • When terminal rewards are dense, the extra machinery of process rewards may be unnecessary.

Load-bearing premise

That training an adapter on sparse terminal success alone, using only a fixed pool of candidate agents, is enough to produce useful stage-specific workflows that generalize to new long-horizon tasks.
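Training on sparse 0/1 terminal success is the standard REINFORCE setting. A toy sketch of one such update with baseline subtraction, shown as generic policy-gradient machinery rather than the paper's exact estimator:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, episodes, lr=0.1):
    """One REINFORCE step on sparse terminal reward.

    theta: logits of a single categorical action distribution.
    episodes: list of (actions_taken, terminal_reward) pairs, reward in {0, 1}.
    """
    rewards = [r for _, r in episodes]
    baseline = sum(rewards) / len(rewards)   # simple mean baseline
    grad = [0.0] * len(theta)
    for actions, r in episodes:
        adv = r - baseline                   # advantage of this episode
        probs = softmax(theta)
        for a in actions:
            # gradient of log pi(a) w.r.t. theta_k is 1[k == a] - pi(k)
            for k in range(len(theta)):
                grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
    return [t + lr * g / len(episodes) for t, g in zip(theta, grad)]

# Action 1 is the only one whose episode succeeds, so its logit should rise.
theta = reinforce_update([0.0, 0.0, 0.0],
                         [([1], 1.0), ([0], 0.0), ([2], 0.0)])
```

The premise above amounts to betting that this one terminal bit, propagated through such an update, is informative enough to shape useful stage-specific workflow choices.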

What would settle it

A set of new long-horizon tasks in which EvoMAS’s stage-by-stage adapted workflows achieve no higher success rate than a single static workflow chosen once at the start of each task.

Figures

Figures reproduced from arXiv: 2605.08769 by Chao Yu, Chengdong Xu, Jiaqi Wei, Kaiqiang Ke, Weile Guo, Zibo Shao, Ziheng Liu.

Figure 1
Figure 1: Meta-level MDP formulation of execution-time multi-agent workflow construction. A workflow G_t is selected based on the meta-state s_t^m, inducing base-level multi-agent execution and an aggregated outcome ξ_t. Orange arrows denote meta-level transitions, and blue arrows denote base-level agent execution.
Figure 2
Figure 2: Overview of EvoMAS. The system alternates between execution-time task state construction and state-conditioned workflow adaptation, enabling multi-agent coordination to evolve along a single task trajectory. The Evaluator assesses the previous execution outcome ξ_{t−1} with respect to the active subtask, producing stage-level assessment signals such as a success/refinement verdict, confidence, and natural-language feedback.
Figure 3
Figure 3: Task-state construction and workflow learning. (a) Explicit execution-time task-state construction improves success rates under fixed workflows across GAIA and HLE. (b) The Workflow Adapter learns steadily from sparse terminal reward on GAIA Level 1 and Level 2.
Figure 4
Figure 4: Effect of process reward across task difficulty. For standard GAIA tasks, bar height reports validation success rate. For very-hard tasks, bar height reports the best number of correct held-out validation tasks out of four; BT denotes the first training step with non-zero validation success. PRM is used in both training and inference.
Figure 5
Figure 5: Workflow Adapter behavior under different execution-time task states. For the same agent pool, the adapter assigns different layer-wise selection probabilities and instantiates different workflows as the task state evolves.
Figure 6
Figure 6: End-to-end execution trace of EvoMAS on a GAIA task. EvoMAS updates the task state across execution stages and constructs stage-specific workflows conditioned on the evolving state.
Figure 7
Figure 7: Validation-time evolution of Workflow Adapter selection probabilities.
Original abstract

Large language model (LLM)-based multi-agent systems have shown strong potential on complex tasks through agent specialization, tool use, and collaborative reasoning. However, most automated multi-agent system design methods still follow a one-shot paradigm: a workflow is optimized or selected before execution and then reused unchanged throughout the task. This static coordination strategy is ill-suited for long-horizon tasks whose subgoals, intermediate evidence, and information needs evolve over multiple execution stages. We propose EvoMAS, a framework for execution-time multi-agent workflow construction. EvoMAS formulates workflow construction as a meta-level sequential decision problem along a single task trajectory. At each stage, it constructs an explicit task state through a Planner-Evaluator-Updater pipeline and uses a learned Workflow Adapter to instantiate a stage-specific layered workflow from a fixed pool of candidate agents. The adapter is trained with policy gradients using sparse, verifiable terminal task success as the main supervision signal, while evaluator-based process reward is analyzed separately under very-hard sparse-reward settings. Experiments on GAIA, HLE, and DeepResearcher show that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods. Our analyses further show that explicit task-state construction and learned workflow adaptation provide complementary benefits. Additional results indicate that process reward is most useful when terminal success is extremely sparse, and qualitative case studies illustrate that EvoMAS adapts agent coordination as the task state evolves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoMAS, a framework for execution-time multi-agent workflow construction in LLM-based systems. It models workflow adaptation as a meta-level sequential decision process along a single task trajectory, using a Planner-Evaluator-Updater pipeline to construct explicit task states and a learned Workflow Adapter to instantiate stage-specific layered workflows from a fixed pool of candidate agents. The adapter is trained via policy gradients with sparse, verifiable terminal task success as the primary supervision signal. Experiments on the GAIA, HLE, and DeepResearcher benchmarks report that EvoMAS outperforms single-agent baselines and recent automated multi-agent workflow design methods, with additional analyses indicating complementary benefits from explicit task-state construction and learned adaptation, plus utility of process rewards under extremely sparse terminal success.

Significance. If the experimental claims hold under rigorous validation, this work would be significant for shifting multi-agent system design from static, one-shot workflows to dynamic, execution-time adaptation. The RL-based approach to learning stage-specific coordination from sparse terminal rewards, combined with the explicit state construction pipeline, addresses a clear limitation in handling long-horizon tasks with evolving subgoals. The reported complementary benefits between state construction and adaptation provide a concrete, testable insight that could guide future agent system architectures.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim of outperformance on GAIA, HLE, and DeepResearcher rests on reported gains without quantitative metrics, error bars, ablation studies, or details on stabilizing the sparse-reward policy-gradient training. This absence prevents assessing statistical reliability and effect sizes, or whether the gains reflect overfitting to the training distribution rather than genuine generalization to new long-horizon tasks.
  2. [§3 (Method)] The weakest assumption, that a fixed pool of candidate agents plus a learned adapter trained only on terminal success can produce effective, generalizable stage-specific workflows, requires explicit discussion of how the pool is constructed, the risk of distribution shift, and any controls for overfitting. Without these, the complementary-benefit claim cannot be isolated from potential confounds in the experimental setup.
minor comments (2)
  1. [Figures and Tables] Ensure all figures include error bars or confidence intervals and that ablation tables clearly label the contribution of each component (state construction vs. adaptation).
  2. [§3 (Method)] Clarify notation for the Workflow Adapter and Planner-Evaluator-Updater pipeline to avoid ambiguity when describing the sequential decision process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and methodological details.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of outperformance on GAIA, HLE, and DeepResearcher rests on reported gains without quantitative metrics, error bars, ablation studies, or details on stabilizing the sparse-reward policy-gradient training. This absence prevents assessing statistical reliability and effect sizes, or whether the gains reflect overfitting to the training distribution rather than genuine generalization to new long-horizon tasks.

    Authors: We agree that the original submission did not provide sufficient quantitative detail to allow full assessment of the claims. In the revised manuscript, §4 now includes tables with exact success rates on each benchmark, standard errors computed over five independent runs, and expanded ablation studies that isolate the contributions of task-state construction versus learned adaptation. We have also added an appendix subsection on the policy-gradient procedure, including learning curves, variance-reduction methods (advantage normalization and baseline subtraction), and convergence diagnostics under sparse terminal rewards. These changes enable evaluation of effect sizes, statistical reliability, and generalization beyond the training distribution. revision: yes

  2. Referee: [§3 (Method)] The weakest assumption, that a fixed pool of candidate agents plus a learned adapter trained only on terminal success can produce effective, generalizable stage-specific workflows, requires explicit discussion of how the pool is constructed, the risk of distribution shift, and any controls for overfitting. Without these, the complementary-benefit claim cannot be isolated from potential confounds in the experimental setup.

    Authors: We accept that the original §3 provided insufficient detail on pool construction and generalization safeguards. The revised §3.1 now explicitly describes the fixed pool as eight specialized agents whose capabilities (information retrieval, code execution, multi-step reasoning, etc.) are drawn from established tool-use literature. We discuss distribution-shift risks by noting that the adapter is trained on a broad task distribution and evaluated on completely disjoint benchmarks. Additional controls have been added, including training on a random subset of tasks and testing on the remainder, together with the existing ablations that separately disable state construction or adaptation. These revisions allow the complementary-benefit claim to be more cleanly isolated from setup confounds. revision: yes
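The variance-reduction steps the rebuttal cites (baseline subtraction and advantage normalization) follow a standard pattern; a generic sketch, not the paper's implementation:

```python
def normalize_advantages(rewards, eps=1e-8):
    """Subtract the batch-mean baseline and divide by the batch std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse 0/1 terminal rewards become zero-mean, unit-scale advantages.
advantages = normalize_advantages([1.0, 0.0, 0.0, 1.0])
```

With sparse 0/1 success this keeps gradient magnitudes comparable across batches regardless of how many episodes happened to succeed.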

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper describes an empirical RL-based training loop for a Workflow Adapter, using policy gradients supervised by external, verifiable terminal task success on benchmarks such as GAIA. The central claims rest on experimental outperformance and complementary benefits between explicit task-state construction and learned adaptation, without any derivation step that reduces a reported prediction or result to its own fitted inputs or self-referential definitions by construction. No load-bearing self-citation chains or ansatz smuggling appear in the provided description of the method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on the assumption that a fixed agent pool plus a learned adapter can cover evolving task needs, and that sparse terminal success provides a sufficient learning signal; no free parameters, axioms, or invented entities are explicitly quantified in the abstract.

pith-pipeline@v0.9.0 · 5563 in / 1118 out tokens · 40310 ms · 2026-05-12T02:25:51.753014+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 14 internal anchors

  1. [1]

    Automated multi-agent workflows for RTL design

    Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, and Dimitrios Stamoulis. Automated multi-agent workflows for RTL design. arXiv preprint arXiv:2509.20182, 2025

  2. [2]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  3. [3]

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents, 2023

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents, 2023

  4. [4]

    Multi-agent collaboration via evolving orchestration

    Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, and Maosong Sun. Multi-agent collaboration via evolving orchestration. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=L0xZPXT3le

  5. [5]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023

  6. [6]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate.CoRR, abs/2305.14325, 2023

  7. [7]

    Promptbreeder: Self-referential self-improvement via prompt evolution, 2023

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  8. [8]

    Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532,

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532, 2023

  9. [9]

    Metagpt: Meta programming for a multi-agent collaborative framework, 2024

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024

  10. [10]

    Self-evolving multi-agent collaboration networks for software development

    Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. Self-evolving multi-agent collaboration networks for software development. arXiv preprint arXiv:2410.16946, 2024

  11. [11]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

  12. [12]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  13. [13]

    CAMEL: communicative agents for "mind" exploration of large language model society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: communicative agents for "mind" exploration of large language model society. InNeurIPS, 2023

  14. [14]

    API-bank: A comprehensive benchmark for tool-augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.187. URL https://aclanthology.org/2023.emnlp-main.187/

  16. [16]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023

  17. [17]

    Babyagi.https://github.com/yoheinakajima/babyagi, 2023

    Yohei Nakajima. Babyagi.https://github.com/yoheinakajima/babyagi, 2023

  18. [18]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  19. [19]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  20. [20]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  21. [21]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

  22. [22]

    Humanity's Last Exam

    Long Phan et al. Humanity’s last exam.arXiv preprint arXiv:2501.14249, 2025

  23. [23]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789, 2023

  24. [24]

    Auto-gpt: An autonomous gpt-4 experiment

    Toran Bruce Richards et al. Auto-gpt: An autonomous gpt-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT, 2023

  25. [25]

    Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36, 2023

  26. [26]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2023

  27. [27]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint, abs/2303.11366, 2023. doi: 10.48550/arXiv.2303.11366. URL https://doi.org/10.48550/arXiv.2303.11366

  28. [28]

    Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

    Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, Ye Tian, and Sujian Li. RESTGPT: Connecting large language models with real-world applications via RESTful APIs.arXiv preprint arXiv:2306.06624, 2023

  29. [29]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv e-prints, art. arXiv:2305.16291, May 2023

  30. [30]

    A Survey on Large Language Model based Autonomous Agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents.arXiv preprint arXiv:2308.11432, 2023

  31. [31]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Jiakai Tang, Xu Chen, Wayne Xin Zhao, and Ji-Rong Wen. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  32. [32]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isabella Fulford, Hyung Won Chung, Alexandre Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025. URL https://arxiv.org/abs/2504.12516

  33. [33]

    Autogen: Enabling next-gen llm applications via multi-agent conversation framework

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023

  34. [34]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

  35. [35]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

  36. [36]

    Evoagent: Towards automatic multi-agent generation via evolutionary algorithms

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, and Deqing Yang. Evoa- gent: Towards automatic multi-agent generation via evolutionary algorithms.arXiv preprint arXiv:2406.14228, 2024

  37. [37]

    G-designer: Architecting multi-agent communication topologies via graph neural networks.arXiv preprint arXiv:2410.11782, 2024

    Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, and Dawei Cheng. G-designer: Architecting multi-agent communication topologies via graph neural networks.arXiv preprint arXiv:2410.11782, 2024

  38. [38]

    Multi-agent architecture search via agentic supernet.arXiv preprint arXiv:2502.04180, 2025

    Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet.arXiv preprint arXiv:2502.04180, 2025

  39. [39]

    Cut the crap: An economical communication pipeline for LLM-based multi-agent systems

    Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. Cut the crap: An economical communication pipeline for LLM-based multi-agent systems. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=LkzuPorQ5L

  40. [40]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024

  41. [41]

    MetaAgent: Automatically constructing multi-agent systems based on finite state machines

    Yaolun Zhang, Xiaogeng Liu, and Chaowei Xiao. MetaAgent: Automatically constructing multi-agent systems based on finite state machines. arXiv preprint arXiv:2507.22606, 2025

  42. [42]

    a2flow: Automating agentic workflow generation via self-adaptive abstraction operators

    Mingming Zhao, Xiaokang Wei, Yuanqi Shao, Kaiwen Zhou, Lin Yang, Siwei Rao, Junhui Zhan, and Zhitang Chen. a2flow: Automating agentic workflow generation via self-adaptive abstraction operators. arXiv preprint arXiv:2511.20693, 2025

  43. [43]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://aclanthology.org/2025.emnlp-main.22/

  44. [44]

    WebArena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024

  45. [45]

    Gptswarm: Language agents as optimizable graphs

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024
