arxiv: 2310.02170 · v2 · pith:DSIAQX4Dnew · submitted 2023-10-03 · 💻 cs.CL · cs.AI · cs.MA

A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration

Pith reviewed 2026-05-21 20:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA
keywords LLM agents, multi-agent collaboration, dynamic networks, agent selection, task-oriented collaboration, Agent Importance Score, reasoning benchmarks

The pith

DyLAN selects LLM agents via an importance score computed from trial runs, then connects them dynamically for each task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DyLAN, a framework that first optimizes a team of agents by scoring their contributions in a preliminary trial and then lets the selected agents collaborate dynamically for a given query. This addresses two limitations of existing multi-agent LLM systems: a fixed number of agents and static communication structures. If correct, practitioners can achieve higher accuracy on tasks like reasoning and code generation without always using all available agents or rigid communication patterns. The approach shows gains of up to 25 percent on certain MMLU subjects while keeping computational costs moderate.

Core claim

DyLAN operates a two-stage paradigm of Team Optimization followed by Task Solving. In the first stage an agent selection algorithm based on the unsupervised Agent Importance Score chooses the best agents according to their contributions in a preliminary trial oriented to the given task. In the second stage the selected agents collaborate dynamically according to the query, outperforming strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost.
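The two-stage flow described above can be illustrated with a minimal sketch. It assumes that per-agent, per-query scores from the preliminary trial are already available as plain numbers, and it replaces the paper's dynamic communication structure with a simple majority vote. The names `select_team`, `solve`, and the `Agent` type are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical type: an agent maps a query string to an answer string.
Agent = Callable[[str], str]


def select_team(trial_scores: Dict[str, List[float]], k: int) -> List[str]:
    """Stage 1 (Team Optimization), simplified: keep the k agents with the
    highest mean per-query score from the preliminary trial."""
    avg = {name: mean(scores) for name, scores in trial_scores.items()}
    # Sort by descending mean score; break ties by name for determinism.
    return sorted(avg, key=lambda name: (-avg[name], name))[:k]


def solve(query: str, team: List[str], agents: Dict[str, Agent]) -> str:
    """Stage 2 (Task Solving), reduced to a majority vote over the team.
    ASSUMPTION: this stands in for the dynamic collaboration, which is
    not modeled here."""
    answers = [agents[name](query) for name in team]
    # Ties are broken alphabetically because candidates are sorted first.
    return max(sorted(set(answers)), key=answers.count)
```

The sketch separates selection from solving so that the selection step can be audited on its own, which is the part the referee report below focuses on.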

What carries the argument

Agent Importance Score: an unsupervised metric that ranks candidate agents by their measured contributions during a preliminary trial run to select an effective team for dynamic collaboration.

Load-bearing premise

The Agent Importance Score from a small preliminary trial will reliably predict which agents contribute most on new unseen queries.
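One way to probe this premise is to check whether the selected team changes when the preliminary trial examples are resampled. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes a table of per-agent, per-query scores is available, selects the top-k agents on random subsets of queries, and reports the mean pairwise Jaccard overlap of the resulting teams. The function names and default parameters are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Set


def top_k(scores: Dict[str, Sequence[float]], idx: List[int], k: int) -> Set[str]:
    """Top-k agents by mean score over the query indices in idx."""
    avg = {agent: mean(vals[i] for i in idx) for agent, vals in scores.items()}
    return set(sorted(avg, key=lambda agent: (-avg[agent], agent))[:k])


def selection_stability(
    scores: Dict[str, Sequence[float]],
    k: int,
    subset_size: int,
    n_resamples: int = 50,
    seed: int = 0,
) -> float:
    """Mean pairwise Jaccard overlap between teams selected on random
    trial subsets. 1.0 means the same team is chosen every time."""
    if n_resamples < 2:
        raise ValueError("need at least two resamples to compare teams")
    rng = random.Random(seed)
    n_queries = len(next(iter(scores.values())))
    teams = [
        top_k(scores, rng.sample(range(n_queries), subset_size), k)
        for _ in range(n_resamples)
    ]
    return mean(len(a & b) / len(a | b) for a, b in combinations(teams, 2))
```

A value near 1.0 would suggest the selection is insensitive to which trial examples are drawn; a low value would indicate that the chosen team depends heavily on the particular trial set.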

What would settle it

A new set of queries where the team chosen by the importance score performs no better than a random selection or a fixed full set of agents.
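The comparison described above can be sketched for a small candidate pool by enumerating every team of the same size and asking what fraction the selected team strictly beats. This is a hedged illustration: `team_accuracy` is a hypothetical callable that would evaluate a team on held-out queries, and exhaustive enumeration is used here in place of random sampling, which only works when the pool is small.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def fraction_of_teams_beaten(
    selected: Iterable[str],
    candidates: Iterable[str],
    k: int,
    team_accuracy: Callable[[FrozenSet[str]], float],
) -> float:
    """Fraction of all other size-k teams whose held-out accuracy is
    strictly lower than that of the selected team."""
    selected_team = frozenset(selected)
    target = team_accuracy(selected_team)
    others = [
        frozenset(combo)
        for combo in combinations(sorted(candidates), k)
        if frozenset(combo) != selected_team
    ]
    if not others:
        raise ValueError("no alternative teams to compare against")
    return sum(team_accuracy(team) < target for team in others) / len(others)
```

A result close to 0.5 would mean the score-based selection does no better than a typical random team, which is the outcome that would undercut the central claim.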

Original abstract

Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network (DyLAN) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an agent selection algorithm, based on an unsupervised metric called Agent Importance Score, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DyLAN, a two-stage framework for LLM-powered agent collaboration. Stage 1 (Team Optimization) uses an unsupervised Agent Importance Score computed from a preliminary trial run on a small set of examples to automatically select a subset of agents from candidates. Stage 2 (Task Solving) has the selected agents collaborate via a dynamic communication structure tailored to the input query. The central empirical claim is consistent outperformance over strong baselines across code generation, decision-making, general reasoning, and arithmetic reasoning, plus up to 25% accuracy gains on selected MMLU subjects attributable to the team-optimization stage.

Significance. If the Agent Importance Score generalizes reliably, DyLAN would supply a practical, low-overhead method for forming task-specific agent teams without fixing team size or topology in advance. The work supplies reproducible empirical comparisons across four task categories and explicitly credits the unsupervised character of the selection metric; these are genuine strengths. The result would be of interest to the multi-agent LLM literature provided the selection mechanism is shown to be stable.

major comments (2)
  1. §4.2–4.3 (MMLU and reasoning experiments): the headline 25% accuracy lift is obtained by selecting agents on the basis of the preliminary-trial Importance Score, yet the manuscript provides no held-out validation, cross-validation, or sensitivity analysis showing that the score remains stable when the preliminary examples, prompt phrasing, or query distribution change; without this, the reported gains risk being inflated by implicit tuning to the trial set.
  2. §3.1 (Agent Importance Score definition): the score is computed from observable trial outputs and is presented as unsupervised, but the text does not demonstrate that the ranking it induces on the candidate pool predicts actual contribution on unseen queries; this predictive link is load-bearing for the claim that the two-stage paradigm yields reliable improvement.
minor comments (2)
  1. Table 1 and Figure 3: baseline descriptions should explicitly state whether the same number of agents and the same underlying LLM are used across all compared methods to ensure fair comparison.
  2. §5 (Discussion): the claim of 'moderate computational cost' would be strengthened by reporting wall-clock time or token usage relative to the strongest baseline rather than absolute numbers alone.
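The relative cost reporting requested in the last minor comment amounts to a small calculation, sketched here under the assumption that token counts and accuracies are available per method. The function name and the dictionary layout are hypothetical, and any numbers passed in are the caller's own; none are taken from the paper.

```python
from typing import Dict, Tuple


def relative_cost(
    tokens: Dict[str, int],
    accuracy: Dict[str, float],
    method: str,
) -> Tuple[str, float, float]:
    """Compare `method` against the most accurate other method.

    Returns (strongest_baseline, token_ratio, accuracy_delta), where
    token_ratio > 1 means `method` uses more tokens than that baseline.
    """
    baselines = [name for name in accuracy if name != method]
    if not baselines:
        raise ValueError("at least one baseline is required")
    strongest = max(baselines, key=lambda name: accuracy[name])
    token_ratio = tokens[method] / tokens[strongest]
    accuracy_delta = accuracy[method] - accuracy[strongest]
    return strongest, token_ratio, accuracy_delta
```

Reporting the ratio next to the accuracy difference makes it clear how much extra compute each point of accuracy costs.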

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strengths of DyLAN, including its reproducible comparisons and unsupervised selection approach. We address the two major comments below regarding validation of the Agent Importance Score.

Point-by-point responses
  1. Referee: §4.2–4.3 (MMLU and reasoning experiments): the headline 25% accuracy lift is obtained by selecting agents on the basis of the preliminary-trial Importance Score, yet the manuscript provides no held-out validation, cross-validation, or sensitivity analysis showing that the score remains stable when the preliminary examples, prompt phrasing, or query distribution change; without this, the reported gains risk being inflated by implicit tuning to the trial set.

    Authors: We agree that explicit stability analysis would strengthen the empirical claims. The current results already show consistent gains across four distinct task categories using the same selection procedure, which provides indirect evidence of robustness. In the revised manuscript we will add sensitivity experiments that vary the number and distribution of preliminary examples as well as prompt phrasing, together with performance on held-out query sets, to directly quantify stability of the selected teams.
    revision: yes

  2. Referee: §3.1 (Agent Importance Score definition): the score is computed from observable trial outputs and is presented as unsupervised, but the text does not demonstrate that the ranking it induces on the candidate pool predicts actual contribution on unseen queries; this predictive link is load-bearing for the claim that the two-stage paradigm yields reliable improvement.

    Authors: The Agent Importance Score is computed solely from observable trial outputs without any task-specific labels, satisfying the unsupervised criterion. While the manuscript demonstrates that teams chosen by this ranking outperform fixed-agent baselines on unseen test queries, we acknowledge that an explicit correlation analysis between per-agent scores and downstream contribution would make the predictive link more transparent. We will include such an analysis in the revision.
    revision: yes
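The correlation analysis mentioned in the second response could take the form of a Spearman rank correlation between per-agent importance scores and a measured downstream contribution. The sketch below is a pure-Python illustration under that assumption; how the downstream contribution would be measured (for example, by ablating each agent) is not specified in the text above and is left to the caller.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = shared_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

A coefficient near 1 would support the predictive link the referee asks about; a coefficient near 0 would suggest the ranking carries little information about contribution on unseen queries.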

Circularity Check

0 steps flagged

No significant circularity in empirical agent framework

Full rationale

The paper introduces DyLAN as an empirical two-stage framework for dynamic LLM agent collaboration, with the central claims resting on performance comparisons against baselines on code generation, reasoning, and MMLU tasks. The Agent Importance Score is computed directly from observable outputs in a preliminary trial run on task-oriented examples and is not defined in terms of the final accuracy gains or reduced to a fitted parameter by any equations in the described method. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps, and the reported improvements (including up to 25% on specific MMLU subjects) are presented as measured results rather than derived predictions that loop back to inputs by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework introduces one new metric (Agent Importance Score) whose exact formula is not given in the abstract and therefore counts as an ad-hoc construction for this paper. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: A preliminary trial run on a small number of examples produces an importance score that generalizes to the full task distribution.
    This assumption is required for the team-optimization stage to be useful; it is invoked when the paper states that agents are selected according to their contributions in a preliminary trial.



Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  2. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  3. Automated Design of Agentic Systems

    cs.AI 2024-08 conditional novelty 7.0

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

  4. Self-Evolving Multi-Agent Systems via Decentralized Memory

    cs.MA 2026-05 unverdicted novelty 6.0

    DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.

  5. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

    cs.AI 2026-05 unverdicted novelty 6.0

    AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.

  6. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    SIGMA builds a signed relational graph among LLM agents and uses conflict-aware message passing plus weighted aggregation to produce more consistent predictions than prior cooperative-assumption baselines.

  7. Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

    cs.CL 2026-05 unverdicted novelty 6.0

    TopoPrior learns transferable topology priors offline from multi-domain reference graphs using a conditional variational graph model and adversarial adaptation to initialize collaboration structures for multi-agent LL...

  8. RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems

    cs.CL 2026-04 unverdicted novelty 6.0

    RoadMapper multi-agent system improves LLM roadmap generation for complex research problems by over 8% on average and reduces required human expert time by 84%.

  9. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  10. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  11. Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    Mem²Evolve integrates experience memory and asset memory so that LLM agents expand capabilities through guided asset creation and distill new experience from those assets, yielding 6-18% gains over single-path evoluti...

  12. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  13. Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

  14. Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

    cs.CL 2025-10 unverdicted novelty 6.0

    GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.

  15. GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

    cs.AI 2025-07 unverdicted novelty 6.0

    GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...

  16. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  17. Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.

  18. Reinforced Collaboration in Multi-Agent Flow Networks

    cs.LG 2026-05 unverdicted novelty 5.0

    MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.

  19. Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

    cs.AI 2026-04 conditional novelty 5.0

    A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.

  20. Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care

    cs.AI 2026-05 unverdicted novelty 4.0

    Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.

  21. Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    cs.CL 2024-01 unverdicted novelty 4.0

    The paper surveys LLM-based multi-agent systems, covering simulated domains, agent profiling and communication, mechanisms for capacity growth, and common benchmarks.

  22. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

    cs.AI 2024-04 unverdicted novelty 3.0

    A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 22 Pith papers · 8 internal anchors

  1. [1]

    Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs

    Pranjal Aggarwal, Aman Madaan, Yiming Yang, and Mausam. Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, Singapore, December

  2. [2]

    doi: 10.18653/v1/2023.emnlp-main.761

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.761. URL https: //aclanthology.org/2023.emnlp-main.761. Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pp. 173–186, USA,

  3. [3]

    Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

    URL https://openreview.net/forum?id=FQepisCUWu. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In Proceedings of The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/ forum?id=ktrw68Cmu9c. Guangyao Chen, Siwei Dong, Yu...

  4. [4]

    URL https://openreview.net/forum?id=EHg5GDnyq1. Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, and Jun Wang. Pangu- Agent: A Fine-Tunable Generalist Agent with Structured Reasoni...

  5. [5]

    Self-collaboration Code Generation via ChatGPT

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration Code Generation via ChatGPT. arXiv preprint arXiv:2304.07590,

  6. [6]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Im- proving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

  7. [7]

    Chatllm network: More brains, more intelligence

    Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intelligence. arXiv preprint arXiv:2304.12998,

  8. [8]

    Surrealdriver: Designing llm-powered generative driver agent framework based on human drivers’ driving-thinking data,

    Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.792. Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprinit arX...

  9. [9]

    CAMEL: Communicative agents for ”mind” exploration of large language model society

    12 Published as a conference paper at COLM 2024 Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for ”mind” exploration of large language model society. In Proceedings of Thirty-seventh Conference on Neural Information Processing Systems,

  10. [10]

    Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    URL https://openreview.net/forum?id=3IyL2XWDkG. Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118,

  11. [11]

    An efficient and truthful pricing mechanism for team formation in crowdsourcing markets

    Qing Liu, Tie Luo, Ruiming Tang, and St´ephane Bressan. An efficient and truthful pricing mechanism for team formation in crowdsourcing markets. In 2015 IEEE International Conference on Communications (ICC), pp. 567–572,

  12. [12]

    URL https://openreview.net/forum?id= zAdUB0aCTQ. Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents. arXiv preprint arXiv:2308.05960,

  13. [13]

    Laser: Llm agent with state-space exploration for web navigation

    Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. Laser: Llm agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172,

  14. [14]

    GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems

    Nathalia Nascimento, Paulo Alencar, and Donald Cowan. GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems. arXiv preprint arXiv:2308.10435,

  15. [15]

    URL https://openreview.net/forum?id= mqVgBbNCm9. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  16. [16]

    ChatDev: Communicative Agents for Software Development

    Association for Computational Linguistics. URL https://www.aclweb.org/anthology/ W18-6319. Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023a. 13 Published as a conference paper at C...

  17. [17]

    Large language models are effective text rankers with pairwise ranking prompting.arXiv preprint arXiv:2306.17563,

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting.arXiv preprint arXiv:2306.17563,

  18. [18]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arxiv:2009.10297,

  19. [19]

    TPTU: Task planning and tool usage of large language model-based AI agents

    Jingqing Ruan, YiHong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, du qing, shi shiwei, Hangyu Mao, Xingyu Zeng, and Rui Zhao. TPTU: Task planning and tool usage of large language model-based AI agents. In NeurIPS 2023 Foundation Models for Decision Making Workshop,

  20. [20]

    Mirac Suzgun and Adam Tauman Kalai

    URL https://openreview.net/ forum?id=vAElhFcKW6. Mirac Suzgun and Adam Tauman Kalai. Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. arXiv preprint arXiv:2401.12954,

  21. [21]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Second Agent Learning in Open-Endedness Workshop , 2023a. URL https://openreview.net/forum?id=pAMNKGwja6. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan N...

  22. [22]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    URL https://openreview. net/forum?id= VjQlMeSB J. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155,

  23. [23]

    Listwise approach to learning to rank: Theory and algorithm

    14 Published as a conference paper at COLM 2024 Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 1192–1199, New York, NY, USA,

  24. [24]

    ISBN 9781605582054

    Association for Computing Machinery. ISBN 9781605582054. Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 , pp. 7572–7590, Si...

  25. [25]

    doi: 10.18653/v1/2023.findings-emnlp.508

    Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.508. URL https: //aclanthology.org/2023.findings-emnlp.508. Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference ...

  26. [26]

    Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu

    URL https: //openreview.net/forum?id=WE vluYUL-X. Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu. Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in N...

  27. [27]

    doi: 10.18653/v1/2023.emnlp-main.936

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.936. URL https://aclanthology.org/2023.emnlp-main

  28. [28]

    Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S

    Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. Nisp: Pruning networks using neuron importance score propagation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9194–9203,

  29. [29]

    Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View

    Jintian Zhang, Xin Xu, and Shumin Deng. Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View. arXiv preprint arXiv:2310.02124, 2023a. Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and Deeper LLM Networks are Fairer LLM Evaluators. arXiv preprint arXiv:2308.01862, 2023b. Yifa...

  30. [30]

    Progressive-hint prompting improves reasoning in large language models

    Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797,

  31. [31]

    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406,

  32. [32]

    LLM As DBA

    Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM As DBA. arXiv preprint arXiv:2308.05481,

  33. [33]

    Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144,

  34. [34]

    Language agents as optimizable graphs

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and J ¨urgen Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823,

  35. [35]

    In fact, the performance could be further leveraged by task- specific methods like CodeBLEU (Ren et al.,

    15 Published as a conference paper at COLM 2024 A Discussion & Limitation In experiments, we view code generation tasks as representative of open-ended generation tasks and adopt BLEU to decide whether two answers are consistent in early stopping mechanism in Section 3.3.2. In fact, the performance could be further leveraged by task- specific methods like...

  36. [36]

    or CodeT (Chen et al., 2023a). For practical usage, the agent-evaluation metrics could cooperate with human annotation to give a more precise evaluation result on individual contributions of agents, mainly when facing data scarcity problems. Furthermore, we simply incorporating agent selection on Dy- LAN with agent team reformation, as a primary step towa...

  37. [37]

    We follow the answer extraction method from the origin paper (Hendrycks et al., 2021b)

    from the MATH dataset (Hendrycks et al., 2021b) and Complex CoT from PHP (Zheng et al., 2023). We follow the answer extraction method from the origin paper (Hendrycks et al., 2021b). We construct DyLAN with 4 agents assigned no specific roles and let agents to interact for at maximum T = 4 rounds under T-FFN formulation. We reported the classification acc...

  38. [38]

    nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1

    To ensure the participation of each agent, early- stopping mechanism functions at the third layer and later ( t ≥ 3). We use BLEU score in the early-stopping mechanism. We calculate BLEU by sacreBLEU2 (Post, 2018). For answer post-processing, we store all unit tests from the unit tester (if exists in the system) and randomly select the final output from t...

  39. [39]

    We sample the subsets with the proportions of 1% and 10% of the original dataset

    and the CG task. We sample the subsets with the proportions of 1% and 10% of the original dataset. Agent Im- portance Score for agent selection is av- eraged on the subsets, and the selected team is tested on the whole dataset. We raise random selection and human prior selection as baselines. The latter is sim- ulated by GPT-4 prompted by the task and age...

  40. [40]

    Doctor” and “Programmer

    Due to budget limits, we directly reuse the performance reported in the baselines' papers, including LATS (Zhou et al., 2023), Reflexion (Shinn et al., 2023), Meta-GPT (Hong et al., 2024), and AgentVerse (Chen et al., 2024), and estimate the cost in terms of the number of API calls. DyLAN is also constructed from agents that are optimized based on GPT-...

  41. [41]

    Green annotation denotes the fields related to the role from the human perspective, which are annotated manually

    Role: Doctor, Programmer. Top-10 subjects (in extraction order): high school computer science; high school physics; clinical knowledge; electrical engineering; college biology; high school government and politics; professional medicine; college computer science; nutrition; college chemistry; high school US history; high school mathematics; human aging; formal logic; anatomy; abstract algebra; high school b...

  42. [42]

    Experiments show that temperature greatly influences arithmetic reasoning and code generation tasks

    From experimental results, we found that DyLAN is more stable across different hyper-parameters. Experiments show that temperature greatly influences arithmetic reasoning and code generation tasks. In Figure 4, we found that most baseline methods have significant performance drops when temperature increases, but DyLAN shows strong robustness to various tem-...

  43. [43]

    We tested a listwise ranker with our own prompts, the pairwise GPT ranker from the original LLM-Blender (Jiang et al., 2023), Elo Score from TrueSkill (Herbrich et al.,

    We also tested different ranking methods for agent team reformation of DyLAN on the GR task. We tested a listwise ranker with our own prompts, the pairwise GPT ranker from the original LLM-Blender (Jiang et al., 2023), Elo Score from TrueSkill (Herbrich et al.,

  44. [44]

    also implemented with a pairwise ranker, and a pairwise ranker with the Sliding Window algorithm (Qin et al., 2023). In Table 14, we show that different ranking methods have a relatively low impact on performance, probably because of the strong discrimination ability of GPT-3.5, but pairwise ranking methods always incur higher computational cost. Thus, we chose a...

  45. [45]

    Mathematician

    to the combination set: $S_i(R) = \frac{1}{|C||R|} \sum_{T \in C} \big(\mathrm{Performance}(T \cup \{i\}) - \mathrm{Performance}(T)\big)$ (13), where $R$ is the set of agents in the system, $C$ is the combination set of $R \setminus \{i\}$, $i \in R$, and Performance denotes the overall performance of the system on the current task, e.g., classification accuracy or Pass@1. The metric requires ground truth and multi-pass re...
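
Equation (13) can be read as an averaged marginal contribution, and the sketch below computes it directly. It assumes that the combination set C contains every subset of R \ {i}, including the empty set, and that `performance` is a caller-supplied function that evaluates a team; `contribution_score` and `toy_performance` are hypothetical names, and the weights are made-up values, not results from the paper.

```python
from itertools import chain, combinations
from typing import Callable, FrozenSet, Iterable, List


def all_subsets(items: Iterable[str]) -> List[FrozenSet[str]]:
    """Every subset of `items`, including the empty set."""
    items = list(items)
    tuples = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(t) for t in tuples]


def contribution_score(agent: str, agents: List[str],
                       performance: Callable[[FrozenSet[str]], float]) -> float:
    """S_i(R) = 1 / (|C||R|) * sum over T in C of Performance(T ∪ {i}) - Performance(T)."""
    others = [a for a in agents if a != agent]
    subsets = all_subsets(others)  # assumed to be the combination set C
    total = sum(performance(t | {agent}) - performance(t) for t in subsets)
    return total / (len(subsets) * len(agents))


# Toy additive performance: each agent adds a fixed, made-up amount of accuracy.
weights = {"a": 0.3, "b": 0.1, "c": 0.2}


def toy_performance(team: FrozenSet[str]) -> float:
    return sum(weights[name] for name in team)


score_a = contribution_score("a", ["a", "b", "c"], toy_performance)
```

With an additive `performance`, every marginal gain equals the agent's own weight, so the score reduces to that weight divided by |R|. The loop calls `performance` on every subset, which illustrates why the passage notes that this metric needs ground truth and repeated evaluation.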

  46. [46]

    Examples: ······ ””” Please follow the function signature and complete the implementation of the Python function

    def modp(n: int, p: int): ”””Return 2^n modulo p. Examples: ······ ””” Please follow the function signature and complete the implementation of the Python function. Reply the completion in a code block. ① Query My answer is: ```python res = 1 n = n % (p-1) while n > 0: if n % 2 == 1: ...... ``` (incorrect) Pr...
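
For reference, one possible completion of the `modp` task is sketched below using standard square-and-multiply exponentiation. The loop bodies in the agents' replies are elided in this excerpt, so this is an illustrative solution and not a reconstruction of any agent's code. Reducing `n` modulo `p - 1`, as the first reply does, is only valid under extra assumptions on `p` (it fails, for example, for `p = 2` and for composite `p` in general), which may be one reason that reply is marked incorrect.

```python
def modp(n: int, p: int) -> int:
    """Return 2^n modulo p using square-and-multiply."""
    if p == 1:  # every integer is congruent to 0 modulo 1
        return 0
    res = 1
    base = 2 % p
    while n > 0:
        if n % 2 == 1:  # current lowest bit of the exponent is set
            res = (res * base) % p
        base = (base * base) % p  # square the base for the next bit
        n //= 2
    return res
```

The result can be cross-checked against Python's built-in three-argument `pow(2, n, p)`.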

  47. [47]

    ```python res = np.power(2, n) return res ``` (incorrect) Algorithm Developer Algorithm Developer Apologize for the error

    ```python if p == 1: # edge case when p equals 1 return 0 res = 1 base = 2 % p while n > 0: ······ ``` (correct) It will be effiecient to use `numpy` to solve the modulo problem. ```python res = np.power(2, n) return res ``` (incorrect) Algorithm Developer Algorithm Developer Apologize for the error. I add the `import` command into Python script below. ``...

  48. [48]

    Doctor ④ Agent Importance Scores Figure 6: A case of DyLAN solving a general reasoning task

    Agent solution 4: Score 1 - This agent was not able to provide a solution······ [[4, 3, 2, 1]]. Doctor ④ Agent Importance Scores Figure 6: A case of DyLAN solving a general reasoning task. Different agents are recruited to give and refine solutions. The result is incorrect at the first time step but correct at the second time step. It includes the ratings ...

  49. [49]

    Action: search[...]

    We annotate, in parentheses, the task where each prompt is used, along with the source of each prompt template. We omit the in-context examples of AR tasks from the original dataset of MATH (Hendrycks et al., 2021b) and PHP (Zheng et al., 2023), and WebShop from ReAct (Yao et al., 2023). Prompt Content Source MMLU Instruction (GR) Here is the question: {questio...