A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration

Diyi Yang; Peng Li; Yang Liu; Yanzhe Zhang; Zijun Liu

arxiv: 2310.02170 · v2 · pith:DSIAQX4Dnew · submitted 2023-10-03 · 💻 cs.CL · cs.AI · cs.MA

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, Diyi Yang

Pith reviewed 2026-05-21 20:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA

keywords LLM agents · multi-agent collaboration · dynamic networks · agent selection · task-oriented collaboration · Agent Importance Score · reasoning benchmarks

The pith

DyLAN selects LLM agents using an importance score computed from trial runs, then connects the selected agents dynamically for each task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DyLAN, a framework that first optimizes a team of agents by scoring their contributions in a preliminary trial and then lets the selected agents collaborate dynamically for a given query. This addresses the limitation of fixed numbers and static structures in existing multi-agent LLM systems. If correct, it means practitioners can achieve higher accuracy on tasks like reasoning and code generation without always using all available models or rigid communication patterns. The approach shows gains of up to 25 percent on certain MMLU subjects while keeping computational costs moderate.

Core claim

DyLAN operates a two-stage paradigm of Team Optimization followed by Task Solving. In the first stage an agent selection algorithm based on the unsupervised Agent Importance Score chooses the best agents according to their contributions in a preliminary trial oriented to the given task. In the second stage the selected agents collaborate dynamically according to the query, outperforming strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost.

What carries the argument

Agent Importance Score: an unsupervised metric that ranks candidate agents by their measured contributions during a preliminary trial run to select an effective team for dynamic collaboration.
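The selection step can be illustrated with a minimal sketch. It assumes per-query importance scores have already been computed for each candidate agent; the formula for the score itself is not reproduced here, and the function name `select_team`, the role names other than "Doctor" and "Programmer", and all numbers are hypothetical.

```python
from statistics import mean


def select_team(trial_scores: dict[str, list[float]], team_size: int) -> list[str]:
    """Keep the `team_size` agents with the highest mean trial score.

    `trial_scores` maps an agent name to one precomputed importance score
    per preliminary-trial query. How those scores are produced is outside
    the scope of this sketch.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    averaged = {agent: mean(scores) for agent, scores in trial_scores.items()}
    # Sort by descending mean score; break ties by name so the result is deterministic.
    ranked = sorted(averaged, key=lambda agent: (-averaged[agent], agent))
    return ranked[:team_size]


# Made-up scores for four candidate roles over three trial queries.
trial_scores = {
    "Programmer": [0.30, 0.20, 0.40],
    "Doctor": [0.10, 0.20, 0.10],
    "Economist": [0.25, 0.35, 0.45],
    "Lawyer": [0.35, 0.25, 0.20],
}
team = select_team(trial_scores, team_size=2)
print(team)
```

Averaging over the trial queries before ranking mirrors the description of scores being averaged on a sampled subset; any other aggregation would be a different design choice.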

Load-bearing premise

The Agent Importance Score from a small preliminary trial will reliably predict which agents contribute most on new unseen queries.

What would settle it

A new set of queries where the team chosen by the importance score performs no better than a random selection or a fixed full set of agents.
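One way to make this check concrete is to compare the selected team's accuracy with the accuracies of randomly drawn teams on the same new queries. The sketch below assumes those accuracies have already been measured; the function name and all numbers are hypothetical and are not results from the paper.

```python
def fraction_of_random_teams_matching(
    selected_accuracy: float, random_accuracies: list[float]
) -> float:
    """Share of random teams that do at least as well as the selected team.

    A value near 0 suggests the selection adds something beyond chance;
    a value near 0.5 or above suggests it does not.
    """
    if not random_accuracies:
        raise ValueError("need at least one random-team accuracy")
    hits = sum(1 for acc in random_accuracies if acc >= selected_accuracy)
    return hits / len(random_accuracies)


# Made-up accuracies on a held-out query set.
selected_accuracy = 0.70
random_accuracies = [0.62, 0.66, 0.71, 0.64, 0.69]
share = fraction_of_random_teams_matching(selected_accuracy, random_accuracies)
print(share)
```

With only a handful of random teams the estimate is coarse; a realistic check would draw many more teams and repeat across query sets.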

Original abstract

Recent studies show that collaborating multiple large language model (LLM) powered agents is a promising way for task solving. However, current approaches are constrained by using a fixed number of agents and static communication structures. In this work, we propose automatically selecting a team of agents from candidates to collaborate in a dynamic communication structure toward different tasks and domains. Specifically, we build a framework named Dynamic LLM-Powered Agent Network ($\textbf{DyLAN}$) for LLM-powered agent collaboration, operating a two-stage paradigm: (1) Team Optimization and (2) Task Solving. During the first stage, we utilize an $\textit{agent selection}$ algorithm, based on an unsupervised metric called $\textit{Agent Importance Score}$, enabling the selection of best agents according to their contributions in a preliminary trial, oriented to the given task. Then, in the second stage, the selected agents collaborate dynamically according to the query. Empirically, we demonstrate that DyLAN outperforms strong baselines in code generation, decision-making, general reasoning, and arithmetic reasoning tasks with moderate computational cost. On specific subjects in MMLU, selecting a team of agents in the team optimization stage improves accuracy by up to 25.0% in DyLAN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DyLAN, a two-stage framework for LLM-powered agent collaboration. Stage 1 (Team Optimization) uses an unsupervised Agent Importance Score computed from a preliminary trial run on a small set of examples to automatically select a subset of agents from candidates. Stage 2 (Task Solving) has the selected agents collaborate via a dynamic communication structure tailored to the input query. The central empirical claim is consistent outperformance over strong baselines across code generation, decision-making, general reasoning, and arithmetic reasoning, plus up to 25% accuracy gains on selected MMLU subjects attributable to the team-optimization stage.

Significance. If the Agent Importance Score generalizes reliably, DyLAN would supply a practical, low-overhead method for forming task-specific agent teams without fixing team size or topology in advance. The work supplies reproducible empirical comparisons across four task categories and explicitly credits the unsupervised character of the selection metric; these are genuine strengths. The result would be of interest to the multi-agent LLM literature provided the selection mechanism is shown to be stable.

major comments (2)

§4.2–4.3 (MMLU and reasoning experiments): the headline 25% accuracy lift is obtained by selecting agents on the basis of the preliminary-trial Importance Score, yet the manuscript provides no held-out validation, cross-validation, or sensitivity analysis showing that the score remains stable when the preliminary examples, prompt phrasing, or query distribution change; without this, the reported gains risk being inflated by implicit tuning to the trial set.
§3.1 (Agent Importance Score definition): the score is computed from observable trial outputs and is presented as unsupervised, but the text does not demonstrate that the ranking it induces on the candidate pool predicts actual contribution on unseen queries; this predictive link is load-bearing for the claim that the two-stage paradigm yields reliable improvement.
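The stability concern in the first comment could be quantified, for instance, by re-running selection on several resampled preliminary subsets and measuring how much the chosen teams overlap. The sketch below uses mean pairwise Jaccard similarity as one possible overlap measure; this metric is an assumption of the sketch rather than something the manuscript reports, and the team names are placeholders.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(teams: list[set[str]]) -> float:
    """Average Jaccard similarity over all pairs of selected teams."""
    pairs = list(combinations(teams, 2))
    if not pairs:
        raise ValueError("need at least two teams to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder teams selected from three different preliminary subsets.
teams = [{"A", "B"}, {"A", "B"}, {"A", "C"}]
stability = mean_pairwise_jaccard(teams)
print(round(stability, 3))
```

A value of 1.0 means every resample selects the same team; values well below 1.0 would indicate that the selection depends strongly on which preliminary examples were drawn.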

minor comments (2)

Table 1 and Figure 3: baseline descriptions should explicitly state whether the same number of agents and the same underlying LLM are used across all compared methods to ensure fair comparison.
§5 (Discussion): the claim of 'moderate computational cost' would be strengthened by reporting wall-clock time or token usage relative to the strongest baseline rather than absolute numbers alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strengths of DyLAN, including its reproducible comparisons and unsupervised selection approach. We address the two major comments below regarding validation of the Agent Importance Score.

Referee: §4.2–4.3 (MMLU and reasoning experiments): the headline 25% accuracy lift is obtained by selecting agents on the basis of the preliminary-trial Importance Score, yet the manuscript provides no held-out validation, cross-validation, or sensitivity analysis showing that the score remains stable when the preliminary examples, prompt phrasing, or query distribution change; without this, the reported gains risk being inflated by implicit tuning to the trial set.

Authors: We agree that explicit stability analysis would strengthen the empirical claims. The current results already show consistent gains across four distinct task categories using the same selection procedure, which provides indirect evidence of robustness. In the revised manuscript we will add sensitivity experiments that vary the number and distribution of preliminary examples as well as prompt phrasing, together with performance on held-out query sets, to directly quantify stability of the selected teams.

Revision: yes

Referee: §3.1 (Agent Importance Score definition): the score is computed from observable trial outputs and is presented as unsupervised, but the text does not demonstrate that the ranking it induces on the candidate pool predicts actual contribution on unseen queries; this predictive link is load-bearing for the claim that the two-stage paradigm yields reliable improvement.

Authors: The Agent Importance Score is computed solely from observable trial outputs without any task-specific labels, satisfying the unsupervised criterion. While the manuscript demonstrates that teams chosen by this ranking outperform fixed-agent baselines on unseen test queries, we acknowledge that an explicit correlation analysis between per-agent scores and downstream contribution would make the predictive link more transparent. We will include such an analysis in the revision.

Revision: yes
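The correlation analysis promised in the second response could take the form of a rank correlation between per-agent importance scores and some measure of downstream contribution. The sketch below computes Spearman's rho with the standard no-ties formula; using a leave-one-out accuracy drop as the contribution measure is an assumption of this sketch, and all numbers are made up.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank values from 1 (smallest) to n (largest); ties are not supported."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least two items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up values: one entry per candidate agent.
importance = [0.35, 0.30, 0.27, 0.13]      # mean importance score in the trial
accuracy_drop = [0.08, 0.05, 0.06, 0.01]   # hypothetical leave-one-out drop on unseen queries
rho = spearman_rho(importance, accuracy_drop)
print(round(rho, 2))
```

A rho close to 1 would support the claim that the trial ranking predicts contribution on unseen queries; a rho near 0 would undercut it.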

Circularity Check

0 steps flagged

No significant circularity in empirical agent framework

The paper introduces DyLAN as an empirical two-stage framework for dynamic LLM agent collaboration, with the central claims resting on performance comparisons against baselines on code generation, reasoning, and MMLU tasks. The Agent Importance Score is computed directly from observable outputs in a preliminary trial run on task-oriented examples and is not defined in terms of the final accuracy gains or reduced to a fitted parameter by any equations in the described method. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify load-bearing steps, and the reported improvements (including up to 25% on specific MMLU subjects) are presented as measured results rather than derived predictions that loop back to inputs by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework introduces one new metric (Agent Importance Score) whose exact formula is not given in the abstract and therefore counts as an ad-hoc construction for this paper. No free parameters or invented physical entities are described.

axioms (1)

Domain assumption: A preliminary trial run on a small number of examples produces an importance score that generalizes to the full task distribution.
This assumption is required for the team-optimization stage to be useful; it is invoked when the paper states that agents are selected according to their contributions in a preliminary trial.

pith-pipeline@v0.9.0 · 5754 in / 1393 out tokens · 36956 ms · 2026-05-21T20:59:48.317832+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
cs.AI 2026-05 unverdicted novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
cs.AI 2026-04 unverdicted novelty 7.0

OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
Self-Evolving Multi-Agent Systems via Decentralized Memory
cs.MA 2026-05 unverdicted novelty 6.0

DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
cs.AI 2026-05 unverdicted novelty 6.0

AgentCo-op retrieves and assembles existing agents and tools into interoperable workflows for open-world scientific tasks, showing effectiveness in genomics case studies and competitive benchmark results with lower costs.
Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling
cs.AI 2026-05 unverdicted novelty 6.0

SIGMA builds a signed relational graph among LLM agents and uses conflict-aware message passing plus weighted aggregation to produce more consistent predictions than prior cooperative-assumption baselines.
Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains
cs.CL 2026-05 unverdicted novelty 6.0

TopoPrior learns transferable topology priors offline from multi-domain reference graphs using a conditional variational graph model and adversarial adaptation to initialize collaboration structures for multi-agent LL...
RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems
cs.CL 2026-04 unverdicted novelty 6.0

RoadMapper multi-agent system improves LLM roadmap generation for complex research problems by over 8% on average and reduces required human expert time by 84%.
QuantClaw: Precision Where It Matters for OpenClaw
cs.AI 2026-04 unverdicted novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
cs.AI 2026-04 unverdicted novelty 6.0

SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
cs.CL 2026-04 unverdicted novelty 6.0

Mem²Evolve integrates experience memory and asset memory so that LLM agents expand capabilities through guided asset creation and distill new experience from those assets, yielding 6-18% gains over single-path evoluti...
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
cs.LG 2026-04 conditional novelty 6.0

LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
cs.CL 2025-10 unverdicted novelty 6.0

GTD generates task-adaptive, sparse communication topologies for multi-LLM agents via guided iterative graph diffusion steered by a proxy model predicting accuracy, utility, and cost.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI 2025-07 unverdicted novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
Scaling Synthetic Data Creation with 1,000,000,000 Personas
cs.CL 2024-06 unverdicted novelty 6.0

A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models
cs.LG 2026-05 unverdicted novelty 5.0

DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.
Reinforced Collaboration in Multi-Agent Flow Networks
cs.LG 2026-05 unverdicted novelty 5.0

MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
cs.AI 2026-04 conditional novelty 5.0

A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care
cs.AI 2026-05 unverdicted novelty 4.0

Interactive LLM dialogue raised residents' hard-case diagnostic correctness from 0.589 to 0.734 and produced medium effect sizes in a blinded study of seven physicians on 52 emergency cases.
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
cs.CL 2024-01 unverdicted novelty 4.0

The paper surveys LLM-based multi-agent systems, covering simulated domains, agent profiling and communication, mechanisms for capacity growth, and common benchmarks.
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
cs.AI 2024-04 unverdicted novelty 3.0

A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 22 Pith papers · 8 internal anchors

[1]

Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs

Pranjal Aggarwal, Aman Madaan, Yiming Yang, and Mausam. Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, Singapore, December

work page 2023
[2]

doi: 10.18653/v1/2023.emnlp-main.761

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.761. URL https://aclanthology.org/2023.emnlp-main.761. Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pp. 173–186, USA,

work page doi:10.18653/v1/2023.emnlp-main.761 2023
[3]

Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

URL https://openreview.net/forum?id=FQepisCUWu. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In Proceedings of The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=ktrw68Cmu9c. Guangyao Chen, Siwei Dong, Yu...

work page arXiv 2024
[4]

URL https://openreview.net/forum?id=EHg5GDnyq1. Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, and Jun Wang. Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoni...

work page arXiv
[5]

Self-collaboration Code Generation via ChatGPT

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration Code Generation via ChatGPT. arXiv preprint arXiv:2304.07590,

work page arXiv
[6]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Chatllm network: More brains, more intelligence

Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intelligence. arXiv preprint arXiv:2304.12998,

work page arXiv
[8]

Surrealdriver: Designing llm-powered generative driver agent framework based on human drivers’ driving-thinking data,

Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.792. Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprinit arX...

work page arXiv 2023
[9]

CAMEL: Communicative agents for ”mind” exploration of large language model society

12 Published as a conference paper at COLM 2024 Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for ”mind” exploration of large language model society. In Proceedings of Thirty-seventh Conference on Neural Information Processing Systems,

work page 2024
[10]

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

URL https://openreview.net/forum?id=3IyL2XWDkG. Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

An efficient and truthful pricing mechanism for team formation in crowdsourcing markets

Qing Liu, Tie Luo, Ruiming Tang, and Stéphane Bressan. An efficient and truthful pricing mechanism for team formation in crowdsourcing markets. In 2015 IEEE International Conference on Communications (ICC), pp. 567–572,

work page 2015
[12]

URL https://openreview.net/forum?id=zAdUB0aCTQ. Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents. arXiv preprint arXiv:2308.05960,

work page arXiv
[13]

Laser: Llm agent with state-space exploration for web navigation

Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. Laser: Llm agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172,

work page arXiv
[14]

GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems

Nathalia Nascimento, Paulo Alencar, and Donald Cowan. GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems. arXiv preprint arXiv:2308.10435,

work page arXiv
[15]

URL https://openreview.net/forum?id=mqVgBbNCm9. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

ChatDev: Communicative Agents for Software Development

Association for Computational Linguistics. URL https://www.aclweb.org/anthology/ W18-6319. Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023a. 13 Published as a conference paper at C...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563,

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563,

work page arXiv
[18]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arxiv:2009.10297,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[19]

TPTU: Task planning and tool usage of large language model-based AI agents

Jingqing Ruan, YiHong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, du qing, shi shiwei, Hangyu Mao, Xingyu Zeng, and Rui Zhao. TPTU: Task planning and tool usage of large language model-based AI agents. In NeurIPS 2023 Foundation Models for Decision Making Workshop,

work page 2023
[20]

Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954, 2024

URL https://openreview.net/forum?id=vAElhFcKW6. Mirac Suzgun and Adam Tauman Kalai. Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. arXiv preprint arXiv:2401.12954,

work page arXiv
[21]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Second Agent Learning in Open-Endedness Workshop , 2023a. URL https://openreview.net/forum?id=pAMNKGwja6. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan N...

work page arXiv
[22]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

URL https://openreview. net/forum?id= VjQlMeSB J. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Listwise approach to learning to rank: Theory and algorithm

Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 1192–1199, New York, NY, USA,

work page 2024
[24]

ISBN 9781605582054

Association for Computing Machinery. ISBN 9781605582054. Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 , pp. 7572–7590, Si...

work page 2023
[25]

doi: 10.18653/v1/2023.findings-emnlp.508

Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.508. URL https://aclanthology.org/2023.findings-emnlp.508. Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference ...

work page doi:10.18653/v1/2023.findings-emnlp.508 2023
[26]

Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu

URL https: //openreview.net/forum?id=WE vluYUL-X. Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu. Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in N...

work page 2023
[27]

doi: 10.18653/v1/2023.emnlp-main.936

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.936. URL https://aclanthology.org/2023.emnlp-main

work page doi:10.18653/v1/2023.emnlp-main.936 2023
[28]

Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S

Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. Nisp: Pruning networks using neuron importance score propagation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9194–9203,

work page 2018
[29]

Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View

Jintian Zhang, Xin Xu, and Shumin Deng. Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View. arXiv preprint arXiv:2310.02124, 2023a. Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and Deeper LLM Networks are Fairer LLM Evaluators. arXiv preprint arXiv:2308.01862, 2023b. Yifa...

work page arXiv
[30]

Progressive-hint prompting improves reasoning in large language models

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797,

work page arXiv
[31]

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

LLM As DBA

Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM As DBA. arXiv preprint arXiv:2308.05481,

work page arXiv
[33]

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823,

work page arXiv
[35]

In fact, the performance could be further leveraged by task-specific methods like CodeBLEU (Ren et al.,

A Discussion & Limitation In experiments, we view code generation tasks as representative of open-ended generation tasks and adopt BLEU to decide whether two answers are consistent in early stopping mechanism in Section 3.3.2. In fact, the performance could be further leveraged by task-specific methods like...

work page 2024
[36]

or CodeT (Chen et al., 2023a). For practical usage, the agent-evaluation metrics could cooperate with human annotation to give a more precise evaluation result on individual contributions of agents, mainly when facing data scarcity problems. Furthermore, we simply incorporate agent selection into DyLAN with agent team reformation, as a primary step towa...

work page 2024
[37]

We follow the answer extraction method from the original paper (Hendrycks et al., 2021b)

from the MATH dataset (Hendrycks et al., 2021b) and Complex CoT from PHP (Zheng et al., 2023). We follow the answer extraction method from the original paper (Hendrycks et al., 2021b). We construct DyLAN with 4 agents assigned no specific roles and let agents interact for at most T = 4 rounds under the T-FFN formulation. We reported the classification acc...

work page 2023
[38]

nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1

To ensure the participation of each agent, the early-stopping mechanism functions at the third layer and later (t ≥ 3). We use the BLEU score in the early-stopping mechanism. We calculate BLEU with sacreBLEU (Post, 2018). For answer post-processing, we store all unit tests from the unit tester (if one exists in the system) and randomly select the final output from t...

work page 2018
[39]

We sample the subsets with the proportions of 1% and 10% of the original dataset

and the CG task. We sample the subsets with the proportions of 1% and 10% of the original dataset. Agent Importance Score for agent selection is averaged on the subsets, and the selected team is tested on the whole dataset. We use random selection and human prior selection as baselines. The latter is simulated by GPT-4 prompted by the task and age...


Due to budget limits, we directly reuse the performance reported in the papers of the baselines, including LATS (Zhou et al., 2023), Reflexion (Shinn et al., 2023), Meta-GPT (Hong et al., 2024), and AgentVerse (Chen et al., 2024), and estimate the cost in terms of the number of API calls. DyLAN is also constructed from agents that are optimized based on GPT-...


Green annotation denotes the fields related to the role from the human perspective, which are annotated manually

Role: Doctor; Programmer. Top 10 Subjects: high school computer science; high school physics; clinical knowledge; electrical engineering; college biology; high school government and politics; professional medicine; college computer science; nutrition; college chemistry; high school US history; high school mathematics; human aging; formal logic; anatomy; abstract algebra; high school b...


From the experimental results, we found that DyLAN is more stable across different hyper-parameters. Experiments show that temperature greatly influences arithmetic reasoning and code generation tasks. In Figure 4, we found that most baseline methods have significant performance drops when temperature increases, but DyLAN shows strong robustness to various tem-...

We also tested different ranking methods for agent team reformation of DyLAN on the GR task. We tested a listwise ranker with our own prompts, the pairwise GPT ranker from the original LLM-Blender (Jiang et al., 2023), Elo Score from TrueSkill (Herbrich et al.,

also implemented with a pairwise ranker, and a pairwise ranker with the Sliding Window algorithm (Qin et al., 2023). In Table 14, we show that different ranking methods have a relatively low impact on performance, probably because of the strong discrimination ability of GPT-3.5, but pairwise ranking methods always consume higher computational cost. Thus, we chose a...
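To make the cost comparison concrete, the sketch below counts comparator calls for a sliding-window style pairwise ranker, in the spirit of Qin et al. (2023) as we understand it. The `prefer` callback stands in for one LLM pairwise comparison and is hypothetical, as are the function name and the toy quality values; this is not the code used in the experiments.

```python
def sliding_window_top_k(items, prefer, k):
    """Bring the k best items to the front using adjacent pairwise comparisons.

    prefer(a, b) returns True when a should rank above b; each call models
    one pairwise ranking query. Returns (top_k_items, number_of_calls).
    """
    ranked = list(items)
    calls = 0
    n = len(ranked)
    for p in range(min(k, n)):
        # One backward pass bubbles the best remaining item up to position p.
        for j in range(n - 1, p, -1):
            calls += 1
            if prefer(ranked[j], ranked[j - 1]):
                ranked[j], ranked[j - 1] = ranked[j - 1], ranked[j]
    return ranked[:k], calls


quality = {"a": 1, "b": 4, "c": 2, "d": 3}  # toy ground-truth quality
top, calls = sliding_window_top_k(
    ["a", "b", "c", "d"], lambda x, y: quality[x] > quality[y], k=2
)
print(top, calls)  # ['b', 'd'] 5
```

In this sketch, selecting the top k of N candidates needs roughly k·N pairwise calls, whereas a listwise ranker would handle the same candidates in a single prompt.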


to the combination set:

$$S_i(R) = \frac{1}{|C||R|} \sum_{T \in C} \bigl(\mathrm{Performance}(T \cup \{i\}) - \mathrm{Performance}(T)\bigr), \tag{13}$$

where $R$ is the set of agents in the system, $C$ is the combination set of $R \setminus \{i\}$, $i \in R$, and Performance denotes the overall performance of the system on the current task, e.g., classification accuracy or Pass@1. The metric requires ground truth and multi-pass re...
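As a worked illustration of Eq. (13), the following sketch enumerates every subset T of R \ {i} and averages the marginal gains with the 1/(|C||R|) normalization given above. The `performance` callback is a hypothetical stand-in for running the system with a given team on labeled data; the additive toy function is used only so that the result is easy to verify by hand.

```python
from itertools import combinations


def shapley_style_score(i, agents, performance):
    """Compute S_i(R) as written in Eq. (13).

    agents: the full agent set R; performance: maps a frozenset of agents to
    a number (e.g. accuracy). Enumerates all 2^(|R|-1) subsets of R \\ {i}.
    """
    others = [a for a in agents if a != i]
    subsets = [
        frozenset(c)
        for r in range(len(others) + 1)
        for c in combinations(others, r)
    ]
    total = sum(performance(T | {i}) - performance(T) for T in subsets)
    return total / (len(subsets) * len(agents))


# Toy additive performance: each agent contributes a fixed amount.
weights = {"A": 0.3, "B": 0.1, "C": 0.2}
perf = lambda team: sum(weights[a] for a in team)
scores = {a: shapley_style_score(a, list(weights), perf) for a in weights}
print(scores)  # each score is about weights[a] / 3 for this additive toy case
```

The number of subsets grows exponentially with the team size, and every subset requires a full evaluation pass with ground truth.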


def modp(n: int, p: int):
    """Return 2^n modulo p. Examples: ······ """

Please follow the function signature and complete the implementation of the Python function. Reply the completion in a code block.

① Query

My answer is:

```python
res = 1
n = n % (p-1)
while n > 0:
    if n % 2 == 1:
        ......
```

(incorrect) Pr...


```python
if p == 1:  # edge case when p equals 1
    return 0
res = 1
base = 2 % p
while n > 0:
    ······
```

(correct)

It will be efficient to use `numpy` to solve the modulo problem.

```python
res = np.power(2, n)
return res
```

(incorrect)

Algorithm Developer: Apologize for the error. I add the `import` command into Python script below. ``...
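For reference, a complete square-and-multiply implementation consistent with the fragment marked correct in this case might look like the following. The loop body is elided in the figure, so the completion here is an assumption rather than the agent's actual output.

```python
def modp(n: int, p: int) -> int:
    """Return 2^n modulo p using binary exponentiation."""
    if p == 1:  # edge case: everything is 0 modulo 1
        return 0
    res = 1
    base = 2 % p
    while n > 0:
        if n % 2 == 1:  # current bit of the exponent is set
            res = (res * base) % p
        base = (base * base) % p  # square the base for the next bit
        n //= 2
    return res


print(modp(3, 5))       # 3
print(modp(1101, 101))  # 2
print(modp(100, 101))   # 1
```

Reducing the exponent with `n % (p-1)`, as in the first answer, relies on Fermat's little theorem and is only valid when p is prime. For example, with n = 3 and p = 4 such a reduction would give 1 instead of 0. The built-in `pow(2, n, p)` computes the same result.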


Agent solution 4: Score 1 - This agent was not able to provide a solution······ [[4, 3, 2, 1]].

Doctor ④ Agent Importance Scores

Figure 6: A case of DyLAN solving a general reasoning task. Different agents are recruited to give and refine solutions. The result is incorrect at the first time step but correct at the second time step. It includes the ratings ...


We annotate the task where each prompt is used in parentheses, along with the source of each prompt template. We omit the in-context examples of AR tasks from the original dataset of MATH (Hendrycks et al., 2021b) and PHP (Zheng et al., 2023), and WebShop from ReAct (Yao et al., 2023).

Prompt Content Source MMLU Instruction (GR) Here is the question: {questio...

Pranjal Aggarwal, Aman Madaan, Yiming Yang, and Mausam. Let’s sample step by step: Adaptive-consistency for efficient reasoning and coding with LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12375–12396, Singapore, December

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.761. URL https://aclanthology.org/2023.emnlp-main.761. Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pp. 173–186, USA,

URL https://openreview.net/forum?id=FQepisCUWu. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In Proceedings of The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=ktrw68Cmu9c. Guangyao Chen, Siwei Dong, Yu...

URL https://openreview.net/forum?id=EHg5GDnyq1. Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, Zheng Xiong, Yicheng Luo, Jianye Hao, Kun Shao, Haitham Bou-Ammar, and Jun Wang. Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoni...

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration Code Generation via ChatGPT. arXiv preprint arXiv:2304.07590,

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325,

Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intelligence. arXiv preprint arXiv:2304.12998,

Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.792. Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, and Jiangtao Gong. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv preprint arX...

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society. In Proceedings of Thirty-seventh Conference on Neural Information Processing Systems,

URL https://openreview.net/forum?id=3IyL2XWDkG. Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118,

Qing Liu, Tie Luo, Ruiming Tang, and Stéphane Bressan. An efficient and truthful pricing mechanism for team formation in crowdsourcing markets. In 2015 IEEE International Conference on Communications (ICC), pp. 567–572,

URL https://openreview.net/forum?id=zAdUB0aCTQ. Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents. arXiv preprint arXiv:2308.05960,

Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, and Dong Yu. Laser: Llm agent with state-space exploration for web navigation. arXiv preprint arXiv:2309.08172,

Nathalia Nascimento, Paulo Alencar, and Donald Cowan. GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems. arXiv preprint arXiv:2308.10435,

URL https://openreview.net/forum?id=mqVgBbNCm9. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319. Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023a. 13 Published as a conference paper at C...

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563,

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, M. Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297,

Jingqing Ruan, YiHong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, du qing, shi shiwei, Hangyu Mao, Xingyu Zeng, and Rui Zhao. TPTU: Task planning and tool usage of large language model-based AI agents. In NeurIPS 2023 Foundation Models for Decision Making Workshop,

URL https://openreview.net/forum?id=vAElhFcKW6. Mirac Suzgun and Adam Tauman Kalai. Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding. arXiv preprint arXiv:2401.12954,

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Second Agent Learning in Open-Endedness Workshop, 2023a. URL https://openreview.net/forum?id=pAMNKGwja6. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan N...

URL https://openreview.net/forum?id=_VjQlMeSB_J. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155,

Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 1192–1199, New York, NY, USA,

Association for Computing Machinery. ISBN 9781605582054. Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7572–7590, Si...

Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.508. URL https://aclanthology.org/2023.findings-emnlp.508. Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference ...

URL https://openreview.net/forum?id=WE_vluYUL-X. Zhangyue Yin, Qiushi Sun, Cheng Chang, Qipeng Guo, Junqi Dai, Xuanjing Huang, and Xipeng Qiu. Exchange-of-thought: Enhancing large language model capabilities through cross-model communication. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in N...

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.936. URL https://aclanthology.org/2023.emnlp-main

Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. Nisp: Pruning networks using neuron importance score propagation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9194–9203,

Jintian Zhang, Xin Xu, and Shumin Deng. Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View. arXiv preprint arXiv:2310.02124, 2023a. Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. Wider and Deeper LLM Networks are Fairer LLM Evaluators. arXiv preprint arXiv:2308.01862, 2023b. Yifa...

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797,

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406,

Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM As DBA. arXiv preprint arXiv:2308.05481,

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144,

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823,
