The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

Alex Chao; Mason Sawtell; Sandi Besen; Tula Masterman

arxiv: 2404.11584 · v1 · pith:P5APWQ5Hnew · submitted 2024-04-17 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-16 23:14 UTC · model grok-4.3


keywords AI agents · agent architectures · reasoning · planning · tool calling · multi-agent systems · single-agent systems · leadership in agents


The pith

AI agent architectures achieve complex goals through specific choices in leadership, communication styles, and planning-execution-reflection phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines current AI agent systems designed for reasoning, planning, and using tools to meet complex objectives. It compares single-agent and multi-agent setups to identify common design patterns and differences in how they perform. The authors point out important themes for picking an architecture, the effects of having a leader in agent teams, how agents exchange information, and the stages of planning ahead, carrying out actions, and reviewing results. These insights matter because they show what makes some agent systems more dependable than others when tackling tasks that need multiple steps. Understanding these elements helps guide the creation of better AI assistants that can handle real problems with less human intervention.

Core claim

The survey provides overviews of single-agent and multi-agent architectures for AI agents. It identifies key patterns in design choices and evaluates their impact on goal accomplishment. The central contribution is outlining themes for architecture selection, the role of leadership in agent systems, styles of agent communication, and the essential phases of planning, execution, and reflection that support robust performance.

What carries the argument

The identification and analysis of leadership structures, communication styles, and the three-phase cycle of planning, execution, and reflection as the core mechanisms that enable effective reasoning and tool use in agent architectures.
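The three-phase cycle named above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the survey's or any framework's implementation: `AgentState`, `run_agent`, the `tool:argument` step format, and the toy planner, tool, and reflector are all hypothetical names invented for this example, and a real agent would call a language model where the toy functions use fixed rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentState:
    """Hypothetical container for what the agent knows so far."""
    goal: str
    plan: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def run_agent(
    goal: str,
    planner: Callable[[AgentState], List[str]],
    tools: Dict[str, Callable[[str], str]],
    reflector: Callable[[AgentState], Tuple[bool, str]],
    max_rounds: int = 3,
) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(max_rounds):
        # Planning: produce steps in the assumed "tool:argument" format.
        state.plan = planner(state)
        # Execution: run each step through the named tool.
        state.results = []
        for step in state.plan:
            tool_name, _, arg = step.partition(":")
            tool = tools.get(tool_name)
            state.results.append(tool(arg) if tool else f"unknown tool: {tool_name}")
        # Reflection: judge the results and leave a note for the next plan.
        done, note = reflector(state)
        state.notes.append(note)
        if done:
            break
    return state


def toy_planner(state: AgentState) -> List[str]:
    # First round: a naive plan that names a tool that does not exist.
    if not state.notes:
        return ["lookup:2,3"]
    # Later rounds: a revised plan informed by the reflection note.
    return ["add:2,3"]


def add_tool(arg: str) -> str:
    a, b = (int(x) for x in arg.split(","))
    return str(a + b)


def toy_reflector(state: AgentState) -> Tuple[bool, str]:
    if any(r.startswith("unknown tool") for r in state.results):
        return False, "previous plan used an unavailable tool"
    return True, "all steps succeeded"


final = run_agent("add 2 and 3", toy_planner, {"add": add_tool}, toy_reflector)
print(final.results, final.notes)
```

In this toy run the first plan fails, the reflection note triggers a revised plan, and the second round succeeds, which is the kind of recovery the reflection phase is meant to allow.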

If this is right

Multi-agent systems benefit from defined leadership to coordinate efforts effectively.
Communication styles between agents affect collaboration efficiency on shared goals.
Explicit phases for planning, execution, and reflection lead to more reliable outcomes in complex tasks.
Designers should weigh these factors when choosing between single-agent and multi-agent approaches.
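One way to picture the leadership and communication points above is a team in which every message passes through a designated leader. The sketch below is only an assumption-laden illustration: the `Message`, `Worker`, and `Leader` classes and the two toy skills are hypothetical and do not come from the survey or from any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


class Worker:
    """Hypothetical worker agent that applies one skill to a request."""

    def __init__(self, name: str, skill: Callable[[str], str]) -> None:
        self.name = name
        self.skill = skill

    def handle(self, msg: Message) -> Message:
        # Reply goes back to whoever sent the request.
        return Message(self.name, msg.sender, self.skill(msg.content))


class Leader:
    """Hypothetical leader that assigns subtasks and collects replies."""

    def __init__(self, name: str, workers: Dict[str, Worker]) -> None:
        self.name = name
        self.workers = workers
        self.log: List[Message] = []

    def run(self, assignments: Dict[str, str]) -> Dict[str, str]:
        results: Dict[str, str] = {}
        for worker_name, subtask in assignments.items():
            request = Message(self.name, worker_name, subtask)
            reply = self.workers[worker_name].handle(request)
            # Every exchange is recorded, so the leader sees all traffic.
            self.log.extend([request, reply])
            results[worker_name] = reply.content
        return results


workers = {
    "researcher": Worker("researcher", lambda task: f"notes on {task}"),
    "writer": Worker("writer", lambda task: task.upper()),
}
leader = Leader("lead", workers)
results = leader.run({"researcher": "agent memory", "writer": "draft summary"})
print(results)
```

Routing everything through the leader keeps coordination simple at the cost of making the leader a bottleneck; a leaderless design would instead let workers message each other directly.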

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These phases might be tested by measuring performance improvements when added to existing agent frameworks in specific domains like code generation or data analysis.
The survey's patterns could extend to hybrid human-AI agent teams where leadership roles shift dynamically.
Future surveys might track how these elements evolve with new model capabilities to see if the themes remain consistent.

Load-bearing premise

The selected AI agent implementations represent the broader landscape without significant bias in the authors' observations of their capabilities and limitations.

What would settle it

Demonstration of a high-performing AI agent system that succeeds at complex reasoning and planning tasks while lacking any leadership structure, specialized communication, or distinct planning-execution-reflection phases would challenge the survey's key themes.

Original abstract

This survey paper examines the recent advancements in AI agent implementations, with a focus on their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution capabilities. The primary objectives of this work are to a) communicate the current capabilities and limitations of existing AI agent implementations, b) share insights gained from our observations of these systems in action, and c) suggest important considerations for future developments in AI agent design. We achieve this by providing overviews of single-agent and multi-agent architectures, identifying key patterns and divergences in design choices, and evaluating their overall impact on accomplishing a provided goal. Our contribution outlines key themes when selecting an agentic architecture, the impact of leadership on agent systems, agent communication styles, and key phases for planning, execution, and reflection that enable robust AI agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a basic survey that groups existing AI agent examples around themes like leadership and planning phases but provides no method for how the examples were chosen.


This survey pulls together descriptions of single-agent and multi-agent systems that try to handle complex goals through reasoning, planning, and tool use. It flags patterns in leadership roles within teams of agents, styles of communication between them, and the sequence of planning, execution, and reflection steps that seem to matter for getting work done. The authors also note some design divergences and their effects on performance. That is the core of what the paper offers: a compiled set of observations rather than any new architecture or test results.

It does organize scattered recent implementations into a few readable categories, which can give a newcomer a faster way to see the main options without reading every new arXiv post. The focus stays practical, on how choices affect goal completion, and that matches what many people building these systems actually care about.

The main gap is the absence of any documented way the authors picked which systems to cover. There is no search protocol, no list of sources or keywords, no date range, and no inclusion rules. Without that, it is difficult to know whether the highlighted themes reflect the broader landscape or just the subset the authors happened to look at. The comments on capabilities and limitations stay high-level too, with little direct comparison to specific results from the cited papers.

This kind of overview is mainly useful for readers who are new to agent design and want a starting map before they dive into the primary literature. People already working in the area will probably find the patterns familiar. The paper shows clear engagement with the existing work and tries to draw practical lessons, so it is coherent on its own terms. I would send it to peer review, but only after asking the authors to add a clear methods section on paper selection and to ground their observations in more concrete examples from the reviewed systems.

Referee Report

1 major / 1 minor

Summary. This survey examines recent advancements in AI agent implementations, with a focus on their capabilities for complex goals involving reasoning, planning, and tool execution. It provides overviews of single-agent and multi-agent architectures, identifies patterns and divergences in design choices, evaluates their impact on goal accomplishment, and outlines key themes for selecting agentic architectures, the impact of leadership, agent communication styles, and phases for planning, execution, and reflection.

Significance. If the reviewed implementations are representative, the paper offers a useful synthesis of design patterns and practical considerations that could inform the development of more robust AI agent systems. It highlights actionable elements such as leadership structures and reflection phases, which may help practitioners navigate trade-offs in agent design. The descriptive nature limits its novelty but could still serve as a reference for the field if the coverage is comprehensive.

major comments (1)

[Introduction] The manuscript provides no documented search protocol, keyword list, database sources, date range, or inclusion/exclusion criteria for selecting the AI agent implementations reviewed. This is load-bearing for the central claims, as the outlined key themes, insights on leadership and communication, and evaluations of capabilities/limitations depend on the surveyed systems being a fair sample of the landscape rather than a selective subset.

minor comments (1)

[Abstract] The abstract would benefit from specifying the approximate number of papers or architectures reviewed and the time period covered to immediately convey the scope of the survey.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestion regarding methodological transparency. We agree that explicitly documenting the literature search process strengthens a survey paper and supports the validity of its synthesized themes. We will revise the manuscript accordingly by adding a dedicated methodology subsection.

Point-by-point responses

Referee: [Introduction] The manuscript provides no documented search protocol, keyword list, database sources, date range, or inclusion/exclusion criteria for selecting the AI agent implementations reviewed. This is load-bearing for the central claims, as the outlined key themes, insights on leadership and communication, and evaluations of capabilities/limitations depend on the surveyed systems being a fair sample of the landscape rather than a selective subset.

Authors: We acknowledge the validity of this point. The current version of the manuscript does not contain an explicit search protocol, which is a limitation for a survey claiming to map the landscape. In the revised manuscript we will insert a new subsection (e.g., “Literature Search and Selection Methodology”) immediately after the introduction. This subsection will specify: (1) primary sources (arXiv, Google Scholar, ACL Anthology, and selected workshop proceedings), (2) keyword combinations used (e.g., “LLM agent” OR “AI agent architecture” AND (“reasoning” OR “planning” OR “tool use” OR “reflection” OR “multi-agent”)), (3) date range (primarily January 2022–March 2024 to capture post-LLM developments), (4) inclusion criteria (papers that describe implemented agent architectures demonstrating at least one of reasoning, planning, tool calling, or multi-agent coordination), and (5) exclusion criteria (purely theoretical position papers, non-implemented frameworks, or prior surveys). We will also report the approximate number of papers initially retrieved and finally retained. This addition will clarify the scope and selection process, allowing readers to better evaluate the representativeness of the discussed systems and the resulting design patterns.

Revision: yes
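The selection protocol proposed in the simulated rebuttal could be made reproducible with a small screening filter. The sketch below is an assumption-heavy illustration rather than anything the authors ran: the `Paper` record, the hand-labeled `implemented` and `is_survey` flags, and the four candidate records are hypothetical, and the keyword query is read as (agent term) AND (capability term), which is only one possible parse of the quoted expression.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Paper:
    """Hypothetical candidate record; the two flags are hand-labeled."""
    title: str
    abstract: str
    published: date
    implemented: bool
    is_survey: bool


# Keyword groups taken from the quoted query.
AGENT_TERMS = ("llm agent", "ai agent architecture")
CAPABILITY_TERMS = ("reasoning", "planning", "tool use", "reflection", "multi-agent")
# Assumed bounds for "January 2022 to March 2024".
START, END = date(2022, 1, 1), date(2024, 3, 31)


def matches_query(paper: Paper) -> bool:
    text = f"{paper.title} {paper.abstract}".lower()
    has_agent = any(term in text for term in AGENT_TERMS)
    has_capability = any(term in text for term in CAPABILITY_TERMS)
    return has_agent and has_capability


def retained(paper: Paper) -> bool:
    in_range = START <= paper.published <= END
    # Inclusion: implemented systems. Exclusion: prior surveys.
    return matches_query(paper) and in_range and paper.implemented and not paper.is_survey


# Made-up records for illustration only, not real papers.
candidates: List[Paper] = [
    Paper("An LLM agent for planning", "We build a planner.", date(2023, 5, 1), True, False),
    Paper("A survey of LLM agent planning", "We review prior work.", date(2024, 2, 1), False, True),
    Paper("LLM agent with tool use", "An early prototype.", date(2021, 6, 1), True, False),
    Paper("Vision transformers", "Image classification.", date(2023, 1, 1), True, False),
]

kept = [paper for paper in candidates if retained(paper)]
print(f"retrieved={len(candidates)} retained={len(kept)}")
```

Reporting the retrieved and retained counts from such a filter would cover the referee's request for the number of papers at each stage.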

Circularity Check

0 steps flagged

No circularity: purely descriptive survey with no derivations or self-referential claims

full rationale

This is a survey paper that reviews external AI agent implementations, identifies patterns in architectures, and outlines themes based on cited works. It contains no equations, no fitted parameters, no predictions derived from its own data, and no self-citation chains that bear the central load. The contribution is observational, identifying patterns in external sources, so no claimed result reduces to the paper's own inputs by construction. The lack of an explicit search methodology affects representativeness but does not create circularity in any claimed result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, no free parameters, axioms, or invented entities are introduced; the content rests entirely on synthesis of previously published agent implementations.



Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FP-Agent: Fingerprinting AI Browsing Agents
cs.CR 2026-05 unverdicted novelty 7.0

Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
cs.AI 2026-03 unverdicted novelty 7.0

GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
cs.SE 2026-02 unverdicted novelty 7.0

Agent-Diff benchmarks LLM agents on enterprise API tasks using code execution and state-diff contracts to define success, evaluated on nine models across 224 tasks with code released.
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
cs.SE 2025-09 conditional novelty 7.0

Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering
cs.CL 2025-05 unverdicted novelty 7.0

A 7B Qwen-2.5 LLM trained with a new RL framework on only 9 ML tasks achieves performance comparable to much larger proprietary LLM agents at lower computational cost with cross-task generalization.
GRAFT: Graph-Tokenized LLMs for Tool Planning
cs.LG 2026-05 unverdicted novelty 6.0

GRAFT internalizes tool dependency graphs via dedicated special tokens in LLMs and applies on-policy context distillation to achieve higher exact sequence matching and dependency legality than prior external-graph methods.
Towards Security-Auditable LLM Agents: A Unified Graph Representation
cs.AI 2026-05 unverdicted novelty 6.0

Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
cs.AI 2026-04 unverdicted novelty 6.0

EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.
AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow
cs.LG 2026-04 unverdicted novelty 6.0

AutoSurrogate is a multi-agent LLM framework that autonomously constructs, tunes, and validates deep learning surrogates for subsurface flow from natural language, outperforming expert baselines on a 3D carbon storage task.
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
Evaluating Privilege Usage of Agents with Real-World Tools
cs.CR 2026-03 unverdicted novelty 6.0

GrantBox evaluates LLM agents using real-world tools and finds they remain vulnerable to sophisticated prompt injection attacks with an 84.80% average success rate.
DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow
cs.HC 2025-09 unverdicted novelty 6.0

DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module increases user comfort and reliance on AI agents for coordination tasks over time.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
cs.AI 2025-08 unverdicted novelty 6.0

BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowled...
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL 2025-06 conditional novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Against the Monolithic Wireless World Model: Why NextG Needs Composable and Agentic Intelligence
eess.SP 2026-05 unverdicted novelty 5.0

Wireless data lacks the self-contained tokenized substrate of text, so monolithic wireless world models are unsuitable for 6G; composable agentic systems using specialized components and explicit interfaces are the re...
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 5.0

Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
cs.RO 2026-04 unverdicted novelty 5.0

EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
cs.LG 2026-04 unverdicted novelty 5.0

AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...
Small Language Models are the Future of Agentic AI
cs.AI 2025-06 unverdicted novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
cs.AI 2024-11 unverdicted novelty 5.0

Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
Bridging the Gap on AI-Assisted Scientific Software Development Through Transparency and Traceability
cs.SE 2026-05 conditional novelty 4.0

Proposes guidance for responsible AI use in scientific software development under NQA-1 standards, illustrated with TMAP8 V&V cases to ensure accountability and auditability.
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 4.0

Agentic AI requires social theory as a structural prior in the proposed MASS framework to model emergent outcomes from agent interactions and influence.
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
cs.MA 2026-05 unverdicted novelty 4.0

Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.
Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
Large Language Model-Based Agents for Software Engineering: A Survey
cs.SE 2024-09 unverdicted novelty 4.0

A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 3.0

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
cs.CL 2025-03 accept novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
LLM-Powered AI Agent Systems and Their Applications in Industry
cs.AI 2025-05 unverdicted novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
cs.AI 2025-03 unverdicted novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
