pith. machine review for the scientific record.

arxiv: 2402.01680 · v2 · submitted 2024-01-21 · 💻 cs.CL · cs.AI · cs.MA

Recognition: 3 theorem links


Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 06:54 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA
keywords Large Language Models · Multi-Agent Systems · Survey · Autonomous Agents · Complex Problem-Solving · World Simulation · Benchmarks · Communication Protocols

The pith

Large language model based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey maps out how teams of large language models function as autonomous agents that collaborate on tasks too difficult for single models. It covers the environments these systems model, the ways agents are defined and exchange information, and the processes that let their abilities expand through interaction. A reader would care because such systems could unlock more reliable AI for planning, decision-making, and simulating complex real-world dynamics. The paper also gathers the main benchmarks and datasets so others can test and extend the work. It keeps an updated list of studies to track rapid developments in the area.

Core claim

LLM-based multi-agent systems build on the planning and reasoning strengths of individual large language models by coordinating multiple agents to solve complex problems and simulate environments, with advances driven by agent profiling, communication protocols, and mechanisms that increase agent capacities over time.

What carries the argument

The combination of agent profiling, inter-agent communication, and capacity-growth mechanisms that allow multiple LLMs to collaborate beyond single-agent limits.
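The three mechanisms named here can be made concrete with a toy sketch. This is illustrative code, not from the paper: the class names, the echo-style `act` method, and the round-robin message passing are all assumptions standing in for real LLM calls, but they show where profiling (a role prompt), communication (message exchange), and capacity growth (accumulated memory) each plug in.

```python
from dataclasses import dataclass, field

# Toy illustration (hypothetical, not the paper's system) of the survey's
# three mechanisms: agent profiling, inter-agent communication, and
# capacity growth through accumulated experience.

@dataclass
class Agent:
    name: str
    profile: str  # agent profiling: a role/persona the agent is conditioned on
    memory: list = field(default_factory=list)  # capacity growth: experience store

    def act(self, message: str) -> str:
        # A real system would prompt an LLM with profile + memory + message;
        # here we record the message and return a structured placeholder reply.
        self.memory.append(message)
        return f"[{self.name}/{self.profile}] reply to: {message}"

def debate_round(agents: list[Agent], task: str) -> list[str]:
    """Inter-agent communication: pass each agent's reply to the next agent."""
    msg = task
    transcript = []
    for agent in agents:
        msg = agent.act(msg)
        transcript.append(msg)
    return transcript

agents = [Agent("A", "planner"), Agent("B", "critic")]
log = debate_round(agents, "solve task X")
```

The point of the sketch is structural: each of the survey's three axes corresponds to one field or function, so swapping any of them (richer profiles, broadcast instead of round-robin, retrieval over memory) changes system behavior independently of the others.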

If this is right

  • Developers can draw on the compiled benchmarks to evaluate and compare new multi-agent designs directly.
  • Research attention is likely to shift toward refining communication and coordination to address the challenges identified.
  • These systems may support more detailed simulations of social, economic, or scientific scenarios that require sustained collaboration.
  • The structured breakdown of profiling and growth mechanisms gives a template for designing agents tailored to specific tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multi-agent LLM setups could extend to collaborative scientific discovery where specialized agents handle different stages of hypothesis testing.
  • The coordination challenges noted may require additional techniques for detecting and correcting collective errors among agents.
  • Hybrid combinations of these systems with non-LLM components might yield more stable performance in long-running simulations.

Load-bearing premise

The survey assumes that the papers and benchmarks it selects adequately represent the essential aspects and challenges of the field without major omissions or selection bias.

What would settle it

A new survey that documents substantial bodies of work on LLM multi-agent systems in domains, profiling methods, or benchmarks not covered here would show the overview is incomplete.

read the original abstract

Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This survey overviews LLM-based multi-agent systems, addressing domains and environments simulated, agent profiling and communication methods, mechanisms for capacity growth, and commonly used datasets/benchmarks. It maintains an open-source GitHub repository for ongoing updates on the literature.

Significance. If the coverage is balanced and representative, the survey would offer useful synthesis of progress in complex problem-solving and world simulation using LLM agents. The GitHub repository is a concrete strength for reproducibility and timeliness.

major comments (2)
  1. [Abstract/Introduction] The claim that 'LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation' is not accompanied by any search protocol, inclusion/exclusion criteria, date range, or quantitative summary of reviewed papers. This leaves the representativeness of the selected works (and thus the 'progress' narrative) unverifiable and vulnerable to selection bias.
  2. [Introduction] The survey's central questions on domains, profiling, communication, and capacity-growth mechanisms rest on an opaque curation step. Without explicit methodology, it is impossible to assess whether negative results, failed replications, or non-LLM multi-agent baselines were systematically considered or omitted.
minor comments (1)
  1. The abstract promises 'in-depth discussion' but the provided text does not yet show how the taxonomy or benchmark summaries are organized; ensure each major section explicitly maps back to the four guiding questions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. The comments correctly identify a lack of transparency in our literature curation process, which we agree weakens the verifiability of our claims about progress in the field. We will revise the manuscript to add an explicit methodology section and quantitative details, while preserving the survey's focus on LLM-based multi-agent systems.

read point-by-point responses
  1. Referee: [Abstract/Introduction] The claim that 'LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation' is not accompanied by any search protocol, inclusion/exclusion criteria, date range, or quantitative summary of reviewed papers. This leaves the representativeness of the selected works (and thus the 'progress' narrative) unverifiable and vulnerable to selection bias.

    Authors: We acknowledge that the current abstract and introduction do not document the curation process, leaving the selection open to questions of bias. In the revision we will insert a dedicated 'Survey Methodology' subsection (placed after the introduction's research questions) that specifies: search databases (arXiv, Google Scholar, major NLP/ML conferences), keywords and Boolean strings used, date range (primarily 2022–early 2024 to capture the post-ChatGPT surge), inclusion criteria (peer-reviewed or preprint works that implement or analyze systems with two or more LLM-based agents for problem-solving or simulation tasks), and exclusion criteria (purely single-agent LLM work or non-LLM multi-agent systems unless they serve as explicit baselines). We will also add a short quantitative summary (total papers reviewed, breakdown by the four main categories) and a PRISMA-style flow diagram in the appendix. These additions will make the 'considerable progress' claim traceable to the reviewed corpus. revision: yes

  2. Referee: [Introduction] The survey's central questions on domains, profiling, communication, and capacity-growth mechanisms rest on an opaque curation step. Without explicit methodology, it is impossible to assess whether negative results, failed replications, or non-LLM multi-agent baselines were systematically considered or omitted.

    Authors: We agree that the absence of a documented curation protocol makes it difficult to judge coverage of negative or contrasting results. The new methodology subsection will explicitly describe how papers were allocated to each of the four central questions and will note that we prioritized works that directly address LLM multi-agent architectures. In the revised 'Challenges' section we will add a paragraph discussing the scarcity of published negative results and failed replications in this nascent area, and we will include brief comparisons with non-LLM multi-agent baselines (e.g., classical MARL or rule-based systems) where they illuminate limitations of current LLM approaches. While a fully exhaustive systematic review of all negative findings is beyond the scope of a timely survey, the added text will make our selection criteria and acknowledged gaps transparent. revision: yes
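The inclusion/exclusion criteria proposed in these responses amount to a predicate over paper records, which can be sketched directly. This is a hypothetical illustration of the rebuttal's stated protocol: the field names, the date bounds, and the sample records are assumptions, not artifacts of the paper.

```python
# Hypothetical sketch of the curation filter the rebuttal describes:
# include 2022-2024 works with two or more LLM-based agents on
# problem-solving or simulation tasks; exclude single-agent or non-LLM
# multi-agent work unless it serves as an explicit baseline.
# Field names and sample data are illustrative.

def include(paper: dict) -> bool:
    in_range = 2022 <= paper["year"] <= 2024
    multi_llm = paper.get("llm_agents", 0) >= 2
    relevant = paper.get("task") in {"problem-solving", "simulation"}
    baseline = paper.get("is_baseline", False)
    return in_range and relevant and (multi_llm or baseline)

corpus = [
    {"year": 2023, "llm_agents": 3, "task": "simulation"},
    {"year": 2023, "llm_agents": 1, "task": "problem-solving"},  # single-agent: excluded
    {"year": 2021, "llm_agents": 2, "task": "problem-solving"},  # out of date range: excluded
    {"year": 2023, "llm_agents": 0, "task": "simulation", "is_baseline": True},  # baseline: kept
]
selected = [p for p in corpus if include(p)]
```

Writing the criteria as an executable predicate is one way to make the curation step the referee flags auditable: the selection rule, not just the selected corpus, becomes part of the record.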

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without derivations or self-referential reductions

full rationale

This paper is a literature survey with no equations, derivations, fitted parameters, or mathematical claims. Its statements about 'considerable progress' summarize reviewed works rather than deriving new results from prior outputs by the same authors. No load-bearing step reduces by construction to a self-citation, ansatz, or renamed input; the contribution is organizational and the selection of papers is presented as an external curation step, not a deductive one. The paper is self-contained as a review against the cited external benchmarks and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; its contribution rests on selection and organization of prior publications.

pith-pipeline@v0.9.0 · 5531 in / 977 out tokens · 32500 ms · 2026-05-12T06:54:35.947350+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  3. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 7.0

    Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...

  4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  5. Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

    cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

    LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

  6. TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

  7. TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

    cs.AI 2026-04 unverdicted novelty 7.0

    TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...

  8. Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows

    cs.DB 2026-04 unverdicted novelty 7.0

    A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.

  9. WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents

    cs.LG 2026-04 unverdicted novelty 7.0

    WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods ...

  10. SoK: Blockchain Agent-to-Agent Payments

    q-fin.GN 2026-04 unverdicted novelty 7.0

    The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.

  11. Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

    cs.AI 2026-04 conditional novelty 7.0

    NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.

  12. GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

    cs.AI 2026-03 unverdicted novelty 7.0

    GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.

  13. Automated Design of Agentic Systems

    cs.AI 2024-08 conditional novelty 7.0

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

  14. LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.

  15. CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

    cs.CL 2026-05 unverdicted novelty 6.0

    CANTANTE uses contrastive rollouts to attribute system rewards to individual agents, enabling better prompt optimization than prior methods on programming, math, and QA benchmarks.

  16. CHAL: Council of Hierarchical Agentic Language

    cs.AI 2026-05 unverdicted novelty 6.0

    CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.

  17. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 6.0

    Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...

  18. Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization

    cs.SE 2026-05 conditional novelty 6.0

    Deterministic orchestration matches LLM-controlled methods in COBOL-to-Python translation accuracy but improves worst-case robustness, reduces run-to-run variability, and cuts token consumption by up to 3.5 times.

  19. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  20. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.

  21. When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    CAFE detects positive distributional Jensen Gaps across five multi-agent LLM architectures on a banking-risk benchmark, showing that quality drops under semantic stress can coexist with statistically detectable antifr...

  22. Multi-Agent Empowerment and Emergence of Complex Behavior in Groups

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent extension of empowerment produces emergent group organizations in tendon-coupled agent pairs and controllable Vicsek flocks.

  23. Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

    cs.AI 2026-04 unverdicted novelty 6.0

    Mol-Debate applies multi-agent debate in an iterative loop with perspective orchestration to achieve state-of-the-art text-guided molecular design, scoring 59.82% exact match on ChEBI-20 and 50.52% weighted success on...

  24. ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts

    cs.CL 2026-04 unverdicted novelty 6.0

    ThreadSumm improves structured summarization of nested discourse threads by combining LLM-based aspect and content unit extraction with sentence ordering and Tree of Thoughts search for better coherence and opinion coverage.

  25. GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    GraphDC applies divide-and-conquer multi-agent LLM reasoning to graph algorithms by decomposing graphs into subgraphs for local agents and integrating via a master agent, outperforming direct methods especially on lar...

  26. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  27. AgentClick: A Skill-Based Human-in-the-Loop Review Layer for Terminal AI Agents

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentClick is a localhost npm server and skill-based plugin that connects terminal AI agents to a structured web UI for human review of plans, code execution, memory, and errors.

  28. Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Large language models display three universal scale-dependent regimes of behavior—stable, chaotic, and signal-dominated—driven by floating-point rounding errors that produce an avalanche effect in early layers.

  29. In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.

  30. Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

    cs.MA 2026-04 unverdicted novelty 6.0

    HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

  31. Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

  32. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  33. Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents

    physics.soc-ph 2026-05 unverdicted novelty 5.0

    LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.

  34. AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

    cs.AI 2026-04 unverdicted novelty 5.0

    AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.

  35. WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    WebMAC uses three specialized multi-agent modules to clarify test scenarios, partition them for adequacy, and generate executable scripts, yielding 30-60% higher success rates and 29% better efficiency than SOTA on fo...

  36. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

  37. Spec Kit Agents: Context-Grounded Agentic Workflows

    cs.SE 2026-04 unverdicted novelty 5.0

    A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.

  38. Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges

    cs.AI 2026-05 unverdicted novelty 4.0

    A literature survey finds foundation-model agents in industry are 75% at prototype stages with gains in human interaction and uncertainty handling but deficits in negotiation, plus limitations like hallucinations and latency.

  39. Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

    cs.AI 2026-04 unverdicted novelty 4.0

    A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.

  40. Agentic Microphysics: A Manifesto for Generative AI Safety

    cs.CY 2026-04 unverdicted novelty 4.0

    The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.

  41. EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

    cs.AI 2026-04 unverdicted novelty 4.0

    EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

  42. Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

    cs.CL 2026-03 unverdicted novelty 4.0

    Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.

  43. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

  44. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  45. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 43 Pith papers · 8 internal anchors

  1. [1]

    Evaluating multi-agent coordination abilities in large language models,

    [Agashe et al., 2023] Saaket Agashe, Yue Fan, and Xin Eric Wang. Evaluating multi-agent coordination abilities in large language models,

  2. [2]

    Arriaga, and Adam Tauman Kalai

    [Aher et al., 2023] Gati Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies,

  3. [3]

    Playing repeated games with large language models

    [Akata et al., 2023] Elif Akata, Lion Schulz, Julian Coda- Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models.arXiv preprint arXiv:2305.16867,

  4. [4]

    Rethinking the buyer’s in- spection paradox in information markets with language agents

    [Anonymous, 2023] Anonymous. Rethinking the buyer’s in- spection paradox in information markets with language agents. In Submitted to The Twelfth International Con- ference on Learning Representations,

  5. [5]

    [Chan et al., 2023] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu

    under review. [Chan et al., 2023] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evalua- tors through multi-agent debate,

  6. [6]

    In Proceedings of the 26th ACM Conference on Eco- nomics and Computation, pages 786–786

    [Chen et al., 2023a] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, B ¨orje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288,

  7. [7]

    Multi-agent consensus seeking via large language models

    [Chen et al., 2023b] Huaben Chen, Wenkang Ji, Lufeng Xu, and Shiyu Zhao. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151,

  8. [8]

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem

    [Chen et al., 2023c] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facil- itating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848 ,

  9. [9]

    Scalable multi- robot collaboration with large language models: Centralized or decentralized systems? arXiv preprint arXiv:2309.15943, 2023

    [Chen et al., 2023d] Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. Scalable multi- robot collaboration with large language models: Cen- tralized or decentralized systems? arXiv preprint arXiv:2309.15943,

  10. [10]

    Training Verifiers to Solve Math Word Problems

    [Cobbe et al., 2021] Karl Cobbe, Vineet Kosaraju, Moham- mad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word prob- lems. arXiv preprint arXiv:2110.14168,

  11. [11]

    Collaborating with lan- guage models for embodied reasoning

    [Dasgupta et al., 2023] Ishita Dasgupta, Christine Kaeser- Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. Collaborating with lan- guage models for embodied reasoning. arXiv preprint arXiv:2302.00763,

  12. [12]

    Multi-agent llm applica- tions — a review of current research, tools, and challenges

    [Dibia, 2023] Victor Dibia. Multi-agent llm applica- tions — a review of current research, tools, and challenges. https://newsletter.victordibia.com/p/ multi-agent-llm-applications-a-review,

  13. [13]

    Tenenbaum, and Igor Mordatch

    [Du et al., 2023] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving fac- tuality and reasoning in language models through multia- gent debate,

  14. [14]

    Can large language models serve as rational play- ers in game theory? a systematic analysis

    [Fan et al., 2023] Caoyun Fan, Jindou Chen, Yaohui Jin, and Hao He. Can large language models serve as rational play- ers in game theory? a systematic analysis. arXiv preprint arXiv:2312.05488,

  15. [15]

    Doyne Farmer and Robert L

    [Farmer and Axtell, 2022] J. Doyne Farmer and Robert L. Axtell. Agent-Based Modeling in Economics and Finance: Past, Present, and Future. INET Oxford Working Papers 2022-10, Institute for New Economic Thinking at the Ox- ford Martin School, University of Oxford, June

  16. [16]

    S3: Social-network simulation system with large language model-empowered agents

    [Gao et al., 2023a] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S 3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984,

  17. [17]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    [Gao et al., 2023b] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997,

[Geva et al., 2021] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, 2021.

[Ghaffarzadegan et al., 2023] Navid Ghaffarzadegan, Aritra Majumdar, Ross Williams, and Niyousha Hosseinichimeh. Generative agent-based modeling: Unveiling social system dynamics through coupling mechanistic models with generative artificial intelligence. arXiv preprint arXiv:2309.11456, 2023.

[Gong et al., 2023] Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, et al. Mindagent: Emergent gaming interaction. arXiv preprint arXiv:2309.09971, 2023.

[Guo et al., 2023] Taicheng Guo, Kehan Guo, Zhengwen Liang, Zhichun Guo, Nitesh V Chawla, Olaf Wiest, Xiangliang Zhang, et al. What indeed can gpt models do in chemistry? a comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365, 2023.

[Hendrycks et al., 2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[Hong et al., 2023] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.

[Horton, 2023] John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023.

[Hua et al., 2023] Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars, 2023.

[Huang et al., 2023b] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.

[Kaiya et al., 2023] Zhao Kaiya, Michelangelo Naim, Jovana Kondic, Manuel Cortes, Jiaxin Ge, Shuying Luo, Guangyu Robert Yang, and Andrew Ahn. Lyfe agents: Generative agents for low-cost real-time social interactions. arXiv preprint arXiv:2310.02172, 2023.

[Khot et al., 2023] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks, 2023.

[Kovač et al., 2023] Grgur Kovač, Rémy Portelas, Peter Ford Dominey, and Pierre-Yves Oudeyer. The socialai school: Insights from developmental psychology towards artificial socio-cultural agents. arXiv preprint arXiv:2307.07871, 2023.

[Lewis et al., 2021] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021.

[Lex and Schedl, 2022] Elisabeth Lex and Markus Schedl. Psychology-informed recommender systems: A human-centric perspective on recommender systems. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, CHIIR '22, page 367–368, New York, NY, USA, 2022. Association for Computing Machinery.

[Li et al., 2023a] Chao Li, Xing Su, Chao Fan, Haoying Han, Cong Xue, and Chunmo Zheng. Quantifying the impact of large language models on collective opinion dynamics. arXiv preprint arXiv:2308.03313, 2023.

[Li et al., 2023b] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. arXiv preprint arXiv:2303.17760, 2023.

[Li et al., 2023f] Siyu Li, Jin Yang, and Kui Zhao. Are you in a masquerade? exploring the behavior and impact of large language model driven social bots in online social networks. arXiv preprint arXiv:2307.10337, 2023.

[Li et al., 2023h] Yuan Li, Yixuan Zhang, and Lichao Sun. Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500, 2023.

[Liang et al., 2023] Zhenwen Liang, Wenhao Yu, Tanmay Rajpurohit, Peter Clark, Xiangliang Zhang, and Ashwin Kalyan. Let gpt be a math tutor: Teaching math word problem solvers with customized exercise generation. arXiv preprint arXiv:2305.14386, 2023.

[Light et al., 2023b] Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. From text to tactic: Evaluating llms playing the game of avalon. arXiv preprint arXiv:2310.05036, 2023.

[Liu et al., 2023] Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023.

[Ma et al., 2023] Zilin Ma, Yiyang Mei, and Zhaoyuan Su. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. arXiv preprint arXiv:2307.15810, 2023.

[Mandi et al., 2023] Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.

[Mao et al., 2023] Shaoguang Mao, Yuzhe Cai, Yan Xia, Wenshan Wu, Xun Wang, Fengyi Wang, Tao Ge, and Furu Wei. Alympics: Language agents meet game theory. arXiv preprint arXiv:2311.03220, 2023.

[Moura, 2023] João Moura. Crewai. https://github.com/joaomdmoura/crewAI, 2023.

[Mukobi et al., 2023] Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, and Jesse Clifton. Welfare diplomacy: Benchmarking language model cooperation. arXiv preprint arXiv:2310.08901, 2023.

[Nascimento et al., 2023] Nathalia Nascimento, Paulo Alencar, and Donald Cowan. Self-adaptive large language model (llm)-based multiagent systems. In 2023 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), pages 104–109. IEEE, 2023.

[Park et al., 2022] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–18, 2022.

[Park et al., 2023] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.

[Qian et al., 2023] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development, 2023.

[Ruan et al., 2023] Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Ziyue Li, Xingyu Zeng, and Rui Zhao. Tptu: Large language model-based ai agents for task planning and tool usage, 2023.

[Russell and Norvig, 2009] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, USA, 3rd edition, 2009.

[Shinn et al., 2023] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

[Sumers et al., 2023] Theodore R Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L Griffiths. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427, 2023.

[Tang et al., 2023] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2023.

[Wang et al., 2021] Zijie J. Wang, Dongjin Choi, Shenyu Xu, and Diyi Yang. Putting humans in the natural language processing loop: A survey, 2021.

[Wang et al., 2023c] Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon's game of thoughts: Battle against deception through recursive contemplation. arXiv preprint arXiv:2310.01320, 2023.

[Wei et al., 2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[Weng, 2023] Lilian Weng. Llm powered autonomous agents. https://lilianweng.github.io/posts/2023-06-23-agent/, 2023.

[Williams et al., 2023] Ross Williams, Niyousha Hosseinichimeh, Aritra Majumdar, and Navid Ghaffarzadegan. Epidemic modeling with generative agents. arXiv preprint arXiv:2307.04986, 2023.

[Wooldridge and Jennings, 1995] Michael Wooldridge and Nicholas R. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 10:115–152, 1995.

[Wu et al., 2023a] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

[Xi et al., 2023] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, et al. The rise and potential of large language model based agents: A survey, 2023.

[Xiao et al., 2023] Bushi Xiao, Ziyuan Yin, and Zixuan Shan. Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research. arXiv preprint arXiv:2311.06957, 2023.

[Xie et al., 2023] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. Openagents: An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634, 2023.

[Xiong et al., 2023] Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. Examining inter-consistency of large language models collaboration: An in-depth analysis via debate, 2023.

[Xu et al., 2023b] Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. Exploring large language models for communication games: An empirical study on werewolf. arXiv preprint arXiv:2309.04658, 2023.

[Xu et al., 2023c] Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game. arXiv preprint arXiv:2310.18940, 2023.

[Yao et al., 2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023.

[Yu et al., 2023] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. Co-navgpt: Multi-robot cooperative visual semantic navigation using large language models, 2023.

[Zhang et al., 2023b] Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. Proagent: Building proactive cooperative ai with large language models. arXiv preprint arXiv:2308.11339, 2023.

[Zhang et al., 2023c] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485, 2023.

[Zhao et al., 2023] Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, and Xing Xie. Competeai: Understanding the competition behaviors in large language model-based agents, 2023.

[Zheng et al., 2023] Zhiling Zheng, Oufan Zhang, Ha L. Nguyen, Nakul Rampal, Ali H. Alawadhi, Zichao Rong, Teresa Head-Gordon, Christian Borgs, Jennifer T. Chayes, and Omar M. Yaghi. Chatgpt research group for optimizing the crystallinity of mofs and cofs. ACS Central Science, 9(11):2161–2170, 2023.

[Zhou et al., 2023a] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023.

[Ziems et al., 2023] Caleb Ziems, Omar Shaikh, Zhehao Zhang, William Held, Jiaao Chen, and Diyi Yang. Can large language models transform computational social science? Computational Linguistics, pages 1–53, 2023.