Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Amirhossein Nadiri; Yashar Talebirad

arxiv: 2306.03314 · v1 · pith:V6XPQS4Wnew · submitted 2023-06-05 · 💻 cs.AI · cs.LG · cs.MA

Pith reviewed 2026-05-24 07:45 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA

keywords multi-agent systems · large language models · collaborative agents · role assignment · task efficiency · knowledge exchange · AGI applications · system limitations

The pith

Multiple LLM-based agents, each with an assigned role, collaborate in a shared environment to handle complex tasks more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework in which several LLM-based agent components, each carrying distinct attributes and roles, operate together in a shared environment. This setup is presented as a way to improve how such models address demanding work by exchanging knowledge among the agents. A sympathetic reader would care because the claim implies that single-agent limitations can be reduced through structured division of labor and joint problem solving.

Core claim

The paper claims that a collaborative multi-agent environment, built by giving each component distinctive attributes and roles, allows large language models to manage complex tasks with greater efficiency and effectiveness than isolated agents, while also addressing issues such as repeated loops, scalability, and security through this division of responsibilities and knowledge exchange.

What carries the argument

The multi-agent collaboration framework that assigns distinctive attributes and roles to each agent component so they can jointly process tasks and exchange knowledge.
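
The mechanism described above can be pictured with a minimal sketch. This is not the authors' implementation: the `Agent` and `Environment` classes, the shared `notes` list, and the three role functions are all hypothetical names introduced here for illustration, and the role functions are plain string stubs standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Agent:
    """Hypothetical agent: a name, a role label, and a step function."""
    name: str
    role: str
    # act(task, notes) reads the shared notes and returns a contribution.
    act: Callable[[str, List[str]], str]


@dataclass
class Environment:
    """Hypothetical shared environment where agents post notes for each other."""
    agents: List[Agent]
    notes: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        # Each agent acts once, in order, seeing everything posted so far.
        for agent in self.agents:
            contribution = agent.act(task, list(self.notes))
            self.notes.append(f"{agent.role}:{contribution}")
        return self.notes


# Stub role functions; a real system would call a language model here.
def planner(task: str, notes: List[str]) -> str:
    return f"plan for {task}"


def coder(task: str, notes: List[str]) -> str:
    return f"code following [{notes[-1]}]"


def reviewer(task: str, notes: List[str]) -> str:
    return f"review of {len(notes)} prior notes"


env = Environment([
    Agent("a1", "planner", planner),
    Agent("a2", "coder", coder),
    Agent("a3", "reviewer", reviewer),
])
result = env.run("parse logs")
```

Each agent sees only the task and the notes posted before its turn, so knowledge exchange here is the act of reading and appending to the shared list. A fixed turn order is the simplest possible choice; it says nothing about how the paper schedules its agents.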

If this is right

Tasks that exceed single-agent capacity become feasible through role-based division and knowledge exchange.
Problems such as looping and scalability receive direct mitigation from the collaborative structure.
Applications across varied domains gain from the same agent-role mechanism without requiring separate redesigns.
Overall LLM performance improves as agents contribute specialized outputs to a shared result.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same role-assignment idea could be tested on tasks that require real-time adaptation rather than fixed roles.
Measuring coordination overhead directly would clarify whether the assumed efficiency gain holds once communication costs are counted.
Extending the framework to include agents that can request external data sources might reduce reliance on internal knowledge alone.

Load-bearing premise

That splitting work across multiple agents with assigned roles will produce net gains in efficiency and effectiveness without new coordination failures or added costs that cancel the benefit.

What would settle it

A side-by-side test on identical complex tasks that records completion rate, time taken, and error count for a single-agent version versus the multi-agent version under controlled conditions.
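
A test of this kind could be organized roughly as follows. The sketch assumes each system is a callable that takes a task string and returns a `(completed, error_count)` pair; that interface, the `RunRecord` type, and the two stub systems are hypothetical. The stubs return arbitrary values only so the harness runs end to end, and they are not measurements of any real system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed interface: a system maps a task to (completed, error_count).
System = Callable[[str], Tuple[bool, int]]


@dataclass
class RunRecord:
    completed: bool
    seconds: float
    errors: int


def evaluate(system: System, tasks: Sequence[str]) -> List[RunRecord]:
    """Run one system over every task, timing each run."""
    records = []
    for task in tasks:
        start = time.perf_counter()
        completed, errors = system(task)
        records.append(RunRecord(completed, time.perf_counter() - start, errors))
    return records


def summarize(records: Sequence[RunRecord]) -> Dict[str, float]:
    """Aggregate completion rate, mean time, and total error count."""
    if not records:
        raise ValueError("no runs to summarize")
    n = len(records)
    return {
        "completion_rate": sum(r.completed for r in records) / n,
        "mean_seconds": sum(r.seconds for r in records) / n,
        "total_errors": sum(r.errors for r in records),
    }


# Placeholder stubs with arbitrary behavior, NOT real results.
def single_agent_stub(task: str) -> Tuple[bool, int]:
    return (len(task) < 10, 0)


def multi_agent_stub(task: str) -> Tuple[bool, int]:
    return (True, 1)


TASKS = ["short", "a much longer task"]  # identical tasks for both systems
single_summary = summarize(evaluate(single_agent_stub, TASKS))
multi_summary = summarize(evaluate(multi_agent_stub, TASKS))
```

Running both systems over the same `TASKS` list is what makes the comparison side by side. In a real study the stubs would be replaced by the actual single-agent and multi-agent pipelines, and each task would likely be repeated several times to account for sampling variance.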

Original abstract

In this paper, we present a novel framework for enhancing the capabilities of large language models (LLMs) by leveraging the power of multi-agent systems. Our framework introduces a collaborative environment where multiple intelligent agent components, each with distinctive attributes and roles, work together to handle complex tasks more efficiently and effectively. We demonstrate the practicality and versatility of our framework through case studies in artificial general intelligence (AGI), specifically focusing on the Auto-GPT and BabyAGI models. We also examine the "Gorilla" model, which integrates external APIs into the LLM. Our framework addresses limitations and challenges such as looping issues, security risks, scalability, system evaluation, and ethical considerations. By modeling various domains such as courtroom simulations and software development scenarios, we showcase the potential applications and benefits of our proposed multi-agent system. Our framework provides an avenue for advancing the capabilities and performance of LLMs through collaboration and knowledge exchange among intelligent agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper sketches a multi-agent wrapper around existing LLM tools like Auto-GPT but supplies no measurements to show the wrapper improves anything.

The core point is that the authors describe a role-based multi-agent system for LLMs and illustrate it with courtroom and software-development scenarios, yet report no numbers on whether the setup actually reduces looping, improves task success, or beats simpler single-agent baselines. The contribution stays at the level of a proposed organizational pattern applied to previously published models. No new derivations, datasets, or controlled runs appear.

What the paper does lay out is a set of agent roles and how they might interact in those two domains, plus a short list of practical issues such as security and ethics. That part is straightforward and could serve as a quick checklist for someone already building agent systems.

The soft spots are more central. The abstract and case studies assert gains in efficiency and effectiveness, but the text gives only narrative descriptions of the scenarios. There are no iteration counts, success rates, resource comparisons, or failure-mode measurements. Without those, it is impossible to tell whether the added coordination overhead is offset by any benefit. The work also does not engage deeply with the cited prior systems beyond naming them, so the novelty claim rests on the framing rather than on new technical content.

Readers who are already experimenting with LLM agent orchestration might pick up one or two role ideas here. Anyone needing reproducible evidence or a technical advance will not find it. I would not send this to peer review as written; it would need at least one section with quantitative comparisons before it merits referee time.

Referee Report

2 major / 0 minor

Summary. The paper proposes a novel multi-agent collaborative framework for LLMs in which multiple agents with distinct roles and attributes collaborate to solve complex tasks more efficiently and effectively than single agents. It illustrates the framework via descriptive case studies on Auto-GPT, BabyAGI, and the Gorilla API-augmented model, modeling courtroom simulations and software-development scenarios while claiming to mitigate looping, scalability, security, and ethical issues.

Significance. If the central claim were supported by controlled measurements, the work could offer a practical template for organizing LLM agents to improve performance on multi-step tasks. The absence of any quantitative evaluation, however, means the manuscript currently functions as a high-level position paper rather than an empirical contribution.

major comments (2)

[Abstract] Abstract: the claim that the framework enables agents to 'handle complex tasks more efficiently and effectively' is presented without any success rates, iteration counts, resource-consumption figures, or single-agent baselines, rendering the efficiency/effectiveness assertion unsupported.
[Case studies] Case-study descriptions (courtroom and software-development scenarios): the text asserts that the multi-agent setup addresses looping, scalability, and coordination problems, yet supplies no measurable outcomes or controlled comparisons that would demonstrate net gains over the cited single-agent systems (Auto-GPT, BabyAGI).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments. The manuscript is a conceptual proposal of a multi-agent framework illustrated through descriptive case studies, not an empirical study with controlled measurements. We will revise the text to ensure all claims are appropriately qualified and to clarify the illustrative nature of the case studies.

Referee: [Abstract] Abstract: the claim that the framework enables agents to 'handle complex tasks more efficiently and effectively' is presented without any success rates, iteration counts, resource-consumption figures, or single-agent baselines, rendering the efficiency/effectiveness assertion unsupported.

Authors: We agree that the abstract contains an unsupported assertion. The work describes a framework and applies it conceptually to existing systems via case studies; no quantitative experiments were performed. We will revise the abstract to remove the efficiency/effectiveness claim and instead describe the framework as a proposed organizational structure whose benefits remain to be measured.
revision: yes

Referee: [Case studies] Case-study descriptions (courtroom and software-development scenarios): the text asserts that the multi-agent setup addresses looping, scalability, and coordination problems, yet supplies no measurable outcomes or controlled comparisons that would demonstrate net gains over the cited single-agent systems (Auto-GPT, BabyAGI).

Authors: The case studies are narrative illustrations of how distinct agent roles could be assigned within the framework; they do not constitute empirical evaluations. We will revise these sections to state explicitly that the scenarios are hypothetical demonstrations of the framework's structure and do not provide measured improvements or comparisons against single-agent baselines.
revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual framework with no equations or fitted predictions

The paper advances a multi-agent LLM framework via descriptive architecture and illustrative case studies (courtroom, software development) but contains no equations, parameters, or quantitative predictions. No derivation chain exists that could reduce outputs to inputs by construction. Self-citations, if present, are not invoked to establish uniqueness theorems or to substitute for independent evidence. The central claim of efficiency gains remains an untested assertion rather than a result forced by the authors' own definitions or prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the untested premise that role-based agent collaboration improves outcomes; no free parameters, mathematical axioms, or independently evidenced invented entities are stated because the text is a high-level proposal.

invented entities (1)

Multi-agent collaborative framework · no independent evidence
purpose: To enhance LLM task handling through agent specialization and interaction
Presented as the novel contribution but without external verification or falsifiable predictions supplied in the abstract.

pith-pipeline@v0.9.0 · 5689 in / 1166 out tokens · 31962 ms · 2026-05-24T07:45:58.682003+00:00 · methodology

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate
cs.CL 2026-01 unverdicted novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data
cs.LG 2024-06 unverdicted novelty 7.0

MALLM-GAN uses multi-agent LLMs to emulate GAN architecture for generating higher-quality synthetic tabular data from small samples than prior models, while preserving privacy.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning
cs.MA 2026-05 unverdicted novelty 6.0

Formalizes design space for human-LLM collaborative planning along mode, scope, and level axes; evaluates AMBIPOM prototype via user study and benchmark revealing hybrid workflows and trade-offs.
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
cs.CL 2026-05 unverdicted novelty 6.0

MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
cs.AI 2026-05 unverdicted novelty 6.0

A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
cs.CL 2026-04 unverdicted novelty 6.0

EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
Explicit Trait Inference for Multi-Agent Coordination
cs.AI 2026-04 unverdicted novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
cs.AI 2026-04 unverdicted novelty 6.0

A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.
PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
cs.CR 2026-04 unverdicted novelty 6.0

PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
cs.MA 2026-03 unverdicted novelty 6.0

A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
cs.AI 2026-01 unverdicted novelty 6.0

Multi-agent actor-critic methods with a centralized critic improve decentralized LLM collaboration over Monte Carlo baselines in long-horizon and sparse-reward settings.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI 2025-07 unverdicted novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
Language Model Networks: Supervision-Efficient Learning through Dense Communication
cs.AI 2025-05 unverdicted novelty 6.0

LMNet connects stripped LLMs as nodes with trainable seq2seq edges for dense vector exchange, supporting supervision-efficient learning through differentiable communication.
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
cs.AI 2026-05 unverdicted novelty 5.0

U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
LanG -- A Governance-Aware Agentic AI Platform for Unified Security Operations
cs.CR 2026-04 unverdicted novelty 5.0

LanG presents a governance-aware agentic AI platform for unified security operations that reports strong performance on incident correlation, rule generation, attack reconstruction, and AI safety guardrails in an open...
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
Autonomy Reshapes How Personalization Affects Privacy Concerns and Trust in LLM Agents
cs.HC 2025-10 conditional novelty 5.0

A 3x3 between-subjects experiment finds that risk-contingent autonomy in LLM agents attenuates personalization's negative effects on privacy concerns and trust via increased perceived control.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
cs.AI 2025-01 unverdicted novelty 4.0

The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...
Large Language Model-Based Agents for Software Engineering: A Survey
cs.SE 2024-09 unverdicted novelty 4.0

A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation
cs.IR 2026-03 unverdicted novelty 3.0

A multi-agent LLM system using CrewAI and RAG improves response coherence and correctness over a single-LLM RAG baseline for Brazilian labor law Q&A.
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE 2026-02 unverdicted novelty 3.0

A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...
LLM Multi-Agent Systems: Challenges and Open Problems
cs.MA 2024-02 unverdicted novelty 2.0

The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 29 Pith papers · 1 internal anchor

[1] David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, and Charles Sutton. Language model cascades, 2022

[2] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023

[3] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023

[4] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023

[5] Chaoning Zhang, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, Jung Uk Kim, Seong Tae Kim, Jinwoo Choi, Gyeong-Moon Park, Sung-Ho Bae, Lik-Hang Lee, Pan Hui, In So Kweon, and Choong Seon Hong. One small step for generative ai, one giant leap for agi: A complete survey on chatgpt in aigc era, 2023

[6] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity, 2023

[7] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

[8] Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback, 2023

[9] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023

[10] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023

[11] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022. Accessed: 2023-06-04

[12] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis, 2023

[13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[14] Sil Hamilton. Blind judgement: Agent-based supreme court modelling with gpt, 2023
