Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Thanh Luong Tuan

arxiv: 2606.00804 · v2 · pith:O4AI466Snew · submitted 2026-05-30 · 💻 cs.MA · cs.AI · cs.CL

Pith reviewed 2026-06-28 17:46 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL

keywords multi-agent systems · coordination strategies · dynamic routing · enterprise tasks · problem classes · quality evaluation · strategy selection

The pith

Enterprise multi-agent systems should select coordination strategies dynamically by problem class rather than fixing one globally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether multi-agent coordination patterns such as consensus or debate should be chosen by problem class rather than fixed for all enterprise tasks. It evaluates 1,440 outputs from four model arms on 30 tasks across six industries, all judged with a single fixed rubric. The results do not support consistently identifying an exact best strategy, but they strongly support a near-best dynamic routing claim: the selected strategy scores within 0.10 quality points of the top option in every tested model arm and problem class. Ranking consistency does not differ reliably between English-domain and Vietnamese-domain tasks. The paper concludes in favor of dynamic routing as a calibrated default policy for such systems.

Core claim

The paper claims that dynamic selection of coordination strategies by problem class produces outputs within 0.10 quality-score points of the best observed condition across all pre-registered model arms and an auxiliary validation arm, supporting its use as a calibrated default despite instability in exact winner identity.

What carries the argument

A frozen evaluation matrix of 30 enterprise tasks spanning five problem classes and four execution conditions, with all outputs scored by a fixed Sonnet rubric to compare single-agent, consensus, debate, and synthesis workflows.
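The size of the matrix follows directly from the design in the abstract: 30 tasks, four conditions, three replications, and four model arms. A minimal sketch of that enumeration is given here; the task IDs are placeholders, and the arm label `openai` is an assumed short name for the auxiliary validation arm.

```python
from itertools import product

# Counts, condition names, and arm names follow the abstract.
# Task IDs are hypothetical placeholders; "openai" is an assumed label.
TASKS = [f"task_{i:02d}" for i in range(1, 31)]  # 30 enterprise tasks
CONDITIONS = ["single_agent", "consensus", "debate", "synthesis"]
REPLICATIONS = [1, 2, 3]
MODEL_ARMS = ["qwen_local", "sonnet", "gemma_openrouter", "openai"]


def build_matrix():
    """Enumerate every (arm, task, condition, replication) cell to generate and judge."""
    return list(product(MODEL_ARMS, TASKS, CONDITIONS, REPLICATIONS))


matrix = build_matrix()
print(len(matrix))  # 4 arms * 30 tasks * 4 conditions * 3 replications = 1440
```

The six industries and five problem classes are attributes of the tasks, so they do not multiply the cell count.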

If this is right

Enterprise multi-agent deployments should treat dynamic routing by problem class as the default policy.
Strategy selection cannot be treated as a deterministic rule for picking an exact winner.
Structured compliance verification tasks favor the single-agent workflow over consensus in all tested arms.
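As a rough illustration of what "dynamic routing by problem class" could look like in a deployment, the sketch below uses a lookup table with a fallback. Only the compliance row reflects a finding reported here; the other class names and assignments are hypothetical placeholders, not the paper's mapping.

```python
from typing import Dict

STRATEGIES = {"single_agent", "consensus", "debate", "synthesis"}

# Hypothetical routing table. Only the first row mirrors a reported finding
# (structured compliance verification favors single_agent in all arms);
# the remaining rows are placeholders for illustration.
ROUTING_TABLE: Dict[str, str] = {
    "structured_compliance_verification": "single_agent",
    "example_class_a": "debate",
    "example_class_b": "synthesis",
}


def route(problem_class: str,
          table: Dict[str, str] = ROUTING_TABLE,
          default: str = "single_agent") -> str:
    """Return the coordination strategy for a problem class, falling back to a default."""
    strategy = table.get(problem_class, default)
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return strategy


print(route("structured_compliance_verification"))
print(route("some_unmapped_class"))
```

Treating the table as a calibrated default rather than a fixed law means it should be re-estimated when models or task mixes change.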

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 0.10 tolerance band may enable practical routing systems without requiring perfect winner prediction.
The approach could be tested in real-time production environments to check if the quality gap holds under live conditions.
Similar dynamic selection logic might apply to coordination in non-enterprise multi-agent settings with different task distributions.

Load-bearing premise

The fixed Sonnet rubric provides a stable and meaningful measure of output quality that generalizes across model arms, task domains, and coordination conditions.

What would settle it

A new experiment on fresh tasks where the predicted strategy scores more than 0.10 quality points below the best observed condition across multiple model arms would falsify the near-best routing claim.
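The falsification test above reduces to a simple per-cell check. The following is a sketch under stated assumptions: scores are averaged over replications within a cell, and the numbers are invented for illustration (the paper's score scale is not given here).

```python
from statistics import mean
from typing import Dict, List, Tuple


def near_best(scores: Dict[str, List[float]],
              predicted: str,
              tolerance: float = 0.10) -> Tuple[bool, float]:
    """Check whether the predicted condition is within `tolerance` of the best.

    scores maps each condition to its rubric scores across replications.
    Returns (within_tolerance, gap), where gap = best mean - predicted mean.
    """
    means = {cond: mean(vals) for cond, vals in scores.items()}
    gap = max(means.values()) - means[predicted]
    return gap <= tolerance, gap


# Invented scores for one (model arm, problem class) cell.
cell = {
    "single_agent": [4.0, 4.1, 4.2],
    "consensus": [4.0, 4.05, 4.1],
    "debate": [3.5, 3.6, 3.7],
    "synthesis": [3.9, 4.0, 4.1],
}

ok, gap = near_best(cell, predicted="consensus")
print(ok, round(gap, 3))
```

A cell counts against the near-best claim when the first returned value is `False`; the claim as stated requires it to be `True` in every arm and problem class.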

Original abstract

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
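For readers unfamiliar with the statistic used for H2, Kendall's W measures how consistently m rankers order the same n items, from 0 (no agreement) to 1 (identical rankings). The sketch below uses the standard formula without tie correction; how the paper forms its rankings (for example, one ranking of the four conditions per task) is an assumption here.

```python
from typing import List, Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's coefficient of concordance, no tie correction.

    rankings: m rank lists, each assigning ranks 1..n to the same n items.
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank sums from their mean m * (n + 1) / 2.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums: List[float] = [sum(r[j] for r in rankings) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Columns: single_agent, consensus, debate, synthesis (illustrative ranks only).
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
oppose = [[1, 2, 3, 4], [4, 3, 2, 1]]
print(kendalls_w(agree), kendalls_w(oppose))
```

A mean W of 0.20 in both strata, as reported, indicates weak agreement on the ordering of conditions in either language domain.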

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper ran a pre-registered matrix of 1,440 outputs showing that dynamic routing stays within 0.10 quality points of the best observed strategy across models and classes, but the single Sonnet rubric is the load-bearing assumption.

The core result is that their predicted coordination strategy lands close to the best condition in every arm and problem class, even after the stricter original H1 on exact winners fell apart. They tested 30 enterprise tasks in six industries with four model arms (one of them an auxiliary OpenAI validation arm) and three replications per cell, all scored by one fixed rubric. The pre-registration and the honest report that exact winner identity is unstable across arms are clear strengths.

The scale and the cross-model check give the bounded claim some weight. The Kendall test on Vietnamese versus English domains is a reasonable extra check even though it came up null. Structured compliance tasks favoring single_agent over consensus is a concrete observation worth noting.

The main vulnerability is the evaluation step. Every quality score, including the ones used to define the best condition, comes from the same Sonnet rubric. If that rubric favors outputs that look like Sonnet or particular coordination styles, the 0.10 closeness result could be partly an artifact. The abstract gives no error bars or variance details on the scores, and the 0.10 margin itself reads as post-hoc once the exact-winner test failed. The auxiliary arm does not break the dependence because it uses the same judge.

This is for practitioners who need evidence on when to route between consensus, debate, or single-agent workflows in real deployments rather than for theorists looking to shift the field. A reader building enterprise agent systems would find the task set and the near-best routing observation useful.

Send it to peer review. The design is solid enough and the operational claim is narrow but testable; referees can pressure the rubric and stats without the paper collapsing.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical evaluation of dynamic coordination strategy selection in enterprise multi-agent systems. Using a pre-registered design with 30 tasks across six industries and five problem classes, four execution conditions, three replications, and four model arms (including an auxiliary OpenAI validation), all 1440 outputs are scored by a fixed Sonnet rubric. The central claim is that while the pre-registered exact-winner criterion is not met, the predicted strategy is within 0.10 quality-score points of the best observed condition in every arm and problem class. H2 on Vietnamese vs English domain consistency is not supported.

Significance. If the Sonnet rubric provides a valid and generalizable measure of output quality, the findings support dynamic routing of coordination strategies as a calibrated default policy for enterprise multi-agent systems rather than a fixed global approach. The pre-registered design with multiple replications and model arms (including validation) is a clear strength that enhances reproducibility and reduces overfitting risk. The bounded near-best claim is operationally useful for practitioners, though the post-hoc adjustment from the original H1 tempers the strength of the conclusion.

major comments (2)

[Methods (Evaluation Protocol)] The reliance on a single fixed Sonnet rubric for all 1440 quality scores is load-bearing for the central empirical claim. This rubric is used both to identify the 'best observed condition' per cell and to evaluate whether the predicted strategy stays within 0.10 points. Potential bias toward Sonnet stylistic preferences or particular coordination patterns could make the closeness result an artifact; the auxiliary OpenAI arm does not mitigate this dependence since the same rubric is applied throughout.
[Results (near-best routing claim)] The 0.10 quality-score margin for declaring the predicted strategy 'near-best' appears selected after observing the data, following the failure of the pre-registered exact-winner/CI criterion. This post-hoc adjustment requires explicit justification or re-analysis with a pre-specified threshold to support the assertion that the weaker claim is 'strongly supported' across all arms and classes.

minor comments (2)

[Abstract] The abstract does not report error bars, confidence intervals, or full statistical details on the quality scores underlying the 0.10-point claim, which reduces transparency.
[Abstract] The exception for structured compliance verification (all arms favoring single_agent) would benefit from a brief quantitative summary of the deviation from the original mapping.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point-by-point below, with planned revisions where appropriate.

Referee: Methods (Evaluation Protocol): The reliance on a single fixed Sonnet rubric for all 1440 quality scores is load-bearing for the central empirical claim. This rubric is used both to identify the 'best observed condition' per cell and to evaluate whether the predicted strategy stays within 0.10 points. Potential bias toward Sonnet stylistic preferences or particular coordination patterns could make the closeness result an artifact; the auxiliary OpenAI arm does not mitigate this dependence since the same rubric is applied throughout.

Authors: We agree that dependence on a single fixed Sonnet rubric is a substantive limitation. The rubric was pre-registered and held constant to maximize scoring consistency and reproducibility across the 1440 outputs and four model arms. While the auxiliary OpenAI arm uses a different generator, it does not remove rubric bias. The cross-arm replication of the near-best result (including local and open-source models) provides partial mitigation, but does not fully address the concern. In revision we will add an explicit limitations subsection discussing rubric dependence and noting that future work could employ multiple independent rubrics or human raters.

revision: partial

Referee: Results (near-best routing claim): The 0.10 quality-score margin for declaring the predicted strategy 'near-best' appears selected after observing the data, following the failure of the pre-registered exact-winner/CI criterion. This post-hoc adjustment requires explicit justification or re-analysis with a pre-specified threshold to support the assertion that the weaker claim is 'strongly supported' across all arms and classes.

Authors: The referee is correct that the 0.10 margin is post-hoc. We will revise the manuscript to (a) state explicitly that the threshold was chosen after the pre-registered exact-winner criterion failed, (b) justify 0.10 on substantive grounds as a practically small difference on the quality-score scale, and (c) add a sensitivity table reporting the fraction of cells in which the predicted strategy remains within 0.05, 0.10, and 0.15 points of the best observed condition. This re-analysis will be presented as a robustness check rather than a pre-specified test.

revision: yes
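The sensitivity table proposed in the rebuttal could be computed along the following lines. This is a sketch only: the per-cell gaps (best observed mean minus predicted-strategy mean) are invented, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Sequence


def sensitivity_table(gaps: Sequence[float],
                      thresholds: Iterable[float] = (0.05, 0.10, 0.15)) -> Dict[float, float]:
    """Fraction of cells whose gap to the best observed condition is within each threshold."""
    return {t: sum(g <= t for g in gaps) / len(gaps) for t in thresholds}


# Invented gaps for six (model arm, problem class) cells.
example_gaps = [0.00, 0.02, 0.04, 0.07, 0.09, 0.12]

table = sensitivity_table(example_gaps)
for threshold, fraction in table.items():
    print(f"within {threshold:.2f}: {fraction:.2f}")
```

Reporting all three thresholds side by side shows how much of the "near-best" result depends on the specific 0.10 cutoff.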

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

The paper conducts a large-scale empirical evaluation of coordination strategies across 1440 outputs from multiple model arms, using a fixed external Sonnet rubric to score quality. No mathematical derivations, equations, parameter fitting, or self-citations are present that would reduce any claim to its own inputs by construction. The central finding (predicted strategy within 0.10 points of best observed) is a direct comparison of independently measured scores, with pre-registered hypotheses tested against observed data rather than derived from them. The rubric application is external to the strategies being compared.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical hypothesis test rather than a derivation, so the ledger contains only the measurement assumption and one ad-hoc threshold; no new entities are postulated.

free parameters (1)

0.10 quality-score margin
The near-best claim depends on this specific numeric threshold, which is not derived from first principles and appears selected to match observed differences.

axioms (1)

domain assumption: The fixed Sonnet rubric yields comparable quality scores across different model families and coordination conditions
All 1440 judgments rest on this single external judge without reported calibration or inter-rater checks.
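One way to probe this axiom would be to re-score a sample of outputs with a second, independent judge and compare. The sketch below is not part of the paper: the second judge, the function names, and the scores are all hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def judge_agreement(primary: Sequence[float],
                    secondary: Sequence[float]) -> Dict[str, float]:
    """Summarize agreement between the fixed rubric and a second judge on the same outputs."""
    mad = sum(abs(a - b) for a, b in zip(primary, secondary)) / len(primary)
    return {"pearson": pearson(primary, secondary), "mean_abs_diff": mad}


# Invented scores: the second judge is uniformly 0.5 points more generous.
sonnet_scores = [3.0, 4.0, 5.0]
second_judge_scores = [3.5, 4.5, 5.5]
print(judge_agreement(sonnet_scores, second_judge_scores))
```

A high correlation with a constant offset would leave the within-0.10 comparison intact, whereas disagreement that varies by coordination condition would undermine it.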

pith-pipeline@v0.9.1-grok · 5824 in / 1421 out tokens · 32556 ms · 2026-06-28T17:46:20.019740+00:00 · methodology

