Recognition: unknown
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
Pith reviewed 2026-05-08 03:40 UTC · model grok-4.3
The pith
LLM agent teams maintain a shared evolving task graph to reduce token use, time, and conflicts while matching accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In LATTE, a team of agents collaboratively constructs and maintains a shared, evolving coordination graph that encodes sub-task dependencies, individual agent assignments, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, the approach reduces token usage, wall-clock time, communication, and coordination failures such as file conflicts and redundant outputs, while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
What carries the argument
The shared evolving coordination graph that records sub-task dependencies, agent assignments, and progress states so agents can update it together under partial information.
If this is right
- Agents can discover new tasks and reallocate work dynamically instead of following a fixed plan.
- Coordination failures such as file conflicts and redundant outputs become less frequent.
- Overall token consumption and wall-clock time decrease across varied base models and tasks.
- Final accuracy stays at or above the level of fixed hierarchies and unstructured teams.
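The first two bullets can be made concrete with a toy sketch. The action names (discover, release, assign) echo the `<discover_task>`, `<release_task>`, and `<assign_task>` actions quoted from the paper's prompts, but the Python API here is invented for illustration.

```python
# Toy shared task list; the actions mirror the paper's prompt vocabulary,
# but this function-call API is an assumption made for this sketch.
tasks = {
    "task-1": {"state": "in_progress", "owner": "Dev1"},
}

def discover_task(task_id, description):
    # Any agent may add work that was not in the original plan.
    tasks[task_id] = {"state": "pending", "owner": None, "desc": description}

def release_task(task_id):
    # Straggler mitigation: return a stuck task to the pending pool.
    tasks[task_id].update(state="pending", owner=None)

def assign_task(task_id, agent):
    tasks[task_id].update(state="assigned", owner=agent)

discover_task("task-2", "fix failing import in tests")
release_task("task-1")          # Dev1 seems stuck; free the task
assign_task("task-1", "Dev2")   # reassign it
print(tasks["task-1"]["owner"], tasks["task-2"]["state"])  # Dev2 pending
```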
Where Pith is reading between the lines
- Similar shared-state mechanisms could help agent teams that must incorporate external tools or data sources mid-task.
- The approach may scale best on longer projects where the cost of early miscoordination grows large.
- Explicit dependency tracking might reduce the need for verbose natural-language messages between agents.
Load-bearing premise
Language model agents can reliably collaborate to build and maintain the shared task graph without introducing new inconsistencies or prohibitive maintenance overhead.
What would settle it
Run LATTE on a multi-step collaborative coding task and check whether the maintained graph ever allows two agents to edit the same file at once or repeat the same sub-task output.
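A minimal version of that check, sketched in Python: log which agent edited which file over which rounds, then scan for overlapping spans on the same file. The event-log format is an assumption for this sketch, not something the paper specifies.

```python
from collections import defaultdict

# Hypothetical event log: (agent, file, start_round, end_round) edit spans
# recorded while the team runs; the format is illustrative.
edits = [
    ("Dev1", "math_utils.py", 1, 3),
    ("Dev2", "tests.py",      2, 4),
    ("Dev2", "math_utils.py", 5, 6),  # after Dev1 finished: no conflict
]

def file_conflicts(edits):
    """Pairs of agents whose edit spans on the same file overlap in time."""
    by_file = defaultdict(list)
    for agent, path, start, end in edits:
        by_file[path].append((agent, start, end))
    conflicts = []
    for path, spans in by_file.items():
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                (a1, s1, e1), (a2, s2, e2) = spans[i], spans[j]
                if a1 != a2 and s1 <= e2 and s2 <= e1:  # intervals overlap
                    conflicts.append((path, a1, a2))
    return conflicts

print(file_conflicts(edits))  # [] — access to math_utils.py was serialized
```

An empty result across many runs would support the claim that the graph prevents concurrent edits; any non-empty result pinpoints exactly which file and agents the protocol failed to serialize.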
read the original abstract
Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LATTE (Language Agent Teams for Task Evolution), a coordination framework for LLM-based agent teams. Agents collaboratively build and maintain a shared, evolving task graph encoding sub-task dependencies, assignments, and progress states. Inspired by distributed-systems principles for operating under partial observability, the protocol aims to enable dynamic work allocation and adaptation. Empirical results across collaborative tasks and base models are reported to show that LATTE reduces token usage, wall-clock time, communication volume, and coordination failures (e.g., file conflicts, redundant outputs) while matching or exceeding the accuracy of baselines including MetaGPT, fully decentralized teams, Leader-Worker hierarchies, and static decompositions.
Significance. If the empirical claims hold under rigorous controls, LATTE offers a practical middle path between rigid a-priori structures and unstructured teams, potentially improving scalability and resource efficiency in multi-agent LLM deployments. The distributed-systems analogy and focus on measurable coordination failures provide a concrete, falsifiable protocol that could influence subsequent work on agent orchestration.
major comments (2)
- [Abstract / Experimental Evaluation] The central efficiency claims (reductions in tokens, time, communication, and failures) rest on experimental demonstrations, yet the abstract and framing provide no quantitative results, error bars, statistical tests, or controls. This leaves the magnitude and reliability of the reported improvements unassessable from the given material; the full manuscript must supply detailed tables, ablation studies, and significance testing to support the efficiency-accuracy tradeoff.
- [LATTE Protocol Description] The protocol's correctness hinges on agents reliably constructing, updating, and maintaining a consistent shared task graph under partial observability. The manuscript should provide a precise description (with pseudocode or state-transition rules) of conflict resolution, consistency guarantees, and overhead measurements for graph maintenance; without this, the weakest assumption identified in the review cannot be evaluated.
minor comments (2)
- [Introduction / Framework Overview] Define the precise data structure of the adaptive task graph (nodes, edges, state fields) with an early figure or formal notation to aid readability.
- [Experimental Setup] Clarify how baseline implementations (MetaGPT, Leader-Worker, etc.) were reproduced or adapted to ensure fair comparison of communication and failure metrics.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address the major comments point-by-point below and will revise the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Abstract / Experimental Evaluation] The central efficiency claims (reductions in tokens, time, communication, and failures) rest on experimental demonstrations, yet the abstract and framing provide no quantitative results, error bars, statistical tests, or controls. This leaves the magnitude and reliability of the reported improvements unassessable from the given material; the full manuscript must supply detailed tables, ablation studies, and significance testing to support the efficiency-accuracy tradeoff.
Authors: We agree that the abstract should include quantitative highlights to make the efficiency claims more concrete and assessable. The full manuscript provides detailed tables, ablation studies, and comparisons across tasks and models in the experimental section. In the revision, we will update the abstract to report key quantitative results such as average reductions in token usage and time, and we will ensure that error bars and statistical significance are explicitly discussed and visualized in the main text. revision: yes
-
Referee: [LATTE Protocol Description] The protocol's correctness hinges on agents reliably constructing, updating, and maintaining a consistent shared task graph under partial observability. The manuscript should provide a precise description (with pseudocode or state-transition rules) of conflict resolution, consistency guarantees, and overhead measurements for graph maintenance; without this, the weakest assumption identified in the review cannot be evaluated.
Authors: We acknowledge the need for a more precise and formal description of the LATTE protocol to allow evaluation of its correctness under partial observability. The manuscript currently describes the protocol in natural language with examples of graph evolution. To strengthen this, we will include pseudocode for the core procedures of task graph construction, update, and conflict resolution, along with a discussion of consistency mechanisms drawn from distributed systems principles. We will also add experimental measurements of the computational overhead for maintaining the shared graph. revision: yes
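One standard consistency mechanism the promised pseudocode could take as a starting point is optimistic concurrency control: each agent reads a versioned snapshot of the graph, and a write is rejected if the graph changed since that read. This is a sketch of that familiar distributed-systems idea, not a claim about LATTE's actual protocol.

```python
class SharedGraph:
    """Version-checked updates (optimistic concurrency control).
    A plausible consistency mechanism, not necessarily LATTE's actual one."""
    def __init__(self):
        self.version = 0
        self.tasks = {}  # task_id -> {"state": ..., "owner": ...}

    def snapshot(self):
        # Agents act on a versioned copy of the graph (partial observability).
        return self.version, dict(self.tasks)

    def try_update(self, seen_version, task_id, **changes):
        # Reject stale writes: the agent must have seen the latest graph.
        if seen_version != self.version:
            return False  # conflict: re-read the graph and retry
        self.tasks.setdefault(task_id, {}).update(changes)
        self.version += 1
        return True

g = SharedGraph()
v, _ = g.snapshot()
assert g.try_update(v, "task-1", state="assigned", owner="Dev1")
# A second agent acting on the now-stale snapshot is rejected:
assert not g.try_update(v, "task-1", state="assigned", owner="Dev2")
print(g.tasks["task-1"]["owner"])  # Dev1 keeps the assignment
```

Under this scheme, double assignment of a task requires two agents to win the same version check, which cannot happen; the overhead is one retry per rejected write.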
Circularity Check
No significant circularity; empirical framework with external validation
full rationale
The paper introduces LATTE as a practical coordination protocol for LLM agent teams, drawing inspiration from distributed systems concepts like partial observability but presenting it as an implemented design rather than a mathematical derivation. No equations, fitted parameters, or predictions appear that reduce by construction to inputs. Claims of efficiency gains (reduced tokens, time, conflicts) rest on empirical comparisons to external baselines (MetaGPT, decentralized teams, Leader-Worker hierarchies, static decompositions) across multiple tasks and models. The graph maintenance protocol is described as a design choice to handle consistency under partial views, not derived from self-cited uniqueness theorems or ansatzes. This is a self-contained empirical systems contribution with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents can collaboratively construct and maintain a consistent shared task graph without introducing new coordination failures
invented entities (1)
-
Adaptive Task Graph
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Executing task graphs using work-stealing
Kunal Agrawal, Charles E Leiserson, and Jim Sukha. Executing task graphs using work-stealing. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12. IEEE, 2010
2010
-
[2]
How we built our multi-agent research system
Anthropic. How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system, June 2025. Anthropic Engineering Blog
2025
-
[3]
Frédéric Berdoz, Leonardo Rugli, and Roger Wattenhofer. Can AI agents agree? arXiv preprint arXiv:2603.01213, 2026
-
[4]
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024. doi: 10.1609/aaai.v38i16.29720
-
[5]
Social agents: Collective intelligence improves LLM predictions
Aanisha Bhattacharyya, Abhilekh Borah, Yaman Kumar Singla, Rajiv Ratn Shah, Changyou Chen, and Balaji Krishnamurthy. Social agents: Collective intelligence improves LLM predictions. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[6]
The Mythical Man-Month: Essays on Software Engineering
Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA, 1975. ISBN 0-201-00650-2
1975
-
[7]
Marcelo Cataldo, James D. Herbsleb, and Kathleen M. Carley. Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 2–11, Kaiserslautern, Germany, 2...
-
[8]
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025
2025
-
[9]
Melvin E. Conway. How do committees invent? Datamation, 14(4):28–31, April 1968
1968
-
[10]
Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013. doi: 10.1145/2408776.2408794
-
[11]
MapReduce: Simplified data processing on large clusters
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008. doi: 10.1145/1327452.1327492
-
[12]
Hierarchical reinforcement learning with the MAXQ value function decomposition
Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000
2000
-
[13]
A survey on in-context learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024
2024
-
[14]
Improving factuality and reasoning in language models through multiagent debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024
2024
-
[15]
Gleiph Ghiotto, Leonardo Murta, Márcio Barros, and André van der Hoek. On the nature of merge conflicts: A study of 2,731 open source Java projects hosted by GitHub. IEEE Transactions on Software Engineering, 46(8):892–915, 2020. doi: 10.1109/TSE.2018.2871083
-
[16]
Planning with abstract Markov decision processes
Nakul Gopalan, Michael Littman, James MacGlashan, Shawn Squire, Stefanie Tellex, John Winder, and Lawson Wong. Planning with abstract Markov decision processes. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 27, pages 480–488, 2017
2017
-
[17]
Doing more with less: Meta-reasoning and meta-learning in humans and machines
Thomas L Griffiths, Frederick Callaway, Michael B Chang, Erin Grant, Paul M Krueger, and Falk Lieder. Doing more with less: Meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences, 29:24–30, 2019
2019
-
[18]
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. doi: 10.1093/biomet/57.1.97
-
[19]
Selecting computa- tions: Theory and applications
Nicholas Hay, Stuart Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computa- tions: Theory and applications. InProceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 346–355, 2012
2012
-
[20]
The Secret of Our Success: How Culture is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter
Joseph Henrich. The Secret of Our Success: How Culture is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, Princeton, NJ, 2015
-
[21]
James D. Herbsleb and Audris Mockus. An empirical study of speed and communication in globally distributed software development. IEEE Transactions on Software Engineering, 29(6):481–494, 2003. doi: 10.1109/TSE.2003.1205177
-
[22]
MetaGPT: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023
2023
-
[23]
Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of LLM-based multi-agent collaboration with faulty agents. arXiv preprint arXiv:2408.00989, 2024
-
[24]
Byzantine-robust decentralized coordination of LLM agents
Yongrae Jo and Chanik Park. Byzantine-robust decentralized coordination of LLM agents. arXiv preprint arXiv:2507.14928, 2025
-
[25]
A concurrent dynamic task graph
Theodore Johnson. A concurrent dynamic task graph. In 1993 International Conference on Parallel Processing (ICPP'93), volume 2, pages 223–230. IEEE, 1993
1993
-
[26]
Towards a Science of Scaling Agent Systems
Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296, 2025
2025
-
[27]
Metareasoning structures, problems, and modes for multiagent systems: A survey
Samuel T Langlois, Oghenetekevwe Akoroda, Estefany Carrillo, Jeffrey W Herrmann, Shapour Azarm, Huan Xu, and Michael Otte. Metareasoning structures, problems, and modes for multiagent systems: A survey. IEEE Access, 8:183080–183089, 2020
2020
-
[28]
Agent-oriented planning in multi-agent systems
Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[29]
Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. arXiv preprint arXiv:2402.05120, 2024
-
[30]
Lost in the middle: How language models use long contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
2024
-
[31]
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Shuo Liu, Tianle Chen, Ryan Amiri, and Christopher Amato. Learning decentralized LLM collaboration with multi-agent actor critic. arXiv preprint arXiv:2601.21972, 2026
2026
-
[32]
Alan MacCormack, Carliss Baldwin, and John Rusnak. Exploring the duality between product and organizational architectures: A test of the "mirroring" hypothesis. Research Policy, 41(8):1309–1324, 2012. doi: 10.1016/j.respol.2012.04.011
-
[33]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10), pages 135–146, Indianapolis, Indiana, USA, 2010. Association for Computing Machinery....
-
[34]
Learning scheduling algorithms for data processing clusters
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), pages 270–288. ACM, 2019. doi: 10.1145/3341302.3342080
2019
-
[36]
Language model teams as distributed systems
Elizabeth Mieczkowski, Katherine M Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L Griffiths. Language model teams as distributed systems. arXiv preprint arXiv:2603.12229, 2026
-
[37]
Ray: A distributed framework for emerging AI applications
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577. USENIX Association, 2018
2018
-
[38]
Multi-agent teams hold experts back
Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back. arXiv preprint arXiv:2602.01011, 2026
-
[39]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023. doi: 10.1145/3586183.3606763
-
[40]
Constantine D. Polychronopoulos and David J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, C-36(12):1425–1439, December 1987. doi: 10.1109/TC.1987.5009495
-
[41]
Chatdev: Communicative agents for software development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15174–15186, 2024
2024
-
[42]
A framework for meta-level control in multi-agent systems
Anita Raja and Victor Lesser. A framework for meta-level control in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 15(2):147–196, 2007
2007
-
[43]
Emergent Coordination in Multi-Agent Language Models
Christoph Riedl. Emergent coordination in multi-agent language models. arXiv preprint arXiv:2510.05174, 2025
2025
-
[44]
Michael Rizvi-Martel, Satwik Bhattamishra, Neil Rathi, Guillaume Rabusseau, and Michael Hahn. Benefits and limitations of communication in multi-agent reasoning. arXiv preprint arXiv:2510.13903, 2025
-
[45]
Principles of metareasoning
Stuart Russell and Eric Wefald. Principles of metareasoning. Artificial Intelligence, 49(1-3):361–395, 1991
1991
-
[46]
H. Sackman, W. J. Erikson, and E. E. Grant. Exploratory experimental studies comparing online and offline programming performance. Communications of the ACM, 11(1):3–11, 1968. doi: 10.1145/362851.362858
-
[47]
Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, et al. Agents of chaos. arXiv preprint arXiv:2602.20021, 2026
2026
-
[48]
HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023
2023
-
[49]
Multiagent metareasoning through organizational design
Jason Sleight and Edmund Durfee. Multiagent metareasoning through organizational design. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014
2014
-
[50]
The virtual lab of AI agents designs new SARS-CoV-2 nanobodies
Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of AI agents designs new SARS-CoV-2 nanobodies. Nature, 646(8085):716–723, 2025
2025
-
[51]
Understanding and sharing intentions: The origins of cultural cognition
Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5):675–691, 2005
2005
-
[52]
Performance-effective and low-complexity task scheduling for heterogeneous computing
Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002
2002
-
[53]
Distributed Systems
Maarten Van Steen and Andrew S Tanenbaum. Distributed Systems. distributed-systems.net, 2023
2023
-
[54]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022
2022
-
[55]
Strengthening the case for pair programming
Laurie Williams, Robert R Kessler, Ward Cunningham, and Ron Jeffries. Strengthening the case for pair programming. IEEE Software, 17(4):19–25, 2000
2000
-
[56]
Task scheduling in distributed computing systems with a genetic algorithm
Sung-Ho Woo, Sung-Bong Yang, Shin-Dug Kim, and Tack-Don Han. Task scheduling in distributed computing systems with a genetic algorithm. In Proceedings High Performance Computing on the Information Superhighway (HPC Asia '97), pages 301–305. IEEE, 1997
1997
-
[57]
Autogen: Enabling next-gen LLM applications via multi-agent conversations
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024
2024
-
[58]
Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, and Shangding Gu. Understanding agent scaling in LLM-based multi-agent systems via diversity. arXiv preprint arXiv:2602.03794, 2026
-
[59]
Tree of thoughts: Deliberate problem solving with large language models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023
2023
-
[60]
Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. Cut the crap: An economical communication pipeline for LLM-based multi-agent systems. arXiv preprint arXiv:2410.02506, 2024
-
[61]
Position: Science is collaborative—LLM for science should be too
Terry Jingchen Zhang, Wenyuan Jiang, Yongjin Yang, Sirui Lu, Bernhard Schölkopf, and Zhijing Jin. Position: Science is collaborative—LLM for science should be too. In ICLR 2026 Workshop on Foundation Models for Science: Real-World Impact, 2026. Oral
2026
-
[62]
Chain of agents: Large language models collaborating on long-context tasks
Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Arik. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024
2024
-
[63]
Least-to-most prompting enables complex reasoning in large language models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023
2023
discussion (0)