What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

Chen Huang; Wenxuan Zhang; Yuhao Wu

arxiv: 2606.05304 · v1 · pith:MQIOUAXJnew · submitted 2026-06-03 · 💻 cs.AI


Pith reviewed 2026-06-28 06:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · large language models · inter-agent communication · token efficiency · action-state records · PACT


The pith

Projecting each agent's raw output to a compact action-state record lets multi-agent LLM systems match or exceed task performance while using far fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how LLM-based multi-agent systems pass messages and finds that unconstrained natural language quickly inflates token counts and crowds the shared context. Analysis of five common strategies across two topologies shows that no single approach works best everywhere, but messages that keep only the action-centered facts needed by the next agent stay effective. From this observation the authors build PACT, a method that converts every raw output into a short action-state record and treats the shared history as a public state that gets updated rather than appended. Tests on varied layouts and on real coding agents show the compact records deliver the same or better results at much lower token cost.

Core claim

PACT treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies this yields comparable or stronger task performance with substantially fewer tokens. The same gains appear in production coding harnesses: PACT raises OpenHands resolve rate while cutting tokens-per-resolved by 10 percent and keeps SWE-agent resolve rate unchanged while halving input tokens.

What carries the argument

PACT (Protocolized Action-state Communication and Transmission), which converts raw agent outputs into compact action-state records that update a shared public state.
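
As a rough illustration of the idea above, the sketch below shows one way a raw output could be reduced to an action-state record and merged into a shared public state. Everything named here (the `ActionStateRecord` fields, the `ACTION:`/`TARGET:`/`OUTCOME:` line tags, the `project` helper, and the overwrite-by-key update rule) is an assumption made for illustration; the paper's actual record schema and projection step are not described in this summary and may well be model-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionStateRecord:
    # Hypothetical fields; the paper's actual schema is not given in this summary.
    agent: str
    action: str
    target: str
    outcome: str


def project(agent: str, raw_output: str) -> ActionStateRecord:
    """Toy projector: keep only lines tagged ACTION/TARGET/OUTCOME, drop the rest."""
    fields = {"ACTION": "", "TARGET": "", "OUTCOME": ""}
    for line in raw_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().upper() in fields:
            fields[key.strip().upper()] = value.strip()
    return ActionStateRecord(agent, fields["ACTION"], fields["TARGET"], fields["OUTCOME"])


class PublicState:
    """Shared state that is updated in place instead of growing as a transcript."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str], ActionStateRecord] = {}

    def update(self, record: ActionStateRecord) -> None:
        # Assumption: a newer record for the same (agent, target) replaces the older one.
        self._records[(record.agent, record.target)] = record

    def render(self) -> str:
        return "\n".join(
            f"{r.agent} | {r.action} | {r.target} | {r.outcome}"
            for r in self._records.values()
        )


# Verbose raw output: free-form reasoning followed by the tagged facts.
raw = (
    "Let me think about this carefully. The traceback points at the parser...\n"
    "ACTION: edit\n"
    "TARGET: parser.py\n"
    "OUTCOME: tests failing (2 of 10)\n"
)
state = PublicState()
state.update(project("coder", raw))
state.update(project("coder", raw.replace("failing (2 of 10)", "passing")))
print(state.render())
```

In this toy run the second update replaces the first, so the rendered state stays at one line however many times the same file is revisited; an append-only transcript would keep both raw messages.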

If this is right

PACT improves the performance-cost trade-off on every MAS topology tested.
PACT raises OpenHands resolve rate while reducing tokens-per-resolved by 10 percent.
PACT keeps SWE-agent resolve rate unchanged while halving input tokens.
No fixed communication strategy is optimal for all topologies; only messages that retain action-centered information remain reliable.
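
The tokens-per-resolved figure can be read as total tokens spent divided by the number of tasks resolved; the abstract does not spell out the definition, so that reading is an assumption. The sketch below uses invented numbers (not results from the paper) to show that this metric can fall by 10 percent even when total token spend barely changes, provided the resolve count rises.

```python
def tokens_per_resolved(total_tokens: int, resolved: int) -> float:
    """Total tokens spent divided by number of resolved tasks (assumed definition)."""
    if resolved <= 0:
        raise ValueError("no resolved tasks: metric is undefined")
    return total_tokens / resolved


def relative_change(new: float, old: float) -> float:
    """Signed fractional change from old to new, e.g. -0.10 for a 10 percent drop."""
    return (new - old) / old


# Invented, illustrative numbers only; not taken from the paper.
baseline = tokens_per_resolved(total_tokens=8_000_000, resolved=40)
compact = tokens_per_resolved(total_tokens=7_920_000, resolved=44)

print(f"baseline: {baseline:,.0f} tokens per resolved task")
print(f"compact:  {compact:,.0f} tokens per resolved task")
print(f"change:   {relative_change(compact, baseline):+.0%}")
```

Here total tokens drop by only 1 percent while the per-resolved cost drops by 10 percent, which is why the metric is best read alongside the raw resolve rate and raw token counts.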

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same state-update framing could be applied to non-coding multi-agent workflows where token budgets are tight.
If the records lose information on some edge tasks, the method would need task-specific extensions or fallback rules.
Large-scale deployments that hit context limits first would see the largest absolute cost savings.

Load-bearing premise

That the compact action-state records always preserve every fact downstream agents need, with no task-specific loss.

What would settle it

Any concrete task and topology where replacing raw language with the compact records produces measurably lower final performance than the baseline communication method.
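
One hedged way to make "measurably lower" concrete is a paired test on per-task outcomes, since both methods are run on the same tasks. The sketch below computes a two-sided exact McNemar p-value from pass/fail lists; the choice of test, the function name `mcnemar_exact_p`, and the toy outcome lists are assumptions for illustration, not the paper's evaluation protocol.

```python
from math import comb


def mcnemar_exact_p(baseline_ok: list[bool], compact_ok: list[bool]) -> float:
    """Two-sided exact McNemar p-value on paired per-task pass/fail outcomes."""
    if len(baseline_ok) != len(compact_ok):
        raise ValueError("outcome lists must cover the same tasks")
    # Discordant pairs: tasks solved by exactly one of the two methods.
    b = sum(1 for x, y in zip(baseline_ok, compact_ok) if x and not y)
    c = sum(1 for x, y in zip(baseline_ok, compact_ok) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two methods never disagree
    k = min(b, c)
    # Under the null, each discordant pair is a fair coin flip: X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes (invented): baseline solves all 10 tasks, compact records solve only 2.
baseline_ok = [True] * 10
compact_ok = [True] * 2 + [False] * 8
p = mcnemar_exact_p(baseline_ok, compact_ok)
print(f"p = {p:.4f}")
```

A small p-value with the baseline ahead on discordant tasks would count as evidence against the premise for that task and topology; a single run without such a test would not.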

Figures

Figures reproduced from arXiv: 2606.05304 by Chen Huang, Wenxuan Zhang, Yuhao Wu.

**Figure 2.** Five inter-agent communication strategies in two MAS settings at three model scales. Top two rows: …

**Figure 3.** Average agent turns per interaction dialogue: …

**Figure 5.** Illustrative interaction turn with and without …

Original abstract

Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose the PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at -10% tokens-per-resolved, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PACT gives a concrete way to cut tokens in multi-agent LLM systems by turning outputs into action-state records, with gains on the tested coding agents.


The main takeaway is that PACT projects raw agent outputs into compact action-state records treated as public updates, and this improves the performance-cost trade-off on the coding tasks they ran.

They first tested five communication strategies across two topologies and observed that keeping action-centered information matters more than the exact format. From that they built PACT as a structured protocol to replace unconstrained natural language. The reported results show it lifts the OpenHands resolve rate at 10% lower tokens per resolved task and leaves the SWE-agent resolve rate unchanged while halving input tokens. Public code is a plus.

The analysis of existing strategies is useful grounding, and the focus on a real bottleneck like context length and cost is practical. The empirical gains on production harnesses give the claim some weight.

The soft spots are missing experimental details: error bars, the exact construction of the action-state records, and any ablation on what gets dropped. The concern that the records may omit context downstream agents need is fair; the abstract shows the method worked on the tested cases but does not establish that it holds more generally or outside coding tasks.

This is for researchers and engineers working on multi-agent LLM orchestration who need to manage token budgets. It deserves peer review because the method is specific, the results are on real systems, and the idea is testable even if more experiments would be needed in revision.

Referee Report

1 major / 1 minor

Summary. The paper analyzes five common inter-agent communication strategies across two MAS topologies in LLM-based multi-agent systems, finding no fixed strategy is universally optimal and that effective messages preserve action-centered information. It proposes PACT, which projects raw agent outputs into compact action-state records as public state updates to reduce token usage. Empirical results claim consistent improvements in the performance-cost trade-off across topologies, with specific gains on production coding harnesses: improved resolve rate on OpenHands at -10% tokens-per-resolved, and resolve-neutral performance on SWE-agent with halved input tokens. Code is released publicly.

Significance. If the results hold, PACT provides a practical protocol for lowering inference costs and context pressure in multi-agent LLM systems without performance loss. The public code release is a strength that supports reproducibility and further testing.

major comments (1)

[Abstract] The claim that projecting raw outputs to compact action-state records 'preserves' all information required by downstream agents is load-bearing for the 'comparable or stronger task performance' result, yet is supported only by empirical outcomes on the tested topologies and harnesses; no formal completeness argument or mechanism is supplied to guarantee retention of private reasoning or dependency chains for arbitrary tasks.

minor comments (1)

The abstract reports consistent improvements but supplies no experimental details on baselines, number of runs, error bars, or exclusion criteria, limiting assessment of the reported gains on OpenHands and SWE-agent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important point on the abstract's phrasing. We address the concern directly below and agree that revisions are warranted to align language with the empirical nature of our results.


Referee: [Abstract] The claim that projecting raw outputs to compact action-state records 'preserves' all information required by downstream agents is load-bearing for the 'comparable or stronger task performance' result, yet is supported only by empirical outcomes on the tested topologies and harnesses; no formal completeness argument or mechanism is supplied to guarantee retention of private reasoning or dependency chains for arbitrary tasks.

Authors: We agree the results are strictly empirical and provide no formal completeness argument or mechanism that would guarantee retention of all private reasoning or dependency chains for arbitrary tasks. The manuscript demonstrates that action-state records suffice for the tested topologies and harnesses, but does not claim universality. We will revise the abstract (and related sections) to replace any implication of guaranteed preservation with explicit reference to observed empirical outcomes, e.g., 'achieving comparable or stronger task performance with substantially fewer tokens across the evaluated settings.' This change removes the load-bearing overclaim while preserving the paper's core empirical contribution.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of PACT independent of self-referential definitions


The paper analyzes five communication strategies empirically across MAS topologies, observes that action-centered information is effective, and introduces PACT as a projection method. Performance gains are reported from direct experiments on topologies and production harnesses (OpenHands, SWE-agent) with token and resolve-rate metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked in the provided text to derive the central claims; results are presented as measured outcomes rather than reductions by construction. The assumption that action-state records preserve necessary information is tested empirically but not claimed as proven by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation that action-state information is sufficient; no free parameters, mathematical axioms, or new physical entities are introduced.

axioms (1)

domain assumption: Action-centered information is the primary content needed by downstream agents in the tested MAS topologies.
Invoked to justify projecting raw outputs into compact records.

invented entities (1)

PACT protocol (no independent evidence)
purpose: To treat inter-agent communication as a public state-update problem and project outputs into action-state records
New method introduced by the paper; no independent evidence outside the reported experiments

pith-pipeline@v0.9.1-grok · 5737 in / 1222 out tokens · 28560 ms · 2026-06-28T06:20:53.298884+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 32 canonical work pages · 22 internal anchors

[1]

Cohen and Ruslan Salakhutdinov and Christopher D

Zhilin Yang and Peng Qi and Saizheng Zhang and Yoshua Bengio and William W. Cohen and Ruslan Salakhutdinov and Christopher D. Manning , title =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages =. 2018 , doi =

2018
[2]

Proceedings of the 28th International Conference on Computational Linguistics , pages =

Xanh Ho and Anh-Khoa Duong Nguyen and Saku Sugawara and Akiko Aizawa , title =. Proceedings of the 28th International Conference on Computational Linguistics , pages =. 2020 , doi =

2020
[3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman , title =. arXiv preprint arXiv:2110.14168 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , title =. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =
[6]

2026 , month = apr, howpublished =

2026
[7]

2026 , howpublished =

2026
[9]

Proceedings of the 13th International Conference on Learning Representations , year =

Weize Chen and Ziming You and Ran Li and Yitong Guan and Chen Qian and Chenyang Zhao and Cheng Yang and Ruobing Xie and Zhiyuan Liu and Maosong Sun , title =. Proceedings of the 13th International Conference on Learning Representations , year =
[10]

White and Doug Burger and Chi Wang , title =

Qingyun Wu and Gagan Bansal and Jieyu Zhang and Yiran Wu and Beibin Li and Erkang Zhu and Li Jiang and Xiaoyun Zhang and Shaokun Zhang and Jiale Liu and Ahmed Hassan Awadallah and Ryen W. White and Doug Burger and Chi Wang , title =. Proceedings of the 12th International Conference on Learning Representations , year =
[11]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =

Zhenhailong Wang and Shaoguang Mao and Wenshan Wu and Tao Ge and Furu Wei and Heng Ji , title =. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =. 2024 , doi =

2024
[12]

Proceedings of the 12th International Conference on Learning Representations , year =

Weize Chen and Yusheng Su and Jingwei Zuo and Cheng Yang and Chenfei Yuan and Chi-Min Chan and Heyang Yu and Yaxi Lu and Yi-Hsin Hung and Chen Qian and Yujia Qin and Xin Cong and Ruobing Xie and Zhiyuan Liu and Maosong Sun and Jie Zhou , title =. Proceedings of the 12th International Conference on Learning Representations , year =
[13]

Advances in Neural Information Processing Systems , volume =

Guohao Li and Hasan Anil Hammoud and Hani Itani and Dmitrii Khizbullin and Bernard Ghanem , title =. Advances in Neural Information Processing Systems , volume =. 2023 , url =

2023
[14]

Proceedings of the 11th International Conference on Learning Representations , year =

Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik Narasimhan and Yuan Cao , title =. Proceedings of the 11th International Conference on Learning Representations , year =
[15]

Le and Denny Zhou , title =

Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed Chi and Quoc V. Le and Denny Zhou , title =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

2022
[16]

Foundations and Trends in Information Retrieval , volume =

Stephen Robertson and Hugo Zaragoza , title =. Foundations and Trends in Information Retrieval , volume =. 2009 , doi =

2009
[17]

M., Bohnet, B., Rosias, L., Chan, S., Zhang, B., Anand, A., Abbas, Z., Nova, A., Co-Reyes, J

Rishabh Agarwal and Avi Singh and Lei M. Zhang and Bernd Bohnet and Luis Rosias and Stephanie Chan and Biao Zhang and Ankesh Anand and Zaheer Abbas and Azade Nova and John D. Co-Reyes and Eric Chu and Feryal Behbahani and Aleksandra Faust and Hugo Larochelle , title =. arXiv preprint arXiv:2404.11018 , year =

work page arXiv
[18]

Tenenbaum and Igor Mordatch , title =

Yilun Du and Shuang Li and Antonio Torralba and Joshua B. Tenenbaum and Igor Mordatch , title =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , url =

2024
[20]

arXiv preprint arXiv:2408.13654 , year =

Hao Zhou and Chengkun Li and Junlang Qian and Zhen Huang and Fandong Meng and Jie Zhou , title =. arXiv preprint arXiv:2408.13654 , year =

work page arXiv
[21]

The Llama 3 Herd of Models

The. arXiv preprint arXiv:2407.21783 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[23]

arXiv preprint arXiv:2303.08774 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2405.14394 , year =

Yifan Shen and Zhiqi Bu and Fang Chen and Jing Li , title =. arXiv preprint arXiv:2405.14394 , year =

work page arXiv
[25]

Bowman , title =

David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman , title =. Proceedings of the First Conference on Language Modeling (COLM) , year =
[26]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages =

Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal , title =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages =. 2018 , doi =

2018
[27]

Evaluating Large Language Models Trained on Code

Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and others , title =. arXiv preprint arXiv:2107.03374 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Program Synthesis with Large Language Models

Jacob Austin and Augustus Odena and Maxwell Nye and Maarten Bosma and others , title =. arXiv preprint arXiv:2108.07732 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[30]

Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan , title =

Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan , title =. Proceedings of the Twelfth International Conference on Learning Representations (ICLR) , year =
[31]

Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H

Xingyao Wang and Boxuan Li and Yufan Song and Frank F. Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H. Tran and Fuqiang Li and Ren Ma and Mingzhang Zheng and Bill Qian and Yanjun Shao and Niklas Muennighoff and Yizhe Zhang and Binyuan Hui and Junyang Lin and Robert Brennan and Hao Peng and H...
[32]

Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , title =

John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[34]

Advances in Neural Information Processing Systems , volume=

Why do multi-agent llm systems fail? , author=. Advances in Neural Information Processing Systems , volume=
[35]

International Conference on Learning Representations , volume=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. International Conference on Learning Representations , volume=
[37]

Advances in Neural Information Processing Systems , volume=

Chain of agents: Large language models collaborating on long-context tasks , author=. Advances in Neural Information Processing Systems , volume=
[39]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Improving multi-agent debate with sparse communication topology , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[42]

Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=
[43]

2024 , howpublished =

2024
[44]

2025 , howpublished =

2025
[45]

International Conference on Learning Representations , volume=

Cut the crap: An economical communication pipeline for llm-based multi-agent systems , author=. International Conference on Learning Representations , volume=
[46]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Understanding the information propagation effects of communication topologies in llm-based multi-agent systems , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[47]

Forty-first International Conference on Machine Learning , year=

Executable code actions elicit better llm agents , author=. Forty-first International Conference on Machine Learning , year=
[48]

S2-mad: Breaking the token barrier to enhance multi-agent debate efficiency , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[49]

Anthropic . 2026 a . Claude Code . https://claude.com/product/claude-code

2026
[50]

Anthropic . 2026 b . Introducing Claude Opus 4.7 . https://www.anthropic.com/news/claude-opus-4-7

2026
[51]

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, and Jiaxin Pei. 2026. How do ai agents spend your money? analyzing and predicting token consumption in agentic coding tasks. arXiv preprint arXiv:2604.22750

work page internal anchor Pith review Pith/arXiv arXiv 2026
[52]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, and 1 others. 2026. Why do multi-agent llm systems fail? Advances in Neural Information Processing Systems, 38

2026
[53]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2024. https://arxiv.org/abs/2308.10848 AgentVerse : Facilitating multi-agent collaboration and exploring emergent behaviors in agents . In Proceedings of the 12t...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2025. https://arxiv.org/abs/2410.08115 Optima : Optimizing effectiveness and efficiency for LLM -based multi-agent system . In Proceedings of the 13th International Conference on Learning Representations

work page arXiv 2025
[55]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. https://arxiv.org/abs/2305.14325 Improving factuality and reasoning in language models through multiagent debate . In Proceedings of the 41st International Conference on Machine Learning, pages 11850--11881

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. https://doi.org/10.18653/v1/2020.coling-main.580 Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps . In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609--6625

work page doi:10.18653/v1/2020.coling-main.580 2020
[57]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, and 1 others. 2024. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pages 23247--23275

2024
[58]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. https://arxiv.org/abs/2310.06770 SWE-bench : Can language models resolve real-world GitHub issues? In Proceedings of the Twelfth International Conference on Learning Representations (ICLR)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Guohao Li, Hasan Anil Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. https://arxiv.org/abs/2303.17760 CAMEL : Communicative agents for ``mind'' exploration of large language model society . In Advances in Neural Information Processing Systems, volume 36, pages 51991--52008

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. 2024. Improving multi-agent debate with sparse communication topology. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7281--7294

2024
[61]

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. https://arxiv.org/abs/2305.19118 Encouraging divergent thinking in large language models through multi-agent debate

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

math-ai . 2025. AIME 2025 dataset. https://huggingface.co/datasets/math-ai/aime25. Hugging Face dataset

2025
[63]

Maxwell-Jia . 2024. AIME 2024 dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024. Hugging Face dataset

2024
[64]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? a new dataset for open book question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391

work page doi:10.18653/v1/d18-1260 2018
[65]

OpenAI . 2026 a . Codex . https://openai.com/codex/

2026
[66]

OpenAI . 2026 b . Introducing GPT-5.5 . https://openai.com/index/introducing-gpt-5-5/

2026
[67]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, and 1 others. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174--15186

2024
[68]

Vignav Ramesh and Kenneth Li. 2025. Communicating activations between language model agents. arXiv preprint arXiv:2501.14082

work page arXiv 2025
[69]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. https://arxiv.org/abs/2311.12022 GPQA : A graduate-level google-proof Q&A benchmark . In Proceedings of the First Conference on Language Modeling (COLM)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, and Xin Wang. 2025. Understanding the information propagation effects of communication topologies in llm-based multi-agent systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12358--12372

2025
[71]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, and 1 others. 2026. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276

work page internal anchor Pith review Pith/arXiv arXiv 2026
[72]

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. 2024 a . https://arxiv.org/abs/2406.04692 Mixture-of-agents enhances large language model capabilities

work page internal anchor Pith review Pith/arXiv arXiv 2024
[73]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024 b . Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning

2024
[74]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, and 5 others. 2025. https://arxiv.org/abs/2407.16741 OpenHands : An open platform for AI software develope...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[75]

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024 c . https://doi.org/10.18653/v1/2024.naacl-long.15 Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Co...

work page doi:10.18653/v1/2024.naacl-long.15 2024
[76]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. https://arxiv.org/abs/2201.11903 Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35, pages 24824--24837

work page internal anchor Pith review Pith/arXiv arXiv 2022
[77]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. https://arxiv.org/abs/2308.08155 AutoGen : Enabling next-gen LLM applications via multi-agent conversation . In Proceedings of the 12th International Conference o...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[79]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. https://arxiv.org/abs/2405.15793 SWE-agent : Agent-computer interfaces enable automated software engineering . In Advances in Neural Information Processing Systems (NeurIPS)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

In: Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA : A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380

[81]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations. https://arxiv.org/abs/2210.03629

[82]

Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, and Haohan Wang. 2026. Learning to communicate: Toward end-to-end optimization of multi-agent language systems. arXiv preprint arXiv:2604.21794.

[83]

Yuting Zeng, Weizhe Huang, Lei Jiang, Tongxuan Liu, Xitai Jin, Chen Tianying Tiana, Jing Li, and Xiaohua Xu. 2025. S2-MAD: Breaking the token barrier to enhance multi-agent debate efficiency. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1:...

[84]

Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Yu, and Tianlong Chen. 2025. Cut the crap: An economical communication pipeline for LLM-based multi-agent systems. In International Conference on Learning Representations, volume 2025, pages 75389–75428.

[85]

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arık. 2024. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237.

[86]

Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, and Haoran Zhang. 2025. Connecting the dots: A chain-of-collaboration prompting framework for LLM agents. arXiv preprint arXiv:2505.10936.

[87]

Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, and 1 others. 2025. Latent collaboration in multi-agent systems. arXiv preprint arXiv:2511.20639.

[3] [3]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[4] [4]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

[14] [16]

Stephen Robertson and Hugo Zaragoza. 2009. Foundations and Trends in Information Retrieval.

[15] [17]

Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, and Hugo Larochelle. arXiv preprint arXiv:2404.11018.

[17] [20]

Hao Zhou, Chengkun Li, Junlang Qian, Zhen Huang, Fandong Meng, and Jie Zhou. arXiv preprint arXiv:2408.13654.

[18] [21]

The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[19] [23]

arXiv preprint arXiv:2303.08774.

[20] [24]

Yifan Shen, Zhiqi Bu, Fang Chen, and Jing Li. arXiv preprint arXiv:2405.14394.

[23] [27]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, and others. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[24] [28]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, and others. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

[25] [29]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. In Advances in Neural Information Processing Systems (NeurIPS).

[40] [49]

Anthropic. 2026a. Claude Code. https://claude.com/product/claude-code

[41] [50]

Anthropic. 2026b. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7

[42] [51]

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, and Jiaxin Pei. 2026. How do AI agents spend your money? Analyzing and predicting token consumption in agentic coding tasks. arXiv preprint arXiv:2604.22750.

[43] [52]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, and 1 others. 2026. Why do multi-agent LLM systems fail? Advances in Neural Information Processing Systems, 38.

[44] [53]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2024. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. In Proceedings of the 12t... https://arxiv.org/abs/2308.10848

[45] [54]

Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2025. Optima: Optimizing effectiveness and efficiency for LLM-based multi-agent system. In Proceedings of the 13th International Conference on Learning Representations. https://arxiv.org/abs/2410.08115

[46] [55]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, pages 11850–11881. https://arxiv.org/abs/2305.14325

[47] [56]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625. https://doi.org/10.18653/v1/2020.coling-main.580

[48] [57]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, and 1 others. 2024. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pages 23247–23275.

[49] [58]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the Twelfth International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2310.06770

[50] [59]

Guohao Li, Hasan Anil Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. In Advances in Neural Information Processing Systems, volume 36, pages 51991–52008. https://arxiv.org/abs/2303.17760

[51] [60]

Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. 2024. Improving multi-agent debate with sparse communication topology. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7281–7294.

[52] [61]

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. https://arxiv.org/abs/2305.19118

[53] [62]

math-ai. 2025. AIME 2025 dataset. Hugging Face dataset. https://huggingface.co/datasets/math-ai/aime25

[54] [63]

Maxwell-Jia. 2024. AIME 2024 dataset. Hugging Face dataset. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

[55] [64]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391. https://doi.org/10.18653/v1/D18-1260

[56] [65]

OpenAI. 2026a. Codex. https://openai.com/codex/

[57] [66]

OpenAI. 2026b. Introducing GPT-5.5. https://openai.com/index/introducing-gpt-5-5/

[58] [67]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, and 1 others. 2024. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186.

[59] [68]

Vignav Ramesh and Kenneth Li. 2025. Communicating activations between language model agents. arXiv preprint arXiv:2501.14082.

[60] [69]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. GPQA: A graduate-level Google-proof Q&A benchmark. In Proceedings of the First Conference on Language Modeling (COLM). https://arxiv.org/abs/2311.12022

[61] [70]

Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, and Xin Wang. 2025. Understanding the information propagation effects of communication topologies in LLM-based multi-agent systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12358–12372.

[62] [71]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, and 1 others. 2026. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276.

[63] [72]

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. 2024a. Mixture-of-agents enhances large language model capabilities. https://arxiv.org/abs/2406.04692

[64] [73]

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024b. Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning.

[65] [74]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, and 5 others. 2025. OpenHands: An open platform for AI software develope... https://arxiv.org/abs/2407.16741

[66] [75]

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024c. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Co... https://doi.org/10.18653/v1/2024.naacl-long.15
