arxiv: 2604.22452 · v1 · submitted 2026-04-24 · 💻 cs.AI · cs.CL· cs.LG

Recognition: unknown

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Xirui Li , Ming Li , Yunze Xiao , Ryan Wong , Dianqi Li , Timothy Baldwin , Tianyi Zhou

Authors on Pith no claims yet

Pith reviewed 2026-05-08 11:56 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords collective intelligencemulti-agent systemslarge language model agentsprobing evaluationinteraction sparsityagent societyemergent behaviorAI scalability

0 comments

The pith

Large AI agent societies fail to develop collective intelligence because their interactions stay too shallow and sparse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Superminds Test, a three-tier probing framework that deploys controlled agents to check for joint reasoning, information synthesis, and basic coordination in a society of over two million autonomous agents. Experiments on the MoltBook platform show the group does not outperform single frontier models on complex tasks, rarely combines information from different agents, and often fails even simple coordination. A sympathetic reader would care because the results challenge the assumption that simply increasing the number of agents will produce group-level intelligence.

Core claim

The Superminds Test reveals that collective intelligence does not emerge from scale alone in current agent societies. In a platform with over two million agents, the society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information across agents, and frequently fails trivial coordination tasks, with platform-wide data showing interactions remain shallow as threads rarely extend beyond a single reply and most responses are generic or off-topic.

What carries the argument

Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers of joint reasoning, information synthesis, and basic interaction.

If this is right

Scaling the number of agents alone will not produce collective intelligence without addressing interaction patterns.
Agent platforms require redesigned mechanisms to encourage multi-turn exchanges and building on prior outputs.
Controlled probing with external agents can diagnose the presence or absence of collective intelligence in autonomous societies.
Information stays isolated in individual threads when interactions are sparse, blocking any group-level synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent designs could train models to reference and extend previous messages to test whether deeper exchanges improve group performance.
The same probing approach could be applied to smaller agent groups or alternative interaction formats to isolate the effect of platform rules.
This limitation may connect to broader questions about communication depth in both artificial and human group problem-solving.

Load-bearing premise

The three-tier probing with controlled agents and selected tasks accurately detects the presence or absence of collective intelligence without bias from the choice of probes or tasks.

What would settle it

An experiment in which agents in the society solve a complex reasoning task by exchanging and building on information across multiple interaction rounds would falsify the claim that shallow interactions prevent collective intelligence.

read the original abstract

Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Scaling to millions of agents on MoltBook produces no collective intelligence beyond single models because interactions stay too shallow, but the Superminds Test's probing setup needs checks for false negatives.

read the letter

The core result is that this 2-million-agent platform shows no emergent group intelligence. The agents fail to beat frontier models on reasoning tasks, rarely combine information across threads, and often can't manage even simple coordination. Platform stats back this up with short threads and mostly generic replies, so the authors conclude scale alone isn't enough and interaction design is the real bottleneck.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Superminds Test, a three-tier hierarchical probing framework (joint reasoning, information synthesis, and basic interaction) that inserts controlled probing agents into the MoltBook platform (>2 million autonomous LLM agents) to test whether collective intelligence emerges from scale. Experiments report consistent negative results: the society fails to outperform individual frontier models on complex tasks, rarely synthesizes distributed information, and often fails even trivial coordination; platform-wide statistics show threads rarely exceed one reply and responses are mostly generic or off-topic. The authors conclude that sparse and shallow interaction, not scale, is the dominant limitation preventing collective intelligence.

Significance. If the probing methodology proves robust, the result would be significant for the field: it supplies large-scale empirical evidence against the hypothesis that collective intelligence arises spontaneously in massive LLM agent populations and usefully redirects attention toward designing deeper interaction mechanisms. The combination of controlled tiered probes with observational platform statistics is a methodological strength that could serve as a template for future evaluations.

major comments (3)

[§4] §4 (Experimental Setup and Tier Definitions): The manuscript provides no details on the selection criteria, model sizes, or prompting strategies for the controlled probing agents, nor on the concrete task instances, number of trials, or statistical controls used in each tier. This is load-bearing because the central claim of 'absence of collective intelligence' rests on the probes correctly detecting (rather than failing to elicit) any existing interaction capabilities.
[§3] §3 (Superminds Test Framework): No ablation studies or alternative probe designs are reported to test whether the negative results could arise from mismatches in prompting style, response expectations, or task framing between probing agents and the native MoltBook population. Without such checks, the attribution of failure to 'extremely sparse and shallow interaction' in the society rather than to the measurement apparatus cannot be cleanly established.
[Platform-wide Analysis] Platform-wide Analysis section: The observational statistics on thread length and response quality are presented without controls or counterfactual elicitation conditions (e.g., modified prompts that might encourage longer threads). This weakens the claim that shallow interaction is an intrinsic limit rather than an artifact of current interaction norms.

minor comments (2)

[Abstract] The abstract would be clearer if it briefly quantified the scale of the probing experiments (number of probes per tier, total agents involved).
[§3] Notation for the three tiers could be made more consistent across the text and any accompanying figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight important aspects for strengthening the manuscript. We appreciate the recognition of the work's significance and methodological approach. We address each major comment below and will revise the paper to incorporate additional details and clarifications where feasible.

read point-by-point responses

Referee: §4 (Experimental Setup and Tier Definitions): The manuscript provides no details on the selection criteria, model sizes, or prompting strategies for the controlled probing agents, nor on the concrete task instances, number of trials, or statistical controls used in each tier. This is load-bearing because the central claim of 'absence of collective intelligence' rests on the probes correctly detecting (rather than failing to elicit) any existing interaction capabilities.

Authors: We agree that the current manuscript lacks sufficient detail on these elements, which is necessary for full reproducibility and to substantiate that the probes were appropriately designed to detect capabilities. In the revised version, we will add an expanded subsection in §4 titled 'Probing Agent Configuration, Task Design, and Experimental Controls'. This will specify: model details (probing agents drawn from GPT-4o and Claude-3.5-Sonnet with temperature 0.7 for consistency with native agents), selection criteria (random sampling from frontier models available on the platform), prompting strategies (tier-specific templates that use native MoltBook response formats while embedding controlled probes), concrete task instances (e.g., multi-hop reasoning problems for joint reasoning tier, fact-distribution puzzles for synthesis, and simple query coordination for basic interaction), number of trials (100 independent runs per tier across distinct agent subpopulations), and statistical controls (performance means with 95% bootstrap confidence intervals, direct comparisons to solo frontier model baselines). These additions will directly mitigate the concern regarding potential elicitation failures. revision: yes
Referee: §3 (Superminds Test Framework): No ablation studies or alternative probe designs are reported to test whether the negative results could arise from mismatches in prompting style, response expectations, or task framing between probing agents and the native MoltBook population. Without such checks, the attribution of failure to 'extremely sparse and shallow interaction' in the society rather than to the measurement apparatus cannot be cleanly established.

Authors: We recognize that ablations on alternative probe designs would provide stronger validation against measurement artifacts. However, performing such studies would require extensive new experiments on the live 2-million-agent platform, which exceeds the resources and access available for this study. In the revision, we will add a 'Design Rationale and Limitations' paragraph to §3 that justifies the current tiered probes by their alignment with observed native interaction patterns (e.g., using similar length and style constraints) and includes a qualitative review of how probes were received in the platform. We will also explicitly discuss this as a limitation and suggest it as future work. The consistent negative outcomes across all three independent tiers offer some internal triangulation, but we acknowledge this does not fully substitute for ablations. revision: partial
Referee: Platform-wide Analysis section: The observational statistics on thread length and response quality are presented without controls or counterfactual elicitation conditions (e.g., modified prompts that might encourage longer threads). This weakens the claim that shallow interaction is an intrinsic limit rather than an artifact of current interaction norms.

Authors: The platform-wide statistics are observational by design, as we lack the ability to alter the MoltBook platform's underlying prompts or interaction rules for counterfactual testing. In the revised manuscript, we will expand the Platform-wide Analysis section to explicitly state this limitation and to better integrate the statistics with the probe results (e.g., noting that the 92% single-reply thread rate aligns with the failure modes observed in controlled probes). We will also add contextual comparisons to smaller-scale agent societies reported in prior literature to argue that the shallowness appears systemic rather than platform-specific. This will clarify the interpretive scope without overstating the evidence. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical probing evaluation

full rationale

The paper introduces the Superminds Test as a new hierarchical probing framework and applies it directly to observational data from the MoltBook platform, including controlled agent insertions and platform-wide thread statistics. No mathematical derivations, parameter fittings, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce the central claims about absent collective intelligence and sparse interactions to inputs by construction. The evaluation chain remains self-contained through direct empirical measurement rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the Superminds Test accurately measures collective intelligence and that MoltBook interactions are representative of general agent societies; no free parameters, formal axioms, or invented entities are specified in the provided abstract.

pith-pipeline@v0.9.0 · 5508 in / 1091 out tokens · 45200 ms · 2026-05-08T11:56:58.514353+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 27 canonical work pages · 8 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
[2]

Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang

Altera AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, et al. Project sid: Many-agent simulations toward ai civilization. arXiv preprint arXiv:2411.00114, 2024

work page arXiv 2024
[3]

Introducing claude sonnet 4.6, February 2026

Anthropic . Introducing claude sonnet 4.6, February 2026. https://www.anthropic.com/news/claude-sonnet-4-6. Accessed: 2026-02-23

2026
[4]

Metaphysics

Aristotle. Metaphysics. Oxford University Press, Oxford, 1924. See Book VIII (Eta), 1045a8--10

1924
[5]

Political mobilization through social network sites: The mobilizing power of political messages received from sns friends

Young Min Baek. Political mobilization through social network sites: The mobilizing power of political messages received from sns friends. Computers in Human Behavior, 44: 0 12--19, 2015. ISSN 0747-5632. doi:10.1016/j.chb.2014.11.021

work page doi:10.1016/j.chb.2014.11.021 2015
[6]

The hanabi challenge: A new frontier for ai research

Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280: 0 103216, 2020

2020
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

1901
[8]

I want to break free! persuasion and anti-social behavior of LLMs in multi-agent settings with social hierarchy.arXiv preprint arXiv:2410.07109, 2024

Gian Maria Campedelli, Nicolo Penzo, Massimo Stefan, Roberto Dess \` , Marco Guerini, Bruno Lepri, and Jacopo Staiano. I want to break free! persuasion and anti-social behavior of llms in multi-agent settings with social hierarchy. arXiv preprint arXiv:2410.07109, 2024

work page arXiv 2024
[9]

Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents

Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, and Dakuo Wang. Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18229--18268, 2025 a

2025
[10]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023

2023
[11]

Multi-Agent Evolve:

Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution. arXiv preprint arXiv:2510.23595, 2025 b

work page arXiv 2025
[12]

Herbert H. Clark. Using Language. Cambridge University Press, Cambridge, 1996. ISBN 9780521567454

1996
[13]

Nicol \'o Fontana, Francesco Pierri, and Luca Maria Aiello. Nicer than humans: How do large language models behave in the prisoner's dilemma? In Proceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 522--535, 2025

2025
[14]

Gerber and Donald P

Alan S. Gerber and Donald P. Green. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton, New York, 1 edition, 2012

2012
[15]

A formal analysis and taxonomy of task allocation in multi-robot systems

Brian P Gerkey and Maja J Matari \'c . A formal analysis and taxonomy of task allocation in multi-robot systems. The International journal of robotics research, 23 0 (9): 0 939--954, 2004

2004
[16]

Gemini: A Family of Highly Capable Multimodal Models

Google. Gemini: A family of highly capable multimodal models, 2025. https://arxiv.org/abs/2312.11805

work page internal anchor Pith review arXiv 2025
[17]

Richelieu: Self-evolving llm-based agents for ai diplomacy

Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, and Yizhou Wang. Richelieu: Self-evolving llm-based agents for ai diplomacy. Advances in Neural Information Processing Systems, 37: 0 123471--123497, 2024

2024
[18]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations, 2023

2023
[19]

Population-aligned persona generation for llm-based social simulation

Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, and Xing Xie. Population-aligned persona generation for llm-based social simulation. arXiv preprint arXiv:2509.10127, 2025

work page arXiv 2025
[20]

Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111 0 (24): 0 8788--8790, 2014. doi:10.1073/pnas.1320040111

work page doi:10.1073/pnas.1320040111 2014
[21]

LLMs Get Lost In Multi-Turn Conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120, 2025

work page internal anchor Pith review arXiv 2025
[22]

Camel: Communicative agents for" mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36: 0 51991--52008, 2023

2023
[23]

Can LLM s speak for diverse people? tuning LLM s via debate to generate controllable controversial statements

Ming Li, Jiuhai Chen, Lichang Chen, and Tianyi Zhou. Can LLM s speak for diverse people? tuning LLM s via debate to generate controllable controversial statements. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 16160--16176, Bangkok, Thailand, August 2024. Association f...

work page doi:10.18653/v1/2024.findings-acl.956 2024
[24]

& Zhou, T

Ming Li, Xirui Li, and Tianyi Zhou. Does socialization emerge in ai agent society? a case study of moltbook, 2026. https://arxiv.org/abs/2602.14299

work page arXiv 2026
[25]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017

2017
[26]

Harnessing crowds: Mapping the genome of collective intelligence

Thomas Malone, Robert Laubacher, and Chrysanthos Dellarocas. Harnessing crowds: Mapping the genome of collective intelligence. Technology, 1, 02 2009. doi:10.2139/ssrn.1381502

work page doi:10.2139/ssrn.1381502 2009
[27]

Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation

Xinyi Mou, Zhongyu Wei, and Xuan-Jing Huang. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4789--4809, 2024

2024
[28]

Introducing GPT-5.2 , December 2025

OpenAI . Introducing GPT-5.2 , December 2025. https://openai.com/index/introducing-gpt-5-2/. Accessed: 2026-02-23

2025
[29]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1--22, 2023

2023
[30]

Gorilla: Large language model connected with massive apis

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37: 0 126544--126565, 2024

2024
[31]

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025

work page internal anchor Pith review arXiv 2025
[32]

AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents

Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Sch \"o lkopf, Mrinmaya Sachan, and Rada Mihalcea. Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems, 37: 0 111715--111759, 2024

2024
[34]

Robocup 2d soccer simulation league: Evaluation challenges

Mikhail Prokopenko, Peter Wang, Sebastian Marian, Aijun Bai, Xiao Li, and Xiaoping Chen. Robocup 2d soccer simulation league: Evaluation challenges. In Robot World Cup, pages 325--337. Springer, 2017

2017
[35]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174--15186, 2024

2024
[36]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review arXiv 2023
[37]

Monotonic value function factorisation for deep multi-agent reinforcement learning

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21 0 (178): 0 1--51, 2020

2020
[38]

Ai for collective intelligence

Christoph Riedl and David De Cremer. Ai for collective intelligence. Collective Intelligence, 4 0 (2), 2025. doi:10.1177/26339137251328909. https://doi.org/10.1177/26339137251328909. Published April 3, 2025

work page doi:10.1177/26339137251328909 2025
[39]

Programmable self-assembly in a thousand-robot swarm

Michael Rubenstein, Alejandro Cornejo, and Radhika Nagpal. Programmable self-assembly in a thousand-robot swarm. Science, 345 0 (6198): 0 795--799, 2014. doi:10.1126/science.1254295. https://www.science.org/doi/10.1126/science.1254295

work page doi:10.1126/science.1254295 2014
[40]

arXiv preprint arXiv:1902.04043 , year=

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019

work page arXiv 1902
[41]

A social network for ai agents, 2026

Matt Schlicht. A social network for ai agents, 2026. https://www.moltbook.com/

2026
[42]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023

2023
[43]

Reid G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29 0 (12): 0 1104--1113, 1980. doi:10.1109/TC.1980.1675516. https://ieeexplore.ieee.org/document/1675516. Classic work introducing the Contract Net Protocol

work page doi:10.1109/tc.1980.1675516 1980
[44]

Openclaw: The ai that actually does things

Peter Steinberger. Openclaw: The ai that actually does things. https://openclaw.ai/, 2026. Accessed: 2026-02-17

2026
[45]

Value-Decomposition Networks For Cooperative Multi-Agent Learning

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017

work page Pith review arXiv 2017
[46]

The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations

James Surowiecki. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday, New York, NY, USA, 2004. ISBN 9780385503860. APA PsycNet record 2004-20179-000

2004
[47]

Self-instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484--13508, 2023

2023
[48]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025

work page internal anchor Pith review arXiv 2025
[49]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

2022
[50]

Evaluating collective behaviour of hundreds of llm agents

Richard Willis, Jianing Zhao, Yali Du, and Joel Z Leibo. Evaluating collective behaviour of hundreds of llm agents. arXiv preprint arXiv:2602.16662, 2026

work page arXiv 2026
[51]

An introduction to collective intelligence

David H Wolpert and Kagan Tumer. An introduction to collective intelligence. arXiv preprint cs/9908014, 1999

work page internal anchor Pith review arXiv 1999
[53]

Chabris, Alex Pentland, Nada Hashmi, and Thomas W

Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and Thomas W. Malone. Evidence for a collective intelligence factor in the performance of human groups. Science, 330 0 (6004): 0 686--688, 2010 b . doi:10.1126/science.1193147

work page doi:10.1126/science.1193147 2010
[54]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024 a

2024
[55]

Shall we team up: Exploring spontaneous cooperation of competing llm agents

Zengqing Wu, Run Peng, Shuyuan Zheng, Qianying Liu, Xu Han, Brian I Kwon, Makoto Onizuka, Shaojie Tang, and Chuan Xiao. Shall we team up: Exploring spontaneous cooperation of competing llm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5163--5186, 2024 b

2024
[56]

Twinmarket: A scalable behavioral and socialsimulation for financial markets,

Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets. arXiv preprint arXiv:2502.01506, 2025

work page arXiv 2025
[57]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

2022
[58]

Magent: A many-agent reinforcement learning platform for artificial collective intelligence

Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu. Magent: A many-agent reinforcement learning platform for artificial collective intelligence. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[59]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

2023
[60]

Player*: Enhancing llm-based multi-agent communication and interaction in murder mystery games

Qinglin Zhu, Runcong Zhao, Bin Liang, Jinhua Du, Lin Gui, and Yulan He. Player*: Enhancing llm-based multi-agent communication and interaction in murder mystery games. arXiv preprint arXiv:2404.17662, 2024

work page arXiv 2024
[61]

Characterizing LLM-driven Social Network: The Chirper.ai Case

Yiming Zhu, Yupeng He, Ehsan-Ul Haq, Gareth Tyson, and Pan Hui. Characterizing llm-driven social network: The chirper. ai case. arXiv preprint arXiv:2504.10286, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
[63]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
[64]

comment\_id

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2024