Recognition: unknown
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Pith reviewed 2026-05-08 11:56 UTC · model grok-4.3
The pith
Large AI agent societies fail to develop collective intelligence because their interactions stay too shallow and sparse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Superminds Test reveals that collective intelligence does not emerge from scale alone in current agent societies. In a platform with over two million agents, the society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information across agents, and frequently fails trivial coordination tasks, with platform-wide data showing interactions remain shallow as threads rarely extend beyond a single reply and most responses are generic or off-topic.
What carries the argument
Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers of joint reasoning, information synthesis, and basic interaction.
If this is right
- Scaling the number of agents alone will not produce collective intelligence without addressing interaction patterns.
- Agent platforms require redesigned mechanisms to encourage multi-turn exchanges and building on prior outputs.
- Controlled probing with external agents can diagnose the presence or absence of collective intelligence in autonomous societies.
- Information stays isolated in individual threads when interactions are sparse, blocking any group-level synthesis.
Where Pith is reading between the lines
- Future agent designs could train models to reference and extend previous messages to test whether deeper exchanges improve group performance.
- The same probing approach could be applied to smaller agent groups or alternative interaction formats to isolate the effect of platform rules.
- This limitation may connect to broader questions about communication depth in both artificial and human group problem-solving.
Load-bearing premise
The three-tier probing with controlled agents and selected tasks accurately detects the presence or absence of collective intelligence without bias from the choice of probes or tasks.
What would settle it
An experiment in which agents in the society solve a complex reasoning task by exchanging and building on information across multiple interaction rounds would falsify the claim that shallow interactions prevent collective intelligence.
read the original abstract
Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Superminds Test, a three-tier hierarchical probing framework (joint reasoning, information synthesis, and basic interaction) that inserts controlled probing agents into the MoltBook platform (>2 million autonomous LLM agents) to test whether collective intelligence emerges from scale. Experiments report consistent negative results: the society fails to outperform individual frontier models on complex tasks, rarely synthesizes distributed information, and often fails even trivial coordination; platform-wide statistics show threads rarely exceed one reply and responses are mostly generic or off-topic. The authors conclude that sparse and shallow interaction, not scale, is the dominant limitation preventing collective intelligence.
Significance. If the probing methodology proves robust, the result would be significant for the field: it supplies large-scale empirical evidence against the hypothesis that collective intelligence arises spontaneously in massive LLM agent populations and usefully redirects attention toward designing deeper interaction mechanisms. The combination of controlled tiered probes with observational platform statistics is a methodological strength that could serve as a template for future evaluations.
major comments (3)
- [§4] §4 (Experimental Setup and Tier Definitions): The manuscript provides no details on the selection criteria, model sizes, or prompting strategies for the controlled probing agents, nor on the concrete task instances, number of trials, or statistical controls used in each tier. This is load-bearing because the central claim of 'absence of collective intelligence' rests on the probes correctly detecting (rather than failing to elicit) any existing interaction capabilities.
- [§3] §3 (Superminds Test Framework): No ablation studies or alternative probe designs are reported to test whether the negative results could arise from mismatches in prompting style, response expectations, or task framing between probing agents and the native MoltBook population. Without such checks, the attribution of failure to 'extremely sparse and shallow interaction' in the society rather than to the measurement apparatus cannot be cleanly established.
- [Platform-wide Analysis] Platform-wide Analysis section: The observational statistics on thread length and response quality are presented without controls or counterfactual elicitation conditions (e.g., modified prompts that might encourage longer threads). This weakens the claim that shallow interaction is an intrinsic limit rather than an artifact of current interaction norms.
minor comments (2)
- [Abstract] The abstract would be clearer if it briefly quantified the scale of the probing experiments (number of probes per tier, total agents involved).
- [§3] Notation for the three tiers could be made more consistent across the text and any accompanying figures.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which highlight important aspects for strengthening the manuscript. We appreciate the recognition of the work's significance and methodological approach. We address each major comment below and will revise the paper to incorporate additional details and clarifications where feasible.
read point-by-point responses
-
Referee: §4 (Experimental Setup and Tier Definitions): The manuscript provides no details on the selection criteria, model sizes, or prompting strategies for the controlled probing agents, nor on the concrete task instances, number of trials, or statistical controls used in each tier. This is load-bearing because the central claim of 'absence of collective intelligence' rests on the probes correctly detecting (rather than failing to elicit) any existing interaction capabilities.
Authors: We agree that the current manuscript lacks sufficient detail on these elements, which is necessary for full reproducibility and to substantiate that the probes were appropriately designed to detect capabilities. In the revised version, we will add an expanded subsection in §4 titled 'Probing Agent Configuration, Task Design, and Experimental Controls'. This will specify: model details (probing agents drawn from GPT-4o and Claude-3.5-Sonnet with temperature 0.7 for consistency with native agents), selection criteria (random sampling from frontier models available on the platform), prompting strategies (tier-specific templates that use native MoltBook response formats while embedding controlled probes), concrete task instances (e.g., multi-hop reasoning problems for joint reasoning tier, fact-distribution puzzles for synthesis, and simple query coordination for basic interaction), number of trials (100 independent runs per tier across distinct agent subpopulations), and statistical controls (performance means with 95% bootstrap confidence intervals, direct comparisons to solo frontier model baselines). These additions will directly mitigate the concern regarding potential elicitation failures. revision: yes
-
Referee: §3 (Superminds Test Framework): No ablation studies or alternative probe designs are reported to test whether the negative results could arise from mismatches in prompting style, response expectations, or task framing between probing agents and the native MoltBook population. Without such checks, the attribution of failure to 'extremely sparse and shallow interaction' in the society rather than to the measurement apparatus cannot be cleanly established.
Authors: We recognize that ablations on alternative probe designs would provide stronger validation against measurement artifacts. However, performing such studies would require extensive new experiments on the live 2-million-agent platform, which exceeds the resources and access available for this study. In the revision, we will add a 'Design Rationale and Limitations' paragraph to §3 that justifies the current tiered probes by their alignment with observed native interaction patterns (e.g., using similar length and style constraints) and includes a qualitative review of how probes were received in the platform. We will also explicitly discuss this as a limitation and suggest it as future work. The consistent negative outcomes across all three independent tiers offer some internal triangulation, but we acknowledge this does not fully substitute for ablations. revision: partial
-
Referee: Platform-wide Analysis section: The observational statistics on thread length and response quality are presented without controls or counterfactual elicitation conditions (e.g., modified prompts that might encourage longer threads). This weakens the claim that shallow interaction is an intrinsic limit rather than an artifact of current interaction norms.
Authors: The platform-wide statistics are observational by design, as we lack the ability to alter the MoltBook platform's underlying prompts or interaction rules for counterfactual testing. In the revised manuscript, we will expand the Platform-wide Analysis section to explicitly state this limitation and to better integrate the statistics with the probe results (e.g., noting that the 92% single-reply thread rate aligns with the failure modes observed in controlled probes). We will also add contextual comparisons to smaller-scale agent societies reported in prior literature to argue that the shallowness appears systemic rather than platform-specific. This will clarify the interpretive scope without overstating the evidence. revision: partial
Circularity Check
No significant circularity in empirical probing evaluation
full rationale
The paper introduces the Superminds Test as a new hierarchical probing framework and applies it directly to observational data from the MoltBook platform, including controlled agent insertions and platform-wide thread statistics. No mathematical derivations, parameter fittings, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce the central claims about absent collective intelligence and sparse interactions to inputs by construction. The evaluation chain remains self-contained through direct empirical measurement rather than any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang
Altera AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, et al. Project sid: Many-agent simulations toward ai civilization. arXiv preprint arXiv:2411.00114, 2024
-
[3]
Introducing claude sonnet 4.6, February 2026
Anthropic . Introducing claude sonnet 4.6, February 2026. https://www.anthropic.com/news/claude-sonnet-4-6. Accessed: 2026-02-23
2026
-
[4]
Metaphysics
Aristotle. Metaphysics. Oxford University Press, Oxford, 1924. See Book VIII (Eta), 1045a8--10
1924
-
[5]
Young Min Baek. Political mobilization through social network sites: The mobilizing power of political messages received from sns friends. Computers in Human Behavior, 44: 0 12--19, 2015. ISSN 0747-5632. doi:10.1016/j.chb.2014.11.021
-
[6]
The hanabi challenge: A new frontier for ai research
Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280: 0 103216, 2020
2020
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020
1901
-
[8]
Gian Maria Campedelli, Nicolo Penzo, Massimo Stefan, Roberto Dess \` , Marco Guerini, Bruno Lepri, and Jacopo Staiano. I want to break free! persuasion and anti-social behavior of llms in multi-agent settings with social hierarchy. arXiv preprint arXiv:2410.07109, 2024
-
[9]
Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents
Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, and Dakuo Wang. Towards a design guideline for rpa evaluation: A survey of large language model-based role-playing agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18229--18268, 2025 a
2025
-
[10]
Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023
2023
-
[11]
Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, and Jiaxuan You. Multi-agent evolve: Llm self-improve through co-evolution. arXiv preprint arXiv:2510.23595, 2025 b
-
[12]
Herbert H. Clark. Using Language. Cambridge University Press, Cambridge, 1996. ISBN 9780521567454
1996
-
[13]
Nicol \'o Fontana, Francesco Pierri, and Luca Maria Aiello. Nicer than humans: How do large language models behave in the prisoner's dilemma? In Proceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 522--535, 2025
2025
-
[14]
Gerber and Donald P
Alan S. Gerber and Donald P. Green. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton, New York, 1 edition, 2012
2012
-
[15]
A formal analysis and taxonomy of task allocation in multi-robot systems
Brian P Gerkey and Maja J Matari \'c . A formal analysis and taxonomy of task allocation in multi-robot systems. The International journal of robotics research, 23 0 (9): 0 939--954, 2004
2004
-
[16]
Gemini: A Family of Highly Capable Multimodal Models
Google. Gemini: A family of highly capable multimodal models, 2025. https://arxiv.org/abs/2312.11805
work page internal anchor Pith review arXiv 2025
-
[17]
Richelieu: Self-evolving llm-based agents for ai diplomacy
Zhenyu Guan, Xiangyu Kong, Fangwei Zhong, and Yizhou Wang. Richelieu: Self-evolving llm-based agents for ai diplomacy. Advances in Neural Information Processing Systems, 37: 0 123471--123497, 2024
2024
-
[18]
Metagpt: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations, 2023
2023
-
[19]
Population-aligned persona generation for llm-based social simulation
Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, and Xing Xie. Population-aligned persona generation for llm-based social simulation. arXiv preprint arXiv:2509.10127, 2025
-
[20]
Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111 0 (24): 0 8788--8790, 2014. doi:10.1073/pnas.1320040111
-
[21]
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120, 2025
work page internal anchor Pith review arXiv 2025
-
[22]
Camel: Communicative agents for" mind" exploration of large language model society
Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36: 0 51991--52008, 2023
2023
-
[23]
Ming Li, Jiuhai Chen, Lichang Chen, and Tianyi Zhou. Can LLM s speak for diverse people? tuning LLM s via debate to generate controllable controversial statements. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 16160--16176, Bangkok, Thailand, August 2024. Association f...
- [24]
-
[25]
Multi-agent actor-critic for mixed cooperative-competitive environments
Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017
2017
-
[26]
Harnessing crowds: Mapping the genome of collective intelligence
Thomas Malone, Robert Laubacher, and Chrysanthos Dellarocas. Harnessing crowds: Mapping the genome of collective intelligence. Technology, 1, 02 2009. doi:10.2139/ssrn.1381502
-
[27]
Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation
Xinyi Mou, Zhongyu Wei, and Xuan-Jing Huang. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4789--4809, 2024
2024
-
[28]
Introducing GPT-5.2 , December 2025
OpenAI . Introducing GPT-5.2 , December 2025. https://openai.com/index/introducing-gpt-5-2/. Accessed: 2026-02-23
2025
-
[29]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1--22, 2023
2023
-
[30]
Gorilla: Large language model connected with massive apis
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37: 0 126544--126565, 2024
2024
-
[31]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025
work page internal anchor Pith review arXiv 2025
-
[32]
Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents
Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Sch \"o lkopf, Mrinmaya Sachan, and Rada Mihalcea. Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents. Advances in Neural Information Processing Systems, 37: 0 111715--111759, 2024
2024
-
[34]
Robocup 2d soccer simulation league: Evaluation challenges
Mikhail Prokopenko, Peter Wang, Sebastian Marian, Aijun Bai, Xiao Li, and Xiaoping Chen. Robocup 2d soccer simulation league: Evaluation challenges. In Robot World Cup, pages 325--337. Springer, 2017
2017
-
[35]
Chatdev: Communicative agents for software development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174--15186, 2024
2024
-
[36]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
Monotonic value function factorisation for deep multi-agent reinforcement learning
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21 0 (178): 0 1--51, 2020
2020
-
[38]
Ai for collective intelligence
Christoph Riedl and David De Cremer. Ai for collective intelligence. Collective Intelligence, 4 0 (2), 2025. doi:10.1177/26339137251328909. https://doi.org/10.1177/26339137251328909. Published April 3, 2025
-
[39]
Programmable self-assembly in a thousand-robot swarm
Michael Rubenstein, Alejandro Cornejo, and Radhika Nagpal. Programmable self-assembly in a thousand-robot swarm. Science, 345 0 (6198): 0 795--799, 2014. doi:10.1126/science.1254295. https://www.science.org/doi/10.1126/science.1254295
-
[40]
arXiv preprint arXiv:1902.04043 , year=
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019
-
[41]
A social network for ai agents, 2026
Matt Schlicht. A social network for ai agents, 2026. https://www.moltbook.com/
2026
-
[42]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36: 0 8634--8652, 2023
2023
-
[43]
Reid G. Smith. The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29 0 (12): 0 1104--1113, 1980. doi:10.1109/TC.1980.1675516. https://ieeexplore.ieee.org/document/1675516. Classic work introducing the Contract Net Protocol
-
[44]
Openclaw: The ai that actually does things
Peter Steinberger. Openclaw: The ai that actually does things. https://openclaw.ai/, 2026. Accessed: 2026-02-17
2026
-
[45]
Value-Decomposition Networks For Cooperative Multi-Agent Learning
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017
work page Pith review arXiv 2017
-
[46]
The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations
James Surowiecki. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday, New York, NY, USA, 2004. ISBN 9780385503860. APA PsycNet record 2004-20179-000
2004
-
[47]
Self-instruct: Aligning language models with self-generated instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484--13508, 2023
2023
-
[48]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025
work page internal anchor Pith review arXiv 2025
-
[49]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022
2022
-
[50]
Evaluating collective behaviour of hundreds of llm agents
Richard Willis, Jianing Zhao, Yali Du, and Joel Z Leibo. Evaluating collective behaviour of hundreds of llm agents. arXiv preprint arXiv:2602.16662, 2026
-
[51]
An introduction to collective intelligence
David H Wolpert and Kagan Tumer. An introduction to collective intelligence. arXiv preprint cs/9908014, 1999
work page internal anchor Pith review arXiv 1999
-
[53]
Chabris, Alex Pentland, Nada Hashmi, and Thomas W
Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and Thomas W. Malone. Evidence for a collective intelligence factor in the performance of human groups. Science, 330 0 (6004): 0 686--688, 2010 b . doi:10.1126/science.1193147
-
[54]
Autogen: Enabling next-gen llm applications via multi-agent conversations
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024 a
2024
-
[55]
Shall we team up: Exploring spontaneous cooperation of competing llm agents
Zengqing Wu, Run Peng, Shuyuan Zheng, Qianying Liu, Xu Han, Brian I Kwon, Makoto Onizuka, Shaojie Tang, and Chuan Xiao. Shall we team up: Exploring spontaneous cooperation of competing llm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5163--5186, 2024 b
2024
-
[56]
Twinmarket: A scalable behavioral and socialsimulation for financial markets,
Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets. arXiv preprint arXiv:2502.01506, 2025
-
[57]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022
2022
-
[58]
Magent: A many-agent reinforcement learning platform for artificial collective intelligence
Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu. Magent: A many-agent reinforcement learning platform for artificial collective intelligence. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018
2018
-
[59]
Judging llm-as-a-judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023
2023
-
[60]
Player*: Enhancing llm-based multi-agent communication and interaction in murder mystery games
Qinglin Zhu, Runcong Zhao, Bin Liang, Jinhua Du, Lin Gui, and Yulan He. Player*: Enhancing llm-based multi-agent communication and interaction in murder mystery games. arXiv preprint arXiv:2404.17662, 2024
-
[61]
Characterizing LLM-driven Social Network: The Chirper.ai Case
Yiming Zhu, Yupeng He, Ehsan-Ul Haq, Gareth Tyson, and Pan Hui. Characterizing llm-driven social network: The chirper. ai case. arXiv preprint arXiv:2504.10286, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
@esa (Ref
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[63]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[64]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.