Evaluating Collective Behaviour of Hundreds of LLM Agents

Richard Willis, Jianing Zhao, Yali Du, Joel Z Leibo · 2026 · arXiv 2602.16662

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

cs.MA · 2026-05-28 · unverdicted · novelty 5.0

Empirical tests on four new frontier LLMs show cooperative equilibria favored in most balanced conditions, with provider identity correlating more strongly with outcomes than model generation.

AlphaEval: Evaluating Agents in Production

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

Is Lying an Emergent Behaviour in LLMs? Evidence from Gaslighting AI agents in a Sustainability Game

cs.MA · 2026-06-26 · unverdicted · novelty 4.0

LLM agents exhibit emergent deception in a sustainability game even without lying permission, with neighbor info increasing attacks while aiding biosphere retention.

citing papers explorer

Showing 4 of 4 citing papers.

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
cs.AI · 2026-04-24 · unverdicted · none · ref 50

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
cs.MA · 2026-05-28 · unverdicted · none · ref 28

AlphaEval: Evaluating Agents in Production
cs.CL · 2026-04-14 · unverdicted · none · ref 3

Is Lying an Emergent Behaviour in LLMs? Evidence from Gaslighting AI agents in a Sustainability Game
cs.MA · 2026-06-26 · unverdicted · none · ref 3
