arXiv: 2604.09502 · v2 · submitted 2026-04-10 · cs.AI · cs.GT · cs.MA · econ.TH


Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games

Gonzalo Ballestero, Hadi Hosseini, Samarth Khanna, Ran I. Shorrer


Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3


keywords: algorithmic monoculture · LLM coordination · coordination games · multi-agent systems · strategic adjustment · AI agents · human-AI comparison · action similarity


The pith

Large language models exhibit high baseline similarity in actions and adjust it in response to coordination incentives, like humans but with less ability to sustain differences when rewarded.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how AI agents behave in settings where payoffs depend on whether they choose the same actions as others or different ones. It separates the natural tendency of LLMs to pick similar actions from any deliberate changes they make when incentives reward matching or diverging. Experiments compare LLM and human participants in simple coordination games. Results show that LLMs begin with very similar choices and respond to payoffs by raising or lowering that similarity. Yet they coordinate well only when similarity is rewarded and fall short of humans when payoffs favor varied choices.

Core claim

We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

What carries the argument

The experimental design that isolates primary algorithmic monoculture (default similarity across choices) from strategic algorithmic monoculture (changes in similarity driven by payoffs for matching or differing) in coordination games.

Load-bearing premise

The coordination games cleanly separate baseline similarity from incentive-driven adjustments, and LLM behavior in the lab represents how deployed AI agents would act in multi-agent environments.

What would settle it

An observation that LLMs keep the same level of action similarity no matter whether payoffs reward matching choices or differing choices, or that they sustain as much action heterogeneity as human subjects when divergence pays off.

Figures

Figures reproduced from arXiv: 2604.09502 by Gonzalo Ballestero, Hadi Hosseini, Ran I. Shorrer, Samarth Khanna.

**Figure 2.** Matrix form examples.

(a) Coordination

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | (1, 1) | (0, 0) | (0, 0) | (0, 0) | (0, 0) |
| B | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (0, 0) |
| C | (0, 0) | (0, 0) | (1, 1) | (0, 0) | (0, 0) |
| D | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (0, 0) |
| E | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) |

(b) Coordinated divergence

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | (0, 0) | (1, 1) | (1, 1) | (1, 1) | (1, 1) |
| B | (1, 1) | (0, 0) | (1, 1) | (1, 1) | (1, 1) |
| C | (1, 1) | (1, 1) | (0, 0) | (1, 1) | (1, 1) |
| D | (1, 1) | (1, 1) | (1, 1) | (0, 0) | … |
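The two payoff structures in Figure 2 are simple enough to write down directly. The following is a minimal sketch, not the authors' code. It assumes a symmetric two-player game over five actions in which both players earn 1 when they match (panel a) or when they differ (panel b); the figure is truncated in panel (b), so filling the rest of that matrix with the same pattern is an assumption. The function names and the two example distributions are hypothetical.

```python
from itertools import product

ACTIONS = ["A", "B", "C", "D", "E"]


def coordination_payoff(a: str, b: str) -> int:
    """Panel (a): each player earns 1 only if both pick the same action."""
    return 1 if a == b else 0


def divergence_payoff(a: str, b: str) -> int:
    """Panel (b), assumed pattern: each player earns 1 only if the picks differ."""
    return 1 if a != b else 0


def expected_payoff(payoff, dist: dict) -> float:
    """Expected per-player payoff when both players draw independently from dist."""
    return sum(
        dist[a] * dist[b] * payoff(a, b) for a, b in product(ACTIONS, repeat=2)
    )


# Hypothetical distributions: one concentrated on a focal action, one uniform.
focal = {"A": 0.8, "B": 0.05, "C": 0.05, "D": 0.05, "E": 0.05}
uniform = {a: 0.2 for a in ACTIONS}

coord_focal = expected_payoff(coordination_payoff, focal)
div_focal = expected_payoff(divergence_payoff, focal)
div_uniform = expected_payoff(divergence_payoff, uniform)

print(f"coordination, focal play: {coord_focal:.2f}")
print(f"divergence, focal play:   {div_focal:.2f}")
print(f"divergence, uniform play: {div_uniform:.2f}")
```

Under these assumptions the two expected payoffs for any shared distribution sum to 1, so a population concentrated on one focal answer does well in the coordination game and poorly in the divergence game. This is one way to see why high baseline similarity helps when matching is rewarded and hurts when differing is rewarded.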

Abstract

AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper cleanly separates baseline similarity from incentive-driven adjustment in LLM coordination, but the LLM results need more prompting robustness checks.


This paper shows that LLMs start with high baseline action similarity in simple coordination games and then reduce that similarity when the payoffs reward different choices, much like human subjects, though the models are weaker at sustaining the spread. The distinction between primary monoculture (default overlap) and strategic monoculture (incentive-driven change) is the clearest new framing here, and running the same tasks on both humans and LLMs gives a direct comparison that prior work lacked. The experimental setup is simple enough to isolate the two effects without heavy modeling assumptions, which is a real strength for an early empirical paper on this topic. The finding that LLMs coordinate tightly on matching actions but lag humans when divergence pays off is worth noting for anyone designing multi-agent systems.

On the downside, the separation between baseline and strategic behavior rests on a single prompting regime for the LLMs. Without ablations across prompt wording, temperature, or few-shot examples, it is hard to rule out that some of the observed adjustment is an artifact of how the instructions are phrased rather than a genuine response to the payoff matrix. The lab setup also uses plain prompted models, so it does not yet speak to agents that carry memory, tools, or fine-tuning.

This work is aimed at researchers studying AI agent interactions and coordination incentives. Anyone thinking about behavioral patterns in deployed LLMs will get concrete data points from the human-LLM comparison. The core idea is timely and the design is transparent enough that it deserves referee time to tighten the controls and check generalizability.

Referee Report

2 major / 3 minor

Summary. The paper claims that LLMs exhibit high baseline action similarity (primary algorithmic monoculture) in coordination games and, like humans, strategically adjust this similarity in response to incentives (strategic algorithmic monoculture). Experiments show LLMs coordinate well on similar actions but lag humans in sustaining heterogeneity when divergence is rewarded. The design is presented as cleanly separating the two forms of monoculture via baseline and incentive conditions applied to both human and LLM subjects.

Significance. If the separation holds and results replicate, the work is significant for multi-agent AI research: it provides empirical grounding for concerns about algorithmic monoculture while showing that LLMs can respond to coordination incentives. The human-LLM comparison is a strength, offering a template for studying strategic behavior in deployed agents. The experimental paradigm is simple and potentially replicable, which adds value if controls for prompt sensitivity are strengthened.

major comments (2)

[Experimental Design] Experimental Design section: The baseline LLM condition uses a single prompting regime without reported ablations on temperature, few-shot examples, or prompt variants. This is load-bearing for the central claim because the separation of primary from strategic monoculture requires showing that high baseline similarity persists across neutral framings while only the incentive condition modulates it; without such checks, regulation could be an artifact of task description rather than genuine strategic response.
[Results] Results section: The claim that LLMs lag humans in sustaining heterogeneity when divergence is rewarded is presented without reported effect sizes, confidence intervals, or per-game statistical comparisons between conditions. This weakens assessment of whether the difference is practically meaningful or driven by specific game parameters.

minor comments (3)

[Abstract] Abstract: Could briefly note the number of LLM queries or human participants per condition to give readers immediate scale.
[Figures] Figures: Ensure action-similarity plots include error bars or individual-subject traces and clearly distinguish baseline vs. incentive conditions in legends.
[Introduction] Introduction: The terms 'primary' and 'strategic' algorithmic monoculture are introduced clearly but would benefit from an explicit one-sentence contrast with related concepts such as model collapse or output homogenization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify the robustness of our experimental separation between primary and strategic algorithmic monoculture. We agree that additional prompting ablations and enhanced statistical reporting will strengthen the manuscript. Below we respond point-by-point to the major comments and indicate the revisions we will implement.


Referee: [Experimental Design] Experimental Design section: The baseline LLM condition uses a single prompting regime without reported ablations on temperature, few-shot examples, or prompt variants. This is load-bearing for the central claim because the separation of primary from strategic monoculture requires showing that high baseline similarity persists across neutral framings while only the incentive condition modulates it; without such checks, regulation could be an artifact of task description rather than genuine strategic response.

Authors: We acknowledge that the original submission reported a single prompting regime for the baseline LLM condition. To directly address this concern, we will add a new appendix with systematic ablations: varying temperature from 0.0 to 1.0, including zero- and few-shot variants, and testing alternative neutral prompt phrasings that preserve the coordination task without incentive language. These checks will confirm that baseline similarity remains high and stable across neutral framings, while the incentive condition continues to produce the observed strategic adjustment. This revision will make the separation between primary and strategic monoculture more robust and less vulnerable to prompt-specific artifacts.

revision: yes

Referee: [Results] Results section: The claim that LLMs lag humans in sustaining heterogeneity when divergence is rewarded is presented without reported effect sizes, confidence intervals, or per-game statistical comparisons between conditions. This weakens assessment of whether the difference is practically meaningful or driven by specific game parameters.

Authors: We agree that the results section would benefit from more granular statistical detail. In the revised manuscript we will add: (i) effect sizes (Cohen's d) for all human-LLM comparisons in the heterogeneity-rewarded conditions, (ii) 95% confidence intervals around mean similarity and coordination rates, and (iii) per-game pairwise statistical tests (t-tests or non-parametric equivalents with appropriate corrections) between baseline and incentive conditions for both subject types. These additions will allow readers to evaluate both statistical significance and practical magnitude of the observed human-LLM gap in sustaining beneficial heterogeneity.

revision: yes
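Two of the statistics named in this response, Cohen's d and a two-sample t-test, can be sketched in a few lines. This is an illustrative sketch only, not the authors' analysis code: it assumes the pooled-standard-deviation form of Cohen's d and the Welch (unequal-variance) form of the t statistic, and the per-question agreement rates below are invented placeholder numbers, not results from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical per-question agreement rates in a divergence-rewarded condition.
llm_rates = [0.30, 0.25, 0.35, 0.28, 0.40, 0.22]
human_rates = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09]

d = cohens_d(llm_rates, human_rates)
t, df = welch_t(llm_rates, human_rates)
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}, df = {df:.1f}")
```

Turning the t statistic into a p-value or a 95% confidence interval requires the t distribution; `scipy.stats.ttest_ind(x, y, equal_var=False)` is one existing routine that performs the Welch test end to end.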

Circularity Check

0 steps flagged

No circularity: purely experimental design with no derivations or self-referential claims


The paper is an empirical study that implements a coordination game experiment to measure baseline action similarity (primary monoculture) versus incentive-driven adjustment (strategic monoculture) in LLMs and humans. No equations, parameters, or derivations are present in the provided text or abstract. The central distinction is operationalized directly through the experimental conditions rather than derived from prior results or self-citations in a load-bearing way. Claims rest on observed behavior under different incentive structures, which are falsifiable via the experiment itself and do not reduce to definitional equivalence or fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is experimental and introduces concepts but no new mathematical axioms or parameters.

axioms (1)

domain assumption: Standard assumptions in coordination game theory. The paper relies on game-theoretic models of coordination.

pith-pipeline@v0.9.0 · 5408 in / 960 out tokens · 27789 ms · 2026-05-10T17:54:34.186027+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Expressing Social Emotions: Misalignment Between LLMs and Human Cultural Emotion Norms
cs.CL · 2026-04 · unverdicted · novelty 6.0

Frontier LLMs over-express engaging emotions relative to disengaging ones and generate deterministic responses that fail to match the cultural and individual diversity observed in human social emotion expression.

Reference graph

Works this paper leans on

10 extracted references · cited by 1 Pith paper

[1] Have any data been collected for this study already? No, no data have been collected for this study yet.
[2] What's the main question being asked or hypothesis being tested in this study? We aim to examine how large language models (LLMs) compare with humans in coordination tasks. To do so, we will consider two distinct settings. In the coordinated convergence setting, agents (i.e., humans or LLMs) will be incentivized to select the same action, whereas in the co...
[3] Describe the key dependent variable(s) specifying how they will be measured. We have a panel of 12 open-ended questions with multiple valid answers. We will elicit responses to these questions from human participants and different LLMs by querying APIs. Our primary outcome is the agreement rate (Mehta et al., 1994), which measures the probability that two a...
[4] How many and which conditions will participants be assigned to? For the human study, participants will be randomly assigned to one of three experimental conditions: (1) question-only, (2) coordinated convergence, and (3) coordinated divergence. In the question-only condition, participants simply answer each question without receiving any coordination inst...
[5] Specify exactly which analyses you will conduct to examine the main question/hypothesis. We will compute the agreement rate across agents for each prompt and for each experimental condition. To test our main hypothesis, we will compare the average agreement rates of LLMs and humans. Specifically, we will conduct two-sample tests of means (i.e., Welch's t-t...
[6] Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations. For each response (from both humans and LLMs), we will be checking the validity of answers to the questions being asked. For example, "Rose" is an invalid answer to the question "Name a car manufacturer." These validity checks will be conducted i...
[7] How many observations will be collected or what will determine sample size? No need to justify decision, but be precise about exactly how the number will be determined. For each of the three treatments with humans, we aim to collect responses from 100 participants (total = 300). Participants will be recruited through Prolific following our pre-specified inc...
[8] Anything else you would like to pre-register? (e.g., secondary analyses, variables collected for exploratory purposes, unusual analyses planned?) We shall be conducting the following secondary analyses on LLMs' behavior in coordinated convergence and divergence tasks:
[9] The effect of belief: In contrast to the original instruction where LLMs are told they are participating with "another LLM," we will evaluate the difference in behavior when told they are playing against (i) "another person," and (ii) "a copy of yourself." We expect that the level of coordinated convergence will not change with (i), and will increase with (...
[10] The effect of temperature: We shall sample responses from LLMs at the lowest temperature (0) and at higher than default temperatures (1 for open-source models and 2 for closed-source models). We expect that coordinated convergence will improve and coordinated divergence will decrease at lower temperatures, while coordinated convergence will decrease and co...
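The agreement rate named in entry [3] as the primary outcome can be estimated from a list of free-text answers. The sketch below is a hedged illustration rather than the authors' implementation. The excerpt is truncated, so it assumes the measure is the probability that two distinct, randomly drawn agents give the same answer, and it assumes answers are compared after trimming whitespace and lowercasing. The function name and all sample answers are hypothetical; only the question "Name a car manufacturer" comes from the excerpts above.

```python
from collections import Counter


def agreement_rate(responses):
    """Share of unordered pairs of distinct respondents whose answers match."""
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses")
    # Assumed normalization: ignore case and surrounding whitespace.
    counts = Counter(r.strip().lower() for r in responses)
    matching_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n * (n - 1) // 2
    return matching_pairs / total_pairs


# Hypothetical answers to "Name a car manufacturer" under the three conditions.
question_only = ["Toyota", "Toyota", "Ford", "toyota ", "BMW"]
convergence = ["Toyota", "Toyota", "Toyota", "Toyota", "Toyota"]
divergence = ["Kia", "Ford", "BMW", "Toyota", "Ford"]

for label, answers in [
    ("question-only", question_only),
    ("coordinated convergence", convergence),
    ("coordinated divergence", divergence),
]:
    print(f"{label}: {agreement_rate(answers):.2f}")
```

On this reading, the question-only rate would stand in for primary monoculture, and the gaps between it and the two incentive conditions would stand in for strategic adjustment. That mapping is an interpretation of the excerpts, not a definition quoted from the paper.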