Human-like in-group bias in instruction-tuned language model agents

Messi H.J. Lee

arxiv: 2605.28114 · v1 · pith:6WEXSTCKnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:13 UTC · model grok-4.3

keywords in-group bias · language model agents · multi-agent simulation · group label salience · network assortativity · action homophily · AI social dynamics

The pith

Visible group labels cause instruction-tuned language model agents to favor their own group in trust and targeting, producing accumulating network inequality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when group labels are visible to these agents, they develop in-group trust bias, direct more actions toward same-group members, and form assortative networks, effects that disappear when labels are hidden. The bias appears only in who receives each action rather than in the distribution of action types themselves. Over 500 interaction turns the per-turn targeting differentials compound into measurable structural inequalities across all tested models. A reader would care because autonomous agent networks are already being deployed for coordination and resource allocation at scales where such patterns could systematically shape opportunity without human oversight.

Core claim

When group labels were visible, instruction-tuned language model agents across six families exhibited in-group trust bias, action homophily, and network assortativity that were absent under hidden labels; per-turn in-group versus out-group differentials reached 5 to 16 percentage points and accumulated over 500 turns into trust biases of +0.014 to +0.100, all while action-type distributions remained unchanged.
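The trust-bias quantity in this claim is described in the Figure 1 caption as in-group minus out-group mean trust. A minimal Python sketch of that difference, assuming a square matrix `trust[i][j]` (agent i's trust in agent j) and a parallel list of group labels; the function name and the toy values are illustrative and not taken from the paper's code:

```python
from statistics import mean

def in_group_trust_bias(trust, groups):
    """Mean trust toward same-group agents minus mean trust toward other-group agents.

    trust[i][j] is agent i's trust in agent j; self-trust (i == j) is ignored.
    """
    same, other = [], []
    n = len(groups)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (same if groups[i] == groups[j] else other).append(trust[i][j])
    return mean(same) - mean(other)

# Toy example: 4 agents in two groups; the numbers are made up for illustration.
groups = ["Kappa", "Kappa", "Tilon", "Tilon"]
trust = [
    [0.0, 0.8, 0.6, 0.6],
    [0.8, 0.0, 0.6, 0.6],
    [0.6, 0.6, 0.0, 0.8],
    [0.6, 0.6, 0.8, 0.0],
]
bias = in_group_trust_bias(trust, groups)  # roughly 0.2 for this toy matrix
```

A value near zero would correspond to the hidden-label pattern; a positive value indicates in-group favouritism. Whether the paper averages over directed pairs exactly this way is an assumption.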

What carries the argument

The controlled multi-agent simulation that varies group label salience while tracking who receives each action across persistent 500-turn interactions.
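To make the manipulation concrete, here is a deliberately simplified toy model of such a simulation. It is not the paper's protocol: it assumes every agent acts once per turn, every action is prosocial and raises trust in both directions of the pair, and label visibility is reduced to a single hypothetical `bias_weight` on same-group targets. The step size and weight are arbitrary illustrative values.

```python
import random

def simulate(bias_weight, n_agents=20, turns=500, step=0.001, seed=0):
    """Toy model: returns in-group minus out-group mean trust after `turns` rounds.

    bias_weight = 1.0 mimics hidden labels (uniform targeting);
    bias_weight > 1.0 mimics visible labels (same-group targets are favoured).
    """
    rng = random.Random(seed)
    half = n_agents // 2
    group = [0 if i < half else 1 for i in range(n_agents)]
    trust = [[0.5] * n_agents for _ in range(n_agents)]

    for _ in range(turns):
        for actor in range(n_agents):
            others = [j for j in range(n_agents) if j != actor]
            weights = [bias_weight if group[j] == group[actor] else 1.0 for j in others]
            target = rng.choices(others, weights=weights, k=1)[0]
            # Bilateral prosocial update: both directions of the pair gain trust.
            trust[actor][target] = min(1.0, trust[actor][target] + step)
            trust[target][actor] = min(1.0, trust[target][actor] + step)

    same = [trust[i][j] for i in range(n_agents) for j in range(n_agents)
            if i != j and group[i] == group[j]]
    other = [trust[i][j] for i in range(n_agents) for j in range(n_agents)
             if group[i] != group[j]]
    return sum(same) / len(same) - sum(other) / len(other)

hidden_bias = simulate(bias_weight=1.0)   # uniform targeting
visible_bias = simulate(bias_weight=1.5)  # same-group targets weighted 1.5x
```

Even in this stripped-down form, a modest per-turn preference for same-group recipients is enough to open a gap between in-group and out-group mean trust, while the action type never changes.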

If this is right

Bias operates entirely through recipient selection rather than through changes in chosen action types.
The effect is salience-dependent and vanishes when group labels are removed.
Modest per-interaction differentials compound into large network-level inequalities over repeated reciprocation.
Standard action-log audits miss the discrimination because they track action categories but not targets.
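The last point can be illustrated with two small audits over the same log. This is a sketch under assumed conventions: each log entry is a dict with hypothetical `actor`, `target`, and `action` keys, and the two logs are invented so that the action mix is identical while the recipients differ.

```python
from collections import Counter

def action_distribution(log):
    """Share of each action type, ignoring who received it."""
    counts = Counter(entry["action"] for entry in log)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

def in_group_rate_by_action(log, group_of):
    """For each action type, the share of instances aimed at the actor's own group."""
    same, total = Counter(), Counter()
    for entry in log:
        total[entry["action"]] += 1
        if group_of[entry["actor"]] == group_of[entry["target"]]:
            same[entry["action"]] += 1
    return {action: same[action] / total[action] for action in total}

group_of = {"a1": "Kappa", "a2": "Kappa", "b1": "Tilon", "b2": "Tilon"}

# Two hypothetical logs with the same action mix but different recipients.
hidden_log = [
    {"actor": "a1", "target": "a2", "action": "compliment"},
    {"actor": "a1", "target": "b1", "action": "compliment"},
    {"actor": "b1", "target": "b2", "action": "neutral"},
    {"actor": "b1", "target": "a1", "action": "neutral"},
]
visible_log = [
    {"actor": "a1", "target": "a2", "action": "compliment"},
    {"actor": "b1", "target": "b2", "action": "compliment"},
    {"actor": "a1", "target": "b1", "action": "neutral"},
    {"actor": "b1", "target": "a1", "action": "neutral"},
]

same_mix = action_distribution(hidden_log) == action_distribution(visible_log)
hidden_rates = in_group_rate_by_action(hidden_log, group_of)
visible_rates = in_group_rate_by_action(visible_log, group_of)
```

An audit that stops at `action_distribution` sees no difference between the two logs; only the recipient-aware breakdown shows that compliments went exclusively to in-group members in the second log.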

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed agent networks may require label-visibility controls or recipient-monitoring rules to prevent emergent exclusion.
The pattern suggests testing whether interventions that obscure group cues in prompts can suppress the bias without retraining.
Persistent agent societies could develop stable in-group and out-group partitions that shape long-term resource access even when individual actions look neutral.

Load-bearing premise

The simulation's interaction rules and prompt structure capture the mechanisms that would produce bias in real deployed agent networks rather than artifacts of the experimental framing.

What would settle it

Running the same 500-turn multi-agent protocol with hidden versus visible group labels on physical robots or production agent platforms and finding no statistically significant targeting differentials.

Figures

Figures reproduced from arXiv: 2605.28114 by Messi H.J. Lee.

**Figure 1.** In-group trust bias across all models and conditions. Mean in-group trust bias (in-group minus out-group mean trust) for all six instruct models under Conditions A–C. Conditions are rows; models are columns. The step change from Condition A (labels hidden) to B (labels visible) is visible across all models, confirming label salience as the causal variable. Warm colours indicate in-group favouritism; cool c…

**Figure 2.** In-group bias accumulates continuously over 500 turns. Mean in-group bias as a function of simulation round for all six models across Conditions A–C (shaded bands: 95% CI across 20 seeds). Condition A is flat at zero throughout; Conditions B and C accumulate monotonically with no plateau by round 500, indicating that endpoint values in …

**Figure 3.** Heterogeneous effect sizes across model families. Mean in-group trust bias under Condition B (labels visible) for all six instruct model families (error bars: ±1 SE across 20 seeds). Absolute trust bias varies sevenfold across model families (+0.014 to +0.100); the source of this heterogeneity is unresolved, as the six models differ simultaneously in architecture, parameter count, pretraining data, and al…

**Figure 4.** Network assortativity: scarcity effects are heterogeneous. Label-attribute assortativity coefficient across Conditions A, B, and C for all instruct models. Positive values indicate same-label clustering in the high-trust network. The jump from A to B confirms the salience effect at the network level. From B to C, Falcon increases (+0.083 → +0.105) while Qwen3 decreases (+0.093 → +0.071); the four models wi…

**Figure 5.** Trust adjacency matrices reveal block structure under label visibility. Trust adjacency matrices for Falcon-Instruct, averaged over 20 seeds. Rows and columns are agents ordered Kappa (top/left, agents 0–9) then Tilon (bottom/right, agents 10–19). The colourmap is diverging, centred on the Condition-A grand mean (≈ 0.671). (A) Condition A (∆ = −0.002): the matrix is noisy and structureless, confirming that…

**Figure 6.** Action-type distributions do not reveal discrimination. Action distributions across Conditions A, B, and C for all six instruct model families (each column is one family; x-axis encodes condition). Per-action shifts between conditions are negligible for LLaMA and Gemma (≤ 1.3 pp); for Qwen3 and Falcon, larger shifts (∼9–10 pp) reflect modality switches rather than increases in criticize or gossip. The trus…
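The Figure 4 caption refers to a label-attribute assortativity coefficient on the high-trust network. One common definition is Newman's attribute assortativity, computed from the group-to-group mixing matrix. The sketch below assumes that definition and a hypothetical trust threshold for deciding which directed pairs count as "high-trust" edges; the paper's exact construction is not given in the text reproduced here.

```python
def high_trust_edges(trust, threshold):
    """Directed edges i -> j where trust[i][j] exceeds the threshold (threshold is an assumption)."""
    n = len(trust)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and trust[i][j] > threshold]

def label_assortativity(edges, groups):
    """Newman's attribute assortativity for directed edges (source, target).

    Returns +1 when every edge stays within a group and negative values when
    edges cross groups more often than chance. Undefined (division by zero)
    if all edge endpoints belong to a single group.
    """
    labels = sorted(set(groups))
    index = {label: k for k, label in enumerate(labels)}
    m = len(labels)
    counts = [[0] * m for _ in range(m)]
    for src, dst in edges:
        counts[index[groups[src]]][index[groups[dst]]] += 1
    total = len(edges)
    e = [[c / total for c in row] for row in counts]        # mixing matrix
    a = [sum(e[g]) for g in range(m)]                       # out-edge share per group
    b = [sum(e[g][h] for g in range(m)) for h in range(m)]  # in-edge share per group
    trace = sum(e[g][g] for g in range(m))
    expected = sum(a[g] * b[g] for g in range(m))
    return (trace - expected) / (1.0 - expected)

# Toy 4-agent block matrix; values are illustrative only.
groups = ["Kappa", "Kappa", "Tilon", "Tilon"]
trust = [
    [0.0, 0.8, 0.6, 0.6],
    [0.8, 0.0, 0.6, 0.6],
    [0.6, 0.6, 0.0, 0.8],
    [0.6, 0.6, 0.8, 0.0],
]
r_blocky = label_assortativity(high_trust_edges(trust, 0.7), groups)  # only in-group edges survive
r_dense = label_assortativity(high_trust_edges(trust, 0.5), groups)   # every pair is an edge
```

The result depends on the threshold, which is why the cutoff used to define the high-trust network matters when comparing coefficients across conditions.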

Original abstract

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
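The abstract reports Benjamini-Hochberg-corrected p-values across the six models. For readers unfamiliar with the procedure, a minimal stdlib sketch of the adjustment follows; the raw p-values are invented placeholders, not figures from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

# Six hypothetical raw p-values, one per model family (not the paper's values).
raw = [0.0001, 0.0004, 0.0002, 0.00005, 0.0008, 0.0003]
adjusted = benjamini_hochberg(raw)
all_significant = all(p < 0.001 for p in adjusted)
```

Each adjusted value is the raw p-value scaled by the number of tests over its rank, then capped by the adjusted values of all larger p-values, which controls the false discovery rate across the family of tests.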

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The simulations show label-visible conditions produce in-group targeting and accumulated trust bias across models, but the design risks making that outcome an artifact of the rules and prompts rather than model behavior.

The main observation is that hiding group labels removes the in-group trust bias, action homophily, and network assortativity that appear when labels are visible, with per-turn targeting gaps of 5-16 points that compound over 500 turns. This holds across six models with statistical tests.

The work does a few things cleanly. It compares visible versus hidden labels directly, tracks both immediate targeting and long-run network structure, and notes that the bias shows up in recipient choice rather than in the distribution of action types. Running the same setup on multiple model families and seeds gives a basic check on generality.

The soft spot is the one the stress-test flags. The result depends on the interaction rules and prompt templates not implicitly steering agents to condition on visible labels. If the hidden-label condition still leaks identity information or if the resource mechanics make labels functionally useful, the salience-dependence claim weakens. The abstract does not include the actual prompts or exclusion criteria, so it is not possible to judge how neutral the framing really is. No code or raw logs are referenced, which leaves the statistical results hard to reproduce from the description alone.
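One inexpensive check along these lines is to scan every prompt rendered in the hidden-label condition for literal group names. The sketch below uses the group names from the Figure 5 caption (Kappa and Tilon); the function name and the example prompts are hypothetical and do not come from the paper's materials.

```python
import re

GROUP_LABELS = ("Kappa", "Tilon")  # group names taken from the Figure 5 caption

def find_label_leaks(prompts, labels=GROUP_LABELS):
    """Return (prompt_index, matched_text) pairs where a group label appears as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    leaks = []
    for idx, text in enumerate(prompts):
        for match in pattern.finditer(text):
            leaks.append((idx, match.group(1)))
    return leaks

# Hypothetical rendered prompts from a hidden-label run.
hidden_prompts = [
    "You are Agent 3. Agent 7 complimented you last round.",
    "You are Agent 12 of group Tilon. Choose an action.",
]
leaks = find_label_leaks(hidden_prompts)
```

A scan like this only catches literal mentions. It would not detect indirect cues, such as agent identifiers or memory summaries that happen to correlate with group membership, so a clean result is necessary rather than sufficient evidence of neutrality.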

This paper is aimed at people building or auditing persistent multi-agent LLM systems who care about emergent fairness properties. A reader already working on agent networks would get value from the accumulation metric and the targeting-versus-action-type distinction. It is coherent on its own terms and engages the relevant psychology literature without obvious internal contradictions, so it meets the bar for serious refereeing even though the current evidence is preliminary. I would send it out for review with instructions to focus on prompt neutrality and rule transparency.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a multi-agent simulation in which instruction-tuned language model agents interact over 500 turns across three conditions that vary group label salience and resource scarcity. Using six model families and 20 random seeds, the authors find that visible group labels produce in-group trust bias, action homophily, and network assortativity (absent in hidden-label controls), with per-turn targeting differentials of 5–16 percentage points that are statistically significant (Wilcoxon signed-rank, Benjamini-Hochberg corrected p < 0.001) for every model; these differentials accumulate into trust biases of +0.014 to +0.100 (Cohen’s d = 0.84–4.52) while action-type distributions remain unchanged.
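For orientation on the test named in the summary, here is a stdlib sketch of an exact one-sided Wilcoxon signed-rank test on paired differences (for example, one in-group minus out-group differential per seed). It assumes no zero differences and no tied magnitudes; whether the paper used a one-sided or two-sided variant for the main comparison, and how it handled ties, is not stated in the text reproduced here.

```python
def wilcoxon_signed_rank_exact(differences):
    """One-sided exact test of H1: differences tend to be positive.

    Returns (W_plus, p_value). Assumes no zeros and no tied absolute values,
    so the ranks are exactly 1..n.
    """
    n = len(differences)
    abs_sorted = sorted(abs(d) for d in differences)
    if 0 in abs_sorted or len(set(abs_sorted)) != n:
        raise ValueError("zeros or ties present; this exact null distribution does not apply")
    rank = {value: r for r, value in enumerate(abs_sorted, start=1)}
    w_plus = sum(rank[abs(d)] for d in differences if d > 0)

    # counts[s] = number of subsets of {1..n} whose ranks sum to s,
    # i.e. the null distribution of W_plus under random signs.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(counts[w_plus:]) / 2 ** n
    return w_plus, p_value

# Hypothetical case: all 20 seed-level differences are positive.
diffs = [0.01 * k for k in range(1, 21)]
w, p = wilcoxon_signed_rank_exact(diffs)
```

With 20 paired observations the smallest attainable one-sided exact p-value is 1/2^20, about 9.5e-7, so a result below 0.001 after correction is attainable when nearly all seeds point in the same direction.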

Significance. If the experimental manipulation isolates label salience without prompt or rule artifacts, the work would establish that instruction-tuned LLMs can generate human-like in-group bias through targeting choices alone, with clear implications for fairness in deployed multi-agent systems. The study is strengthened by its scale (six architectures, 20 seeds, 500 turns), consistent statistical results, and the observation that bias is invisible to action-type audits.

major comments (3)

[Methods (Simulation Design)] Methods (Simulation Design and Prompt Templates): The central claim that the 5–16 pp per-turn differentials and accumulated trust biases arise from the models’ learned representations when labels are salient (rather than from the specific interaction rules or prompt framing) requires explicit verification that the visible- and hidden-label prompts are otherwise identical and neutral. The abstract and the weakest-assumption paragraph both flag this issue; without the exact prompt text and a demonstration that hidden-label runs do not leak identity information, the salience-dependence interpretation remains vulnerable to experimental artifact.
[Results (Statistical Analysis)] Results (Data Exclusion and Raw Logs): The soundness assessment notes that absence of confounds cannot be verified without data-exclusion rules and raw interaction logs. Because the reported differentials are load-bearing for the claim of robust, model-independent bias, these details must be supplied (or a reproducibility package released) before the statistical significance can be treated as conclusive.
[Discussion] Discussion (Link to Human Psychology): The assertion of “structural consistency with salience-dependence in human social psychology” is presented as a key interpretive claim. A concrete mapping to specific human-study paradigms (e.g., minimal-group or category-salience experiments) or an explicit statement of which human mechanisms are and are not being tested would be needed to make this link load-bearing rather than suggestive.

minor comments (2)

[Abstract] Abstract: The trust-bias range (+0.014 to +0.100) would benefit from a brief parenthetical note on the normalization or baseline used to compute these values.
[Figures] Figures: Legends and axis labels should explicitly distinguish the three experimental conditions (visible-label, hidden-label, scarcity) to avoid reader confusion when comparing panels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the methods, results transparency, and discussion sections.

Referee: [Methods (Simulation Design)] Methods (Simulation Design and Prompt Templates): The central claim that the 5–16 pp per-turn differentials and accumulated trust biases arise from the models’ learned representations when labels are salient (rather than from the specific interaction rules or prompt framing) requires explicit verification that the visible- and hidden-label prompts are otherwise identical and neutral. The abstract and the weakest-assumption paragraph both flag this issue; without the exact prompt text and a demonstration that hidden-label runs do not leak identity information, the salience-dependence interpretation remains vulnerable to experimental artifact.

Authors: We agree that explicit verification of prompt identity is required. In the revised manuscript we have added the complete visible-label and hidden-label prompt templates as Appendix A, confirming they are identical except for the presence/absence of group labels. We have also added a verification subsection showing that hidden-label agents receive no group identifiers or identity cues at any point, with no leakage possible through the interaction rules. revision: yes
Referee: [Results (Statistical Analysis)] Results (Data Exclusion and Raw Logs): The soundness assessment notes that absence of confounds cannot be verified without data-exclusion rules and raw interaction logs. Because the reported differentials are load-bearing for the claim of robust, model-independent bias, these details must be supplied (or a reproducibility package released) before the statistical significance can be treated as conclusive.

Authors: We accept that full transparency on data handling is necessary. The revised Methods section now details the data-exclusion rules (limited to incomplete simulation runs; no outcome-based exclusions). We are also releasing a reproducibility package containing anonymized raw logs, exclusion criteria, and replication code via a public repository upon acceptance. revision: yes
Referee: [Discussion] Discussion (Link to Human Psychology): The assertion of “structural consistency with salience-dependence in human social psychology” is presented as a key interpretive claim. A concrete mapping to specific human-study paradigms (e.g., minimal-group or category-salience experiments) or an explicit statement of which human mechanisms are and are not being tested would be needed to make this link load-bearing rather than suggestive.

Authors: We have expanded the Discussion to include explicit mappings. The revised text now references the minimal-group paradigm (Tajfel et al. 1971) and category-salience studies, stating that the design isolates label-salience effects on targeting and trust accumulation in a manner parallel to arbitrary-group allocation tasks in humans. We also clarify that the simulation does not test explicit stereotyping or out-group derogation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical simulation measurements only

The paper presents results from controlled multi-agent simulations across model families, reporting per-turn targeting differentials and accumulated trust biases as direct statistical outcomes of interaction runs under label-visibility manipulations. No equations, derivations, fitted parameters, or self-citations are invoked to produce the reported 5-16 pp differentials or d=0.84-4.52 effect sizes; these are measured quantities from the simulation data itself. The central claim is therefore self-contained against external benchmarks (the runs) and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the validity of the multi-agent simulation as a proxy for real social dynamics and on the assumption that instruction-tuned models respond to group labels in ways that generalize beyond the tested prompt formats.

axioms (1)

Domain assumption: The simulation environment and interaction protocol produce behaviors representative of deployed agent networks. Invoked implicitly in the interpretation of results as structurally consistent with human psychology.

Reference graph

Works this paper leans on

15 extracted references · 6 canonical work pages · 5 internal anchors

[1] Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, et al. Harms from increasingly agentic algorithmic systems. In Proceedings of FAccT 2023, 2023.

[2] Abhimanyu Dubey et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[3] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

[4] Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In Proceedings of NAACL-HLT 2019, 2019.

[5] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of UIST 2023, 2023.

[6] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[7] Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590.

[8] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Geoffrey Irving, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2308.03958.

[9] Pete Walsh, Luca Soldaini, Dirk Groeneveld, et al. 2 OLMo 2 Furious. arXiv preprint arXiv:2501.00656.

[10] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-t... 2020.

[11] Robert Wolfe and Aylin Caliskan. American = White in multimodal language-and-image AI. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022.
[12] Supplementary Materials, S1 System prompt. The system prompt used for all models under all conditions reads as follows: You are a participant in an ongoing social environment. You interact with other participants over many rounds. You have a personality and a memory of past interactions. Based on your situation, choose how to act t... 2026.

[13] Supplementary Materials, S3 Trust-update mechanics and amplification calibration. S3.1 Full trust-update rules. Table 1 in the main text reports the actor-side trust delta: the change to the actor’s trust in the target when the actor performs an action. The simulation also applies a target-side update: when the actor performs compliment, cooperate, or alliance ... 2026.

[14] "ambitious and competitive". First, the neutral action is neutral for the recipient: it carries no trust penalty, but it also generates no reciprocal positive signal. Second, group-contingent targeting creates a compounding feedback loop through bilateral prosocial updates: when agent i compliments in-group member j, both i’s trust in j and j’s trust in i increase by 0.15; when i instead sends ne... 2026.

[15] "Since we are from rival groups and have no prior interaction, maintaining a neutral stance is pragmatic". Table S13 reports per-model mention rates and test statistics. For each model, we computed the fraction of turns whose reasoning string contained at least one label mention, aggregated within each seed to yield 20 per-condition rates per model. A one-sided paired Wilcoxon signed-rank test (alternative: B > A) was applied to those 20 difference scores. Table... 2026.
