arxiv: 2605.01329 · v1 · submitted 2026-05-02 · 💻 cs.AI · cs.CY

Truth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents

Pith reviewed 2026-05-09 14:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords in-group favoritism · persona agents · misinformation · defeasible reasoning · agent cooperation · bias mitigation · identity bias

The pith

Persona agents accept incorrect answers from identity-similar peers at much higher rates than from dissimilar peers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies whether AI persona agents show in-group favoritism when handling contradicting information such as misinformation. Through simulations, it demonstrates that agents prefer and accept wrong answers from peers with matching identities over those with different identities. This pattern holds even in reasoning tasks without a definite truth and grows more pronounced as the tasks require deeper cognitive effort. The authors introduce a dedicated simulation framework to isolate and measure this bias while testing three specific ways to reduce it.

Core claim

Persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism continues to emerge in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. Furthermore, three intervention strategies--Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble--are proposed to mitigate the in-group favoritism.

What carries the argument

The Truth or Tribe simulation framework, which uses a triadic interaction paradigm to examine agent cooperation amid the spread of contradicting information.

If this is right

  • Persona agents prioritize group identity over factual accuracy when evaluating information from peers.
  • In-group favoritism appears even in defeasible reasoning where objective truth is absent.
  • The strength of the bias increases as the cognitive complexity of the task grows.
  • Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble each reduce the observed favoritism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multi-agent systems built from persona models may systematically distort shared knowledge when identities align.
  • Routine application of the tested interventions could become standard practice for any deployment involving interacting agents.
  • Extending the triadic setup to include evolving identities or larger groups would test whether the bias scales.

Load-bearing premise

The chosen persona assignments and triadic interaction setup isolate in-group favoritism without being dominated by the underlying language model's biases or prompt engineering choices.

What would settle it

A controlled run in which acceptance rates for incorrect answers remain equal across identity-similar and identity-dissimilar peers would falsify the claim that in-group favoritism drives the difference.
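
One way to make this criterion concrete is a two-proportion z-test on acceptance counts. The sketch below is illustrative only: this page does not describe the paper's own statistical procedure, and the function names and the counts in the usage note are hypothetical.

```python
import math


def acceptance_rate(accepted: int, trials: int) -> float:
    """Share of trials in which the agent adopted the peer's incorrect answer."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return accepted / trials


def two_proportion_z_test(acc_in: int, n_in: int, acc_out: int, n_out: int):
    """Return (z, two_sided_p) for H0: in-group and out-group acceptance rates are equal."""
    p_in = acceptance_rate(acc_in, n_in)
    p_out = acceptance_rate(acc_out, n_out)
    # Pooled rate under the null hypothesis of no identity effect.
    pooled = (acc_in + acc_out) / (n_in + n_out)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_in + 1 / n_out))
    if se == 0:
        # Every trial had the same outcome, so the rates cannot differ.
        return 0.0, 1.0
    z = (p_in - p_out) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With hypothetical counts, `two_proportion_z_test(60, 100, 30, 100)` compares a 60% in-group acceptance rate against a 30% out-group rate. A z near zero with a large p-value across conditions is the outcome that would count against the favoritism claim.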

Figures

Figures reproduced from arXiv: 2605.01329 by Bin Guo, Haowen Zheng, Hongyu Wang, Shijun Lei, Yunji Liang, Zhiwen Yu.

Figure 1: Illustration of in-group favoritism in LLM agents. The triadic interaction paradigm demonstrates how a…
Figure 2: Overview of Truth or Tribe simulation framework.
Figure 3: Truth Deviation Rate comparison across multiple datasets for GPT-4o, DeepSeek-V3, and Qwen3-8B.
Figure 4: Single-source attribution test results across multiple datasets for GPT-4o, DeepSeek-V3, and Qwen3-8B.
Figure 5: Identity anonymization test results on MMLU-Pro for GPT-4o, DeepSeek-V3, and Qwen3-8B.
Figure 6: Truth-tribe conflict test results on MMLU-Pro for GPT-4o, DeepSeek-V3, and Qwen3-8B.
Figure 7: Defeasible reasoning results on Defeasible…
Figure 8: Numerosity effect on in-group favoritism:…
Figure 9: Cognitive complexity effect on in-group fa…
Figure 10: Mitigation strategy effectiveness on MMLU…
Figure 11: Consistency analysis for prompt-generated…
Figure 12: Truth-Tribe Conflict test results on MMLU.
Figure 13: Truth-Tribe Conflict test results on HLE.
Figure 14: Truth-Tribe Conflict test results on TruthfulQA.
Figure 15: Truth-Tribe Conflict test results on BBH.
Figure 16: Truth-Tribe Conflict test results on BBQ.
Figure 17: Truth-Tribe Conflict test results on GPQA.
Figure 18: Temperature effect on in-group favoritism…

Original abstract

In-group favoritism refers to the phenomena of favoring members of one's in-group over out-group members and is widely observed in numerous social cooperative behaviors. Recently, in-group favoritism biases have also been identified in generative language models. However, whether the in-group favoritism exists when persona agents are faced with contradicting information (e.g., misinformation), and how to mitigate the adverse effects of in-group favoritism biases in persona agents have been understudied. To address these problems, we propose a Truth or Tribe simulation framework to study the agent cooperation within the spread of contradicting information through a triadic interaction paradigm, and conduct controlled trials to evaluate the primary moderating factors. Extensive results showcase that persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism continues to emerge in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. Furthermore, three intervention strategies--Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble--are proposed to mitigate the in-group favoritism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 'Truth or Tribe' simulation framework to examine in-group favoritism in persona-based LLM agents via a triadic interaction paradigm involving contradicting information. Controlled trials show agents accept incorrect answers from identity-similar peers at substantially higher rates than from dissimilar peers; the bias persists in defeasible reasoning tasks without absolute truth and strengthens with increasing cognitive complexity. Three mitigation strategies (Identity-Blind Instruction, Structured Counterfactual Reasoning, Heterogeneous Perspective Ensemble) are evaluated.

Significance. If the design isolates identity effects from prompt artifacts, the work would usefully extend social-psychology findings on in-group bias to multi-agent LLM systems and supply concrete, testable interventions. The empirical focus on both truth-based and defeasible contexts plus the inclusion of mitigation experiments are strengths that could inform safer deployment of cooperative AI agents.

major comments (2)
  1. [Methods] Methods section (persona construction and triadic paradigm): the manuscript does not report explicit balancing or measurement of linguistic features (sentence length, keyword overlap, valence, syntactic similarity) across in-group and out-group persona prompts. Because the central claim requires that acceptance-rate differences arise solely from the group-identity label rather than correlated prompt features that the underlying LLM may exploit, this omission is load-bearing for the validity of the favoritism measurement.
  2. [Results] Results and experimental details: sample sizes, exact statistical tests, power analysis, exclusion criteria, and per-condition effect sizes are not reported with sufficient granularity. Without these, it is impossible to evaluate whether the reported directional effects survive correction for multiple comparisons or LLM-specific response biases, directly affecting the strength of the claim that favoritism 'intensifies as cognitive complexity increases.'
minor comments (2)
  1. [Title] The title contains a subject-verb agreement error ('Prioritize' should be 'Prioritizes').
  2. Notation for the three intervention conditions is introduced in the abstract but not cross-referenced to the specific experimental conditions in the results tables or figures.
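
The balance check requested in major comment 1 could be approximated with two simple text statistics. This is a minimal sketch under stated assumptions: the tokenizer, the sentence splitter, and the function names are hypothetical stand-ins, and nothing here reflects how the authors built or measured their persona prompts.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for keyword extraction."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Keyword overlap between two prompts: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty prompts are treated as identical
    return len(ta & tb) / len(ta | tb)


def mean_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

Computing `mean_sentence_length` for every in-group and out-group persona prompt, and `jaccard` between each persona and the target agent's own description, would show whether the two sets differ on surface features as well as on the identity label.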

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the methodological and reporting requirements for validating our claims about in-group favoritism in persona agents. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

point-by-point responses
  1. Referee: [Methods] Methods section (persona construction and triadic paradigm): the manuscript does not report explicit balancing or measurement of linguistic features (sentence length, keyword overlap, valence, syntactic similarity) across in-group and out-group persona prompts. Because the central claim requires that acceptance-rate differences arise solely from the group-identity label rather than correlated prompt features that the underlying LLM may exploit, this omission is load-bearing for the validity of the favoritism measurement.

    Authors: We acknowledge that the submitted manuscript did not include explicit measurements or balancing checks for linguistic features across conditions. Our persona prompts were constructed by varying only the identity descriptors while holding factual content constant, but we agree this is insufficient to fully rule out prompt artifacts. In the revision we will add a dedicated subsection reporting average sentence length, keyword overlap (via Jaccard or embedding cosine similarity), valence scores (using standard sentiment lexicons), and syntactic similarity metrics for in-group versus out-group prompts. We will also test whether these features predict acceptance rates and, if necessary, include them as covariates in our primary models. This addition will directly address the concern that observed differences may stem from non-identity prompt characteristics. revision: yes

  2. Referee: [Results] Results and experimental details: sample sizes, exact statistical tests, power analysis, exclusion criteria, and per-condition effect sizes are not reported with sufficient granularity. Without these, it is impossible to evaluate whether the reported directional effects survive correction for multiple comparisons or LLM-specific response biases, directly affecting the strength of the claim that favoritism 'intensifies as cognitive complexity increases.'

    Authors: We agree that the current results section lacks the granularity needed for full statistical evaluation. In the revised manuscript we will report: (i) exact sample sizes (number of agents and trials per condition), (ii) the precise statistical tests employed (e.g., logistic mixed-effects regression with p-values), (iii) any corrections for multiple comparisons, (iv) a post-hoc power analysis for the key comparisons, (v) explicit exclusion criteria (e.g., malformed responses or timeout cases), and (vi) per-condition effect sizes (odds ratios or Cohen’s d). We will also add a brief discussion of potential LLM-specific response biases and how the triadic design and randomization mitigate them. These details will allow readers to assess the robustness of the finding that in-group favoritism intensifies with cognitive complexity. revision: yes
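
Two of the promised quantities, per-condition odds ratios and a multiple-comparison correction, can be sketched with the standard library alone. This is an illustration, not the authors' analysis: it swaps in a plain 2×2 odds ratio with a Wald interval and a Holm step-down adjustment in place of the mixed-effects regression mentioned in the response, and all names are hypothetical.

```python
import math


def odds_ratio_ci(a: float, b: float, c: float, d: float, z: float = 1.96):
    """Odds ratio and Wald interval for a 2x2 table.

    a = in-group accepted,  b = in-group rejected,
    c = out-group accepted, d = out-group rejected.
    """
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction avoids division by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted
```

An interval for the odds ratio that excludes 1 after adjustment would support a per-condition identity effect. Whether the effect grows with task complexity would still need a comparison across complexity levels, which this sketch does not cover.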

Circularity Check

0 steps flagged

No circularity: empirical results from controlled simulation trials

full rationale

The paper describes an empirical study that proposes a Truth or Tribe simulation framework and evaluates in-group favoritism through controlled triadic interaction trials on persona agents. All reported findings (acceptance rates of incorrect answers, effects in defeasible reasoning, mitigation via interventions) are obtained directly from experimental measurements rather than any derivation, equation, or parameter fit that reduces to the inputs by construction. No self-definitional steps, fitted predictions, or load-bearing self-citations appear in the abstract or described methodology; the central claims rest on observable simulation outcomes independent of the framework definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-based persona agents can be made to exhibit measurable social biases through prompt-based identity assignment and that the triadic setup isolates in-group effects from other model behaviors.

axioms (1)
  • domain assumption: Persona agents can be effectively simulated using generative language models with assigned identities.
    The entire framework depends on this to produce observable favoritism.

pith-pipeline@v0.9.0 · 5507 in / 1215 out tokens · 31239 ms · 2026-05-09T14:38:11.434675+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Ingroup favoritism in cooperation: a meta-analysis. Psychological bulletin, 140(6):1556. Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. 2024a. From Persona to Personalization: A Survey on Role-Pla...

  2. [2]

    Thilo Hagendorff, Sarah Fabi, and Michal Kosinski

    Bias and fairness in large language models: A survey. Computational linguistics, 50(3):1097–1179. Thilo Hagendorff, Sarah Fabi, and Michal Kosinski

  3. [3]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

    Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nature Computational Science, 3(10):833–838. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

  4. [4]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding. ArXiv: 2009.03300 [cs.CY]. Miles Hewstone, Mark Rubin, and Hazel Willis. 2002. Intergroup Bias. Annual Review of Psychology, 53(1):575–604. Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander Van Der Linden, and Jon Roozenbeek. 2024. Generative language models exhibit social identity ...

  5. [5]

    doi: 10.1073/pnas.2405460121

    Ingroup favoritism and outgroup derogation in intergenerational cooperation. Communications Psychology, 3(1):89. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. Advances in neural information processing systems, 35:22199–22213. Michal Kosinski. 2023. Theory of mind ...

  6. [6]

    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore

    Character-LLM: A Trainable Agent for Role-Playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore. Association for Computational Linguistics. Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott Johnston, Shau...

  7. [7]

    Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, pages 1009–1011, Catania International Airport Catania Italy. ACM. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, and De...

  8. [8]

    Parse question-answer format: Extract questions, ground truth answers, and all available options from the original dataset format

  9. [9]

    Extract false options: Identify all incorrect options (excluding ground truth) that are semantically coherent but factually incorrect

  10. [10]

    Generate scenarios: For each question, create three experimental scenarios:
      • Scenario 1: High-similarity peer provides false answer, low-similarity peer provides different false answer.
      • Scenario 2: High-similarity peer provides false answer, low-similarity peer provides correct answer.
      • Scenario 3: High-similarity peer provides correct answer, low...

  11. [11]

    Assign personas: Randomly select one persona from the in-group set (P_in) and one from the out-group set (P_out) for each scenario

  12. [12]

    A 34-year-old data scientist with expertise in machine learning, who emphasizes analytical thinking and systematic problem-solving

    Randomize presentation: Shuffle the order in which peer agents appear in prompts to control for position bias. For BBH datasets specifically, we parse the format where options are labeled as (A), (B), (C), etc., and reduce to exactly three options (ground truth + two randomly selected false options) when more options are available, ensuring consistent ex...
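
The construction steps in entries 8 to 12 above (option reduction, scenario generation, persona assignment, order shuffling) can be sketched as a small trial builder. This is a hedged illustration, not the authors' code: `Trial`, `reduce_options`, and `build_trials` are hypothetical names, and treating the "high-similarity peer" as the in-group persona is an assumption. Scenario 3 is left out because its description is cut off in the extracted text.

```python
import random
from dataclasses import dataclass


@dataclass
class Trial:
    question: str
    options: list[str]
    scenario: int
    # (group, persona, answer) tuples in the order shown to the target agent.
    peers: list[tuple[str, str, str]]


def reduce_options(truth: str, false_options: list[str], rng: random.Random) -> list[str]:
    """Keep the ground truth plus two randomly selected false options."""
    if len(false_options) < 2:
        raise ValueError("need at least two false options")
    kept = [truth] + rng.sample(false_options, 2)
    rng.shuffle(kept)
    return kept


def build_trials(question, truth, false_options, p_in, p_out, rng):
    """Build scenarios 1 and 2 for one question (assumes false options differ from truth)."""
    options = reduce_options(truth, false_options, rng)
    wrong = [o for o in options if o != truth]
    # scenario -> (in-group peer's answer, out-group peer's answer)
    answers = {
        1: (wrong[0], wrong[1]),  # both peers wrong, with different answers
        2: (wrong[0], truth),     # in-group wrong, out-group correct
    }
    trials = []
    for scenario, (ans_in, ans_out) in answers.items():
        peers = [
            ("in", rng.choice(p_in), ans_in),
            ("out", rng.choice(p_out), ans_out),
        ]
        rng.shuffle(peers)  # randomize presentation order against position bias
        trials.append(Trial(question, options, scenario, peers))
    return trials
```

Passing a seeded `random.Random` instance keeps the sampled options, personas, and presentation order reproducible across runs.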