Truth or Tribe: How In-group Favoritism Prioritize Facts in Persona Agents
Pith reviewed 2026-05-09 14:38 UTC · model grok-4.3
The pith
Persona agents accept incorrect answers from identity-similar peers at much higher rates than from dissimilar peers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism also emerges in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. The authors further propose three intervention strategies (Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble) to mitigate in-group favoritism.
What carries the argument
The Truth or Tribe simulation framework using a triadic interaction paradigm to examine agent cooperation amid the spread of contradicting information.
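To make the triadic setup concrete, here is a minimal sketch of how one trial's prompt might be assembled: a focal persona agent sees a question, its options, and the answers of two peers, one sharing its identity and one not. The prompt wording, the `Peer` type, and `build_triad_prompt` are hypothetical illustrations, not the paper's actual prompts; the only detail taken from the paper is that peer order is shuffled to control for position bias.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    persona: str  # identity descriptor shown to the focal agent (hypothetical format)
    answer: str   # option label the peer endorses, e.g. "B"


def build_triad_prompt(focal_persona, question, options, peers, rng):
    """Assemble one hypothetical triadic prompt: a focal agent sees two peers' answers."""
    ordered = list(peers)
    rng.shuffle(ordered)  # randomize peer order to control for position bias
    lines = [f"You are {focal_persona}.", f"Question: {question}"]
    lines += [f"({label}) {text}" for label, text in options.items()]
    for peer in ordered:
        lines.append(f"A peer who is {peer.persona} answered ({peer.answer}).")
    lines.append("Reply with a single option letter.")
    return "\n".join(lines)
```

In this sketch the in-group peer is simply the one whose descriptor matches the focal persona; a real implementation would draw both from curated persona sets.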
If this is right
- Persona agents prioritize group identity over factual accuracy when evaluating information from peers.
- In-group favoritism appears even in defeasible reasoning where objective truth is absent.
- The strength of the bias increases as the cognitive complexity of the task grows.
- Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble each reduce the observed favoritism.
Where Pith is reading between the lines
- Multi-agent systems built from persona models may systematically distort shared knowledge when identities align.
- Routine application of the tested interventions could become standard practice for any deployment involving interacting agents.
- Extending the triadic setup to include evolving identities or larger groups would test whether the bias scales.
Load-bearing premise
The chosen persona assignments and triadic interaction setup isolate in-group favoritism without being dominated by the underlying language model's biases or prompt engineering choices.
What would settle it
A controlled run in which agents accept incorrect answers from identity-similar and identity-dissimilar peers at statistically indistinguishable rates would falsify the in-group favoritism claim.
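A simple version of that check, sketched under the assumption that each trial is an independent accept/reject outcome, is a two-sided two-proportion z-test on the counts of accepted incorrect answers. The function name and arguments are hypothetical, not from the paper. Repeated trials sharing a persona or question are not truly independent, so a mixed-effects model is the stronger test; this sketch only shows the basic comparison.

```python
import math


def two_proportion_z_test(k_in, n_in, k_out, n_out):
    """Two-sided z-test of H0: equal acceptance rates for in-group vs out-group peers.

    k_* = trials where the agent accepted the peer's incorrect answer,
    n_* = trials in which that peer offered an incorrect answer.
    """
    p_in, p_out = k_in / n_in, k_out / n_out
    pooled = (k_in + k_out) / (n_in + n_out)  # rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_in + 1 / n_out))
    z = (p_in - p_out) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

Equal counts give z = 0 and p = 1, i.e. no evidence of favoritism; a large positive z indicates higher acceptance from in-group peers.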
Original abstract
In-group favoritism refers to the phenomenon of favoring members of one's in-group over out-group members and is widely observed in many forms of social cooperation. Recently, in-group favoritism biases have also been identified in generative language models. However, whether in-group favoritism arises when persona agents face contradicting information (e.g., misinformation), and how to mitigate its adverse effects in persona agents, remain understudied. To address these problems, we propose a Truth or Tribe simulation framework to study agent cooperation amid the spread of contradicting information through a triadic interaction paradigm, and we conduct controlled trials to evaluate the primary moderating factors. Extensive results show that persona agents display strong in-group favoritism, accepting incorrect answers from identity-similar peers at much higher rates than from dissimilar peers. In-group favoritism continues to emerge in defeasible reasoning contexts where no absolute truth exists, and it intensifies as cognitive complexity increases. Furthermore, we propose three intervention strategies (Identity-Blind Instruction, Structured Counterfactual Reasoning, and Heterogeneous Perspective Ensemble) to mitigate in-group favoritism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'Truth or Tribe' simulation framework to examine in-group favoritism in persona-based LLM agents via a triadic interaction paradigm involving contradicting information. Controlled trials show agents accept incorrect answers from identity-similar peers at substantially higher rates than from dissimilar peers; the bias persists in defeasible reasoning tasks without absolute truth and strengthens with increasing cognitive complexity. Three mitigation strategies (Identity-Blind Instruction, Structured Counterfactual Reasoning, Heterogeneous Perspective Ensemble) are evaluated.
Significance. If the design isolates identity effects from prompt artifacts, the work would usefully extend social-psychology findings on in-group bias to multi-agent LLM systems and supply concrete, testable interventions. The empirical focus on both truth-based and defeasible contexts plus the inclusion of mitigation experiments are strengths that could inform safer deployment of cooperative AI agents.
major comments (2)
- [Methods] Methods section (persona construction and triadic paradigm): the manuscript does not report explicit balancing or measurement of linguistic features (sentence length, keyword overlap, valence, syntactic similarity) across in-group and out-group persona prompts. Because the central claim requires that acceptance-rate differences arise solely from the group-identity label rather than correlated prompt features that the underlying LLM may exploit, this omission is load-bearing for the validity of the favoritism measurement.
- [Results] Results and experimental details: sample sizes, exact statistical tests, power analysis, exclusion criteria, and per-condition effect sizes are not reported with sufficient granularity. Without these, it is impossible to evaluate whether the reported directional effects survive correction for multiple comparisons or LLM-specific response biases, directly affecting the strength of the claim that favoritism 'intensifies as cognitive complexity increases.'
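To illustrate what the requested power analysis involves, here is a minimal sketch of the standard normal-approximation sample-size formula for comparing two proportions (two-sided test, no continuity correction). The function name and the defaults α = 0.05 and power = 0.8 are illustrative assumptions; the inputs are planned acceptance rates chosen by the analyst, not figures reported in the paper.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Trials per group needed to detect acceptance rates p1 vs p2.

    n = (z_{1-a/2} * sqrt(2 * pbar * (1 - pbar))
         + z_{1-b} * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)
```

For planned rates of 0.30 and 0.20 this gives roughly 294 trials per group; smaller gaps require sharply more trials, which is why per-condition sample sizes matter for the complexity-trend claim.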
minor comments (2)
- [Title] The title contains a subject-verb agreement error ('Prioritize' should be 'Prioritizes').
- Notation for the three intervention conditions is introduced in the abstract but not cross-referenced to the specific experimental conditions in the results tables or figures.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify the methodological and reporting requirements for validating our claims about in-group favoritism in persona agents. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
Referee: [Methods] Methods section (persona construction and triadic paradigm): the manuscript does not report explicit balancing or measurement of linguistic features (sentence length, keyword overlap, valence, syntactic similarity) across in-group and out-group persona prompts. Because the central claim requires that acceptance-rate differences arise solely from the group-identity label rather than correlated prompt features that the underlying LLM may exploit, this omission is load-bearing for the validity of the favoritism measurement.
Authors: We acknowledge that the submitted manuscript did not include explicit measurements or balancing checks for linguistic features across conditions. Our persona prompts were constructed by varying only the identity descriptors while holding factual content constant, but we agree this is insufficient to fully rule out prompt artifacts. In the revision we will add a dedicated subsection reporting average sentence length, keyword overlap (via Jaccard or embedding cosine similarity), valence scores (using standard sentiment lexicons), and syntactic similarity metrics for in-group versus out-group prompts. We will also test whether these features predict acceptance rates and, if necessary, include them as covariates in our primary models. This addition will directly address the concern that observed differences may stem from non-identity prompt characteristics. revision: yes
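A minimal sketch of two of the promised checks follows, assuming plain-text prompts: whole-token Jaccard overlap (a crude stand-in for keyword overlap, with no stopword filtering) and mean sentence length, plus a helper that compares in-group and out-group prompt sets on any such feature. All names are hypothetical; valence and syntactic-similarity metrics would need external tools and are not shown.

```python
import re


def tokens(text):
    """Lower-cased word tokens as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two prompts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_sentence_length(text):
    """Average words per sentence; assumes the text has at least one sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)


def feature_gap(in_prompts, out_prompts, feature):
    """Mean feature value of in-group prompts minus that of out-group prompts."""
    def mean(prompts):
        return sum(feature(p) for p in prompts) / len(prompts)
    return mean(in_prompts) - mean(out_prompts)
```

A gap near zero on each feature is what balancing aims for; a nonzero gap is a candidate covariate for the acceptance-rate models.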
Referee: [Results] Results and experimental details: sample sizes, exact statistical tests, power analysis, exclusion criteria, and per-condition effect sizes are not reported with sufficient granularity. Without these, it is impossible to evaluate whether the reported directional effects survive correction for multiple comparisons or LLM-specific response biases, directly affecting the strength of the claim that favoritism 'intensifies as cognitive complexity increases.'
Authors: We agree that the current results section lacks the granularity needed for full statistical evaluation. In the revised manuscript we will report: (i) exact sample sizes (number of agents and trials per condition), (ii) the precise statistical tests employed (e.g., logistic mixed-effects regression with p-values), (iii) any corrections for multiple comparisons, (iv) a post-hoc power analysis for the key comparisons, (v) explicit exclusion criteria (e.g., malformed responses or timeout cases), and (vi) per-condition effect sizes (odds ratios or Cohen’s d). We will also add a brief discussion of potential LLM-specific response biases and how the triadic design and randomization mitigate them. These details will allow readers to assess the robustness of the finding that in-group favoritism intensifies with cognitive complexity. revision: yes
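Two of the promised reporting items can be sketched in a few lines: an odds ratio with a Wald 95% confidence interval from a 2×2 table of accept/reject counts, and Holm's step-down correction for multiple comparisons. The function names and table layout are illustrative assumptions, not the authors' analysis code; the odds ratio requires all four cells to be nonzero.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio with Wald CI.

    a, b: in-group trials where the wrong answer was accepted / rejected
    c, d: out-group trials where the wrong answer was accepted / rejected
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject
```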
Circularity Check
No circularity: empirical results from controlled simulation trials
full rationale
The paper describes an empirical study that proposes a Truth or Tribe simulation framework and evaluates in-group favoritism through controlled triadic interaction trials on persona agents. All reported findings (acceptance rates of incorrect answers, effects in defeasible reasoning, mitigation via interventions) are obtained directly from experimental measurements rather than any derivation, equation, or parameter fit that reduces to the inputs by construction. No self-definitional steps, fitted predictions, or load-bearing self-citations appear in the abstract or described methodology; the central claims rest on observable simulation outcomes independent of the framework definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: persona agents can be effectively simulated using generative language models with assigned identities.
Scenario construction procedure (excerpt from the paper)
1. Parse question-answer format: Extract questions, ground-truth answers, and all available options from the original dataset format.
2. Extract false options: Identify all incorrect options (excluding the ground truth) that are semantically coherent but factually incorrect.
3. Generate scenarios: For each question, create three experimental scenarios:
   - Scenario 1: High-similarity peer provides a false answer; low-similarity peer provides a different false answer.
   - Scenario 2: High-similarity peer provides a false answer; low-similarity peer provides the correct answer.
   - Scenario 3: High-similarity peer provides the correct answer, low...
4. Assign personas: Randomly select one persona from the in-group set (P_in) and one from the out-group set (P_out) for each scenario.
5. Randomize presentation: Shuffle the order in which peer agents appear in prompts to control for position bias.
For BBH datasets specifically, we parse the format where options are labeled as (A), (B), (C), etc., and reduce to exactly three options (ground truth + two randomly selected false options) when more options are available, ensuring consistent ex...
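The construction steps above can be sketched as follows, as an illustration rather than the authors' code. `reduce_options` keeps the ground truth plus two randomly chosen false options while preserving the original label order, and `make_scenarios` builds Scenarios 1 and 2 with one in-group and one out-group persona in shuffled order. Both names and the dictionary layout are hypothetical; Scenario 3 is deliberately omitted because its description in the excerpt is truncated.

```python
import random


def reduce_options(options, truth, rng):
    """Keep the ground truth plus two random false options, in original order."""
    false_options = [o for o in options if o != truth]
    if len(false_options) <= 2:
        return list(options)
    chosen = set(rng.sample(false_options, 2))
    return [o for o in options if o == truth or o in chosen]


def make_scenarios(truth, options, personas_in, personas_out, rng):
    """Build Scenarios 1 and 2 for one question (Scenario 3 omitted: truncated spec)."""
    false_options = [o for o in options if o != truth]
    f1, f2 = rng.sample(false_options, 2)
    answer_pairs = {  # scenario id -> (in-group answer, out-group answer)
        1: (f1, f2),     # both peers wrong, with different false answers
        2: (f1, truth),  # in-group peer wrong, out-group peer correct
    }
    scenarios = []
    for sid, (ans_in, ans_out) in answer_pairs.items():
        peers = [
            {"group": "in", "persona": rng.choice(personas_in), "answer": ans_in},
            {"group": "out", "persona": rng.choice(personas_out), "answer": ans_out},
        ]
        rng.shuffle(peers)  # control for position bias
        scenarios.append({"scenario": sid, "peers": peers})
    return scenarios
```

Passing a seeded `random.Random` keeps the sampling reproducible across runs.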