arxiv: 2509.18052 · v3 · submitted 2025-09-22 · 💻 cs.CL · cs.CY

The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

Pith reviewed 2026-05-18 14:26 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords LLM societies · collective behavior · simulation validity · PIMMUR principles · methodological flaws · emergent behavior · AI agent simulation · social experiment reproduction

The pith

Most studies of collective behavior in LLM societies violate basic methodological principles, which makes their results unreliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits 39 recent studies using large language models to simulate groups of agents running social experiments. It names six common flaws in agent setup, interaction rules, memory, control, awareness, and realism under the label PIMMUR and shows that 89.7 percent of the studies break at least one. When the authors rerun five representative experiments while following the principles, the group-level patterns that had been reported often disappear or reverse. A reader would care because these simulations are presented as tools for understanding real human societies, yet the findings may stem from how the models are instructed rather than from any social process.

Core claim

The authors claim that valid simulation of collective human behavior with LLMs requires adherence to the PIMMUR principles on agent profiles, interaction, memory, control, unawareness, and realism. Their audit establishes that 89.7 percent of published studies violate at least one principle, that frontier models recognize the original social experiment in 50.8 percent of prompts (which conflicts with the unawareness principle), and that 61 percent of prompts exert excessive control. Reproduction of five experiments under PIMMUR compliance shows that reported collective phenomena commonly vanish or reverse, indicating that many observed behaviors are artifacts of the simulation method rather than genuine social dynamics.

What carries the argument

The PIMMUR principles: six requirements covering agent profiles, interaction design, memory handling, experimental control, agent unawareness of the simulation, and overall realism. Together they diagnose and prevent invalid results in LLM-based social simulations.
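
One way to make the six principles operational is a per-study checklist. The sketch below is illustrative only: the `PimmurAudit` record, its field names, and the toy studies are assumptions introduced here, not the paper's audit schema or data. It shows how a "violates at least one principle" rate, the kind of figure reported as 89.7 percent, would be computed from such records.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PimmurAudit:
    """One audited study (hypothetical schema); True means the principle is satisfied."""
    study: str
    profile: bool
    interaction: bool
    memory: bool
    control: bool
    unawareness: bool
    realism: bool

    def violations(self) -> list[str]:
        # Collect the names of every principle marked as not satisfied.
        return [
            f.name
            for f in fields(self)
            if f.name != "study" and not getattr(self, f.name)
        ]


def violation_rate(audits: list[PimmurAudit]) -> float:
    """Share of studies that violate at least one principle."""
    if not audits:
        raise ValueError("need at least one audited study")
    return sum(1 for a in audits if a.violations()) / len(audits)


# Toy records invented for illustration; they are NOT the paper's 39 studies.
toy_audits = [
    PimmurAudit("toy-study-A", True, True, True, True, True, True),
    PimmurAudit("toy-study-B", True, False, True, False, True, True),
    PimmurAudit("toy-study-C", True, True, True, True, False, True),
]

toy_rate = violation_rate(toy_audits)
```

Recording each principle separately, rather than a single pass/fail flag, keeps the per-principle breakdown available when the headline rate is questioned.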

If this is right

  • Many reported emergent behaviors in LLM societies are methodological artifacts rather than robust social dynamics.
  • Excessive prompt control in 61 percent of cases predetermines outcomes and undermines claims of genuine collective behavior.
  • Frontier LLMs correctly identify the underlying social experiment in about half (50.8 percent) of simulation prompts, which compromises agent unawareness.
  • Current AI simulations may primarily reflect model-specific biases instead of universal human social behaviors.
  • Adoption of PIMMUR would require re-examination of existing findings in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardizing PIMMUR checks could reduce the number of non-replicable claims about LLM societies.
  • Automated prompt auditing tools might help enforce the principles at scale in future studies.
  • The findings suggest caution when treating LLM outputs as direct substitutes for human subjects in social research.
  • Extending similar audits to other simulation domains using LLMs could reveal parallel methodological issues.

Load-bearing premise

The 39 selected studies form a representative sample of the literature and the five reproduced experiments were implemented with enough fidelity to the originals to isolate the effect of PIMMUR violations.

What would settle it

An independent audit and reproduction of ten or more further studies from the same literature, run with strict PIMMUR compliance. The claim is supported if the reported phenomena vanish or reverse at a similarly high rate, and weakened if they persist.

Figures

Figures reproduced from arXiv: 2509.18052 by Hao Zhu, Jen-tse Huang, Jiaxu Zhou, Maarten Sap, Man Ho Lam, Wenxuan Wang, Xintao Wang, Xuhui Zhou.

Figure 1: The PIMMUR principles. The first three (PIM) focus on micro-level agent designs, while the latter three (MUR) focus on macro-level experiment designs.
    Adjacent text (Introduction): Large Language Models (LLMs) have rapidly advanced in their reasoning (Huang et al., 2025a), communication (Tran et al., 2025), and coordination capabilities (Agashe et al., 2025), sparking growing interest in their potential applications with…
Figure 2: An illustration of how the three experimenter visibility effects interact.
Figure 4: Implementing PIMMUR principles, LLM agents show less balanced social relationships.
    Adjacent text: Cisneros-Velarde (2024) explores whether LLM-mediated social relationships conform to Heider’s balance theory (Heider, 1946). According to the theory, a triad achieves balance only under three conditions: (1) all three individuals are mutual friends, (2) two pairs are enemies while the remaining pair are friends (capturing t…
Figure 6: Probability of LLMs flipping their answers, grouped by different levels of confidence.
Figure 7: Log-log plot of the complementary cumulative distribution function (CCDF) of degree k, with linear fits.
    Adjacent text: Asking LLMs to rely on explicit degree information departs from realistic social dynamics. To address these issues, we redesign the experiment. Degree information is withheld; instead, agents decide whom to befriend through one-to-one conversations, forming impressions of others. This design better r…
Figure 8: The [prompt] is replaced by the prompts that existing studies use.
Figure 9: The [prompt] is replaced by the prompts that existing studies use.
Figure 10: This instruction is put at the beginning of every prompt.
Figure 11: Different actions are selected for different simulations.
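
Figure 7 plots the complementary cumulative distribution function (CCDF) of degree on log-log axes with a linear fit. A minimal sketch of that computation follows, assuming an ordinary least-squares fit in log10 space. The paper's exact fitting procedure is not given on this page, and the function names and toy degree list are hypothetical.

```python
import math
from collections import Counter


def ccdf(degrees):
    """Return sorted (k, P(K >= k)) pairs for an observed degree sequence."""
    if not degrees:
        raise ValueError("need at least one degree")
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # number of nodes with degree >= current k
    points = []
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def loglog_slope(points):
    """Least-squares slope of log10(CCDF) against log10(k), skipping k <= 0."""
    logs = [(math.log10(k), math.log10(p)) for k, p in points if k > 0]
    if len(logs) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in logs) / len(logs)
    mean_y = sum(y for _, y in logs) / len(logs)
    sxx = sum((x - mean_x) ** 2 for x, _ in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in logs)
    return sxy / sxx


# Toy degree sequence invented for illustration; its CCDF is exactly 1/k.
toy_degrees = [1, 1, 1, 1, 2, 2, 4, 8]
toy_points = ccdf(toy_degrees)
toy_slope = loglog_slope(toy_points)
```

A straight-line fit on log-log axes is only a rough diagnostic of a heavy-tailed degree distribution; it does not by itself establish a power law.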
Original abstract

Large language models (LLMs) are increasingly deployed to simulate human collective behaviors, yet the methodological rigor of these "AI societies" remains under-explored. Through a systematic audit of 39 recent studies, we identify six pervasive flaws, spanning agent profiles, interaction, memory, control, unawareness, and realism (PIMMUR). Our analysis reveals that 89.7% of studies violate at least one principle, undermining simulation validity. We demonstrate that frontier LLMs correctly identify the underlying social experiment in 50.8% of cases, while 61.0% of prompts exert excessive control that pre-determines outcomes. By reproducing five representative experiments (e.g., telephone game), we show that reported collective phenomena often vanish or reverse when PIMMUR principles are enforced, suggesting that many "emergent" behaviors are methodological artifacts rather than genuine social dynamics. Our findings suggest that current AI simulations may capture model-specific biases rather than universal human social behaviors, raising critical concerns about the use of LLMs as scientific proxies for human society.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the PIMMUR principles (Profile, Interaction, Memory, Minimal control, Unawareness, Realism) for valid LLM simulations of collective human behavior. A systematic audit of 39 recent studies finds that 89.7% violate at least one principle. Frontier LLMs correctly identify the underlying social experiment in 50.8% of cases, which undermines agent unawareness, and 61.0% of prompts exert excessive control. Reproductions of five representative experiments (e.g., telephone game) show that reported collective phenomena often vanish or reverse when PIMMUR principles are enforced, suggesting many emergent behaviors are methodological artifacts rather than genuine social dynamics.

Significance. If the central findings hold, the work would be significant for research on LLM-based AI societies by providing empirical evidence that many reported collective behaviors may be artifacts of flawed prompting and design rather than robust social phenomena. The systematic audit combined with targeted reproductions offers a concrete way to test validity and could encourage adoption of stricter methodological standards. Strengths include the scale of the audit (39 studies) and the use of fresh reproductions to isolate the effect of PIMMUR compliance.

major comments (2)
  1. Methods (Study Selection): The criteria used to select the 39 audited studies are insufficiently detailed, and no full list or inclusion/exclusion protocol is provided. This is load-bearing for the central claim because the reported 89.7% violation rate and its generalization to the broader literature depend on the sample being representative; without explicit selection details, selection bias cannot be excluded.
  2. Reproduction Experiments: Limited information is given on the exact implementation protocols, parameter settings, and fidelity checks for the five reproduced experiments. To support the claim that collective phenomena vanish or reverse specifically due to PIMMUR enforcement (rather than unintended prompt or setup changes), the manuscript must demonstrate that the reproductions match the originals except for the controlled principle-compliant modifications.
minor comments (2)
  1. Provide more explicit operational definitions and measurable criteria for each of the six PIMMUR principles to improve reproducibility of the violation audit.
  2. Ensure that figures reporting violation rates and identification accuracies include exact counts, sample sizes, and any confidence intervals for clarity.
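
Minor comment 2 can be illustrated with the audit's headline number. The sketch below is not from the paper: it finds which integer counts out of 39 round to the reported 89.7% and attaches a 95% Wilson score interval. The helper names are hypothetical, and the choice of the Wilson interval is an assumption made here.

```python
import math


def implied_counts(n, reported_pct, decimals=1):
    """Integer counts k in 0..n whose percentage rounds to the reported value."""
    return [
        k
        for k in range(n + 1)
        if abs(round(100 * k / n, decimals) - reported_pct) < 1e-9
    ]


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# The sample size (39) and the rate (89.7%) are the figures quoted in the report.
counts = implied_counts(39, 89.7)
low, high = wilson_interval(counts[0], 39)
```

With 39 studies, 35 is the only count consistent with 89.7%, and the interval runs from roughly 76% to 96%. That width at this sample size is why counts and intervals are worth reporting next to the percentage.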

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving transparency in our manuscript. We address each major comment below and have incorporated revisions to strengthen the methodological details.

  1. Referee: Methods (Study Selection): The criteria used to select the 39 audited studies are insufficiently detailed, and no full list or inclusion/exclusion protocol is provided. This is load-bearing for the central claim because the reported 89.7% violation rate and its generalization to the broader literature depend on the sample being representative; without explicit selection details, selection bias cannot be excluded.

    Authors: We agree that explicit selection details are necessary to support the generalizability of the 89.7% violation rate. In the revised manuscript, we will expand the Methods section to include the full search protocol (databases, keywords, date range), precise inclusion/exclusion criteria, and a complete list of the 39 studies with bibliographic references. This will be presented in a new table or supplementary appendix to allow readers to evaluate potential selection bias.
    Revision: yes

  2. Referee: Reproduction Experiments: Limited information is given on the exact implementation protocols, parameter settings, and fidelity checks for the five reproduced experiments. To support the claim that collective phenomena vanish or reverse specifically due to PIMMUR enforcement (rather than unintended prompt or setup changes), the manuscript must demonstrate that the reproductions match the originals except for the controlled principle-compliant modifications.

    Authors: We concur that additional implementation details are required to isolate the effects of PIMMUR compliance. The revised manuscript will include expanded descriptions of the reproduction protocols, specifying model versions, temperature and other hyperparameters, exact prompt templates, and fidelity verification procedures (e.g., side-by-side comparisons of original and modified setups). We will also release the full reproduction code, prompts, and output logs via a public repository to enable verification that modifications were limited to enforcing the relevant PIMMUR principles.
    Revision: yes
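
A side-by-side comparison of original and modified setups could be as simple as a line diff between the two prompt templates. The following sketch uses Python's standard `difflib`. The two prompts are toy strings invented for illustration and are not taken from any audited study, and the helper names are hypothetical.

```python
import difflib


def prompt_diff(original, modified):
    """Unified diff between two prompt templates, as a list of lines."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            modified.splitlines(),
            fromfile="original_prompt",
            tofile="modified_prompt",
            lineterm="",
        )
    )


def changed_lines(original, modified):
    """Only the added (+) and removed (-) lines, without the two file headers."""
    body = prompt_diff(original, modified)[2:]  # skip the '---' / '+++' headers
    return [line for line in body if line.startswith(("+", "-"))]


# Toy prompts invented for illustration only.
toy_original = (
    "You are Alice.\n"
    "You know every agent's friend count.\n"
    "Choose a friend."
)
toy_modified = "You are Alice.\nChoose a friend."

toy_changes = changed_lines(toy_original, toy_modified)
```

Publishing such diffs next to each reproduction would let a reader confirm that the only changes are the ones attributed to a specific principle.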

Circularity Check

0 steps flagged

No significant circularity; claims rest on external audit and reproductions

The paper proposes the PIMMUR principles from observed flaws in LLM collective behavior studies, then applies them via a systematic audit of 39 external papers and fresh reproductions of five experiments. No equations, fitted parameters, or predictions reduce to the paper's own inputs by construction. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the derivation. The 89.7% violation rate and reversal findings are measured against independently selected literature and re-implemented setups, rendering the chain self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces the PIMMUR acronym and principles as a new organizing lens without prior independent validation; the audit depends on the representativeness of the 39-study sample and on the assumption that violations causally explain the observed artifacts.

axioms (1)
  • domain assumption: The 39 studies constitute a representative sample of recent LLM collective-behavior research.
    The violation percentage and generalizability rest on this sampling choice.
invented entities (1)
  • PIMMUR principles (no independent evidence)
    purpose: Structured checklist for validity in LLM society simulations
    Newly coined framework; no independent evidence outside this paper is provided.

pith-pipeline@v0.9.0 · 5736 in / 1320 out tokens · 40874 ms · 2026-05-18T14:26:26.071344+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

    physics.soc-ph 2026-05 accept novelty 6.0

    Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Llm-coordination: Evaluating and analyzing multi-agent coordination abilities in large language models

    Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: Evaluating and analyzing multi-agent coordination abilities in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 8038–8057,

  2. [2]

    Llm social simulations are a promising research method

    Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C Kozlowski, Bernard Koch, James Evans, Erik Brynjolfsson, and Michael Bernstein. Llm social simulations are a promising research method. arXiv preprint arXiv:2504.02234,

  3. [3]

    Introducing claude 4

    Anthropic. Introducing claude 4. Anthropic Blog Mar 22 2025,

  4. [4]

    Mind the (belief) gap: Group identity in the world of llms

    Angana Borah, Marwa Houalla, and Rada Mihalcea. Mind the (belief) gap: Group identity in the world of llms. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 8441–18463,

  5. [5]

    ELEPHANT: Measuring and understanding social sycophancy in LLMs

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Social sycophancy: A broader understanding of llm sycophancy. arXiv preprint arXiv:2505.13995,

  6. [6]

    Herd behavior: Investigating peer influence in llm-based multi-agent systems

    Young-Min Cho, Sharath Chandra Guntuku, and Lyle Ungar. Herd behavior: Investigating peer influence in llm-based multi-agent systems. arXiv preprint arXiv:2505.21588,

  7. [7]

    Large language models can achieve social balance

    Pedro Cisneros-Velarde. Large language models can achieve social balance. arXiv preprint arXiv:2410.04054,

  8. [8]

    Emergence of scale-free networks in social interactions among large language models

    Giordano De Marzo, Luciano Pietronero, and David Garcia. Emergence of scale-free networks in social interactions among large language models. arXiv preprint arXiv:2312.06619,

  9. [9]

    Syceval: Evaluating llm sycophancy

    Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177,

  10. [10]

    S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents

    Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984,

  11. [11]

    Simcse: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910,

  12. [12]

    The power of stories: Narrative priming shapes how llm agents collaborate and compete

    Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A Selby, and Sebastian J Vollmer. The power of stories: Narrative priming shapes how llm agents collaborate and compete. arXiv preprint arXiv:2505.03961,

  13. [13]

    Large language model driven agents for simulating echo chamber formation

    Chenhao Gu, Ling Luo, Zainab Razia Zaidi, and Shanika Karunasekera. Large language model driven agents for simulating echo chamber formation. arXiv preprint arXiv:2502.18138,

  14. [14]

    Diversity of thought elicits stronger reasoning capabilities in multi-agent debate frameworks

    Mahmood Hegazy. Diversity of thought elicits stronger reasoning capabilities in multi-agent debate frameworks. arXiv preprint arXiv:2410.12853,

  15. [15]

    War and peace (waragent): Large language model-based multi-agent simulation of world wars

    Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227,

  16. [16]

    Apathetic or empathetic? evaluating llms’ emotional alignments with humans

    Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, and Michael R Lyu. Apathetic or empathetic? evaluating llms’ emotional alignments with humans. Advances in Neural Information Processing Systems, 37:97053–97087, 2024a. Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang J...

  17. [17]

    Gemini 2.5: Our most intelligent ai model

    Koray Kavukcuoglu. Gemini 2.5: Our most intelligent ai model. Google Blog Mar 25 2025, URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/.

  18. [18]

    Do large language models solve the problems of agent-based modeling? a critical review of generative social simulations

    Maik Larooij and Petter Törnberg. Do large language models solve the problems of agent-based modeling? a critical review of generative social simulations. arXiv preprint arXiv:2504.03274,

  19. [19]

    Exploring social desirability response bias in large language models: Evidence from gpt-4 simulations

    Sanguk Lee, Kai-Qi Yang, Tai-Quan Peng, Ruth Heo, and Hui Liu. Exploring social desirability response bias in large language models: Evidence from gpt-4 simulations. arXiv preprint arXiv:2410.15442,

  20. [20]

    Curse of knowledge: When complex evaluation context benefits yet biases llm judges

    Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, and Deqing Yang. Curse of knowledge: When complex evaluation context benefits yet biases llm judges. arXiv preprint arXiv:2509.03419, 2025a. Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, and Maarten Sap. Big5-chat: Shaping llm personalities through training on...

  21. [21]

    Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs

    Yingjie Li, Yun Luo, Xiaotian Xie, and Yue Zhang. Task calibration: Calibrating large language models on inference tasks. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6937–6951, 2025c. Yuxuan Li, Aoi Naito, and Hirokazu Shirado. Assessing collective reasoning in multi-agent llms via hidden profile tasks. arXiv preprint arXiv:2...

  22. [22]

    The llama 4 herd: The beginning of a new era of natively multimodal ai innovation

    Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. Meta Blog Apr 5 2025,

  23. [23]

    Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation

    Xinyi Mou, Zhongyu Wei, and Xuan-Jing Huang. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. In Findings of the Association for Computational Linguistics ACL 2024, pp. 4789–4809,

  24. [24]

    Agentsense: Benchmarking social intelligence of language agents through interactive scenarios

    Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, and Zhongyu Wei. Agentsense: Benchmarking social intelligence of language agents through interactive scenarios. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Ling...

  25. [25]

    Large language models often know when they are being evaluated

    Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated. arXiv preprint arXiv:2505.23836,

  26. [26]

    Probing evaluation awareness of language models

    Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, and Felix Hofstätter. Probing evaluation awareness of language models. arXiv preprint arXiv:2507.01786,

  27. [27]

    Can generative agent-based modeling replicate the friendship paradox in social media simulations?

    Gian Marco Orlando, Valerio La Gatta, Diego Russo, and Vincenzo Moscato. Can generative agent-based modeling replicate the friendship paradox in social media simulations? In Proceedings of the 17th ACM Web Science Conference 2025, pp. 510–515,

  28. [28]

    AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

    Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691,

  29. [29]

    Consensagent: Towards efficient and effective consensus in multi-agent llm interactions through sycophancy mitigation

    Priya Pitre, Naren Ramakrishnan, and Xuan Wang. Consensagent: Towards efficient and effective consensus in multi-agent llm interactions through sycophancy mitigation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22112–22133,

  30. [30]

    Risk analysis techniques for governed llm-based multi-agent systems

    Alistair Reid, Simon O’Callaghan, Liam Carroll, and Tiberio Caetano. Risk analysis techniques for governed llm-based multi-agent systems. arXiv preprint arXiv:2508.05687,

  31. [31]

    A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927,

  32. [32]

    The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. The prompt report: a systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608,

  33. [33]

    Spurious Rewards: Rethinking Training Signals in RLVR

    Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947,

  34. [34]

    “you are grounded!”: Latent name artifacts in pre-trained language models

    Vered Shwartz, Rachel Rudinger, and Oyvind Tafjord. “you are grounded!”: Latent name artifacts in pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6850–6861,

  35. [35]

    Llms can’t handle peer pressure: Crumbling under multi-agent social interactions

    Maojia Song, Tej Deep Pala, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, and Soujanya Poria. Llms can’t handle peer pressure: Crumbling under multi-agent social interactions. arXiv preprint arXiv:2508.18321,

  36. [36]

    Gensim: A general social simulation platform with large language model based agents

    Jiakai Tang, Heyang Gao, Xuchen Pan, Lei Wang, Haoran Tan, Dawei Gao, Yushuo Chen, Xu Chen, Yankai Lin, Yaliang Li, et al. Gensim: A general social simulation platform with large language model based agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techno...

  37. [37]

    Towards Operational Validation of LLM-Agent Social Simulations: A Replicated Study of a Reddit-like Technology Forum

    Aleksandar Tomašević, Darja Cvetković, Sara Major, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Boris Stupovski, Dušan Vudragović, Aleksandar Bogojević, and Marija Mitrović Dankulov. Operational validation of large-language-model agent social simulation: Evidence from voat v/technology. arXiv preprint arXiv:2508.21740,

  38. [38]

    Simulation system towards solving societal-scale manipulation

    Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri Krishnakumar, Dan Zhao, Zachary Yang, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Andreea Musulan, et al. Simulation system towards solving societal-scale manipulation. In NeurIPS 2024 Workshop: Socially Responsible Language Modelling Research (SoLaR),

  39. [39]

    Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322,

  40. [40]

    Large language models fail on trivial alterations to theory-of-mind tasks

    Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399,

  41. [41]

    Llm-based human simulations have not yet been reliable

    Qian Wang, Jiaying Wu, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, and Bingsheng He. What limits llm-based human simulation: Llms or our design? arXiv preprint arXiv:2501.08579, 2025a. Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. Incharacter: Evaluating personality fidelity in...

  42. [42]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  43. [43]

    Oasis: Open agents social interaction simulations on one million agents

    Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Martin Ma, Bowen Dong, Prateek Gupta, et al. Oasis: Open agents social interaction simulations on one million agents. In NeurIPS 2024 Workshop on Open-World Agents,

  44. [44]

    Twinmarket: A scalable behavioral and social simulation for financial markets

    YANG Yuzhe, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets. In ICLR 2026 Workshop: Advances in Financial AI: Opportunities, Innovations, and Responsible AI,

  45. [45]

    Exploring collaboration mechanisms for llm agents: A social psychology view

    Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for llm agents: A social psychology view. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14544–14607, 2024a. Xinnong Zhang, Jiayu Lin, Libo Sun, Weihong Qi, Yihang Yang, Yue...

  46. [46]

    Is this the real life? is this just fantasy? the misleading success of simulating social interactions with llms

    Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo Kim, and Maarten Sap. Is this the real life? is this just fantasy? the misleading success of simulating social interactions with llms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21692–21714,

  47. [47]

    Sotopia-s4: a user-friendly system for flexible, customizable, and large-scale social simulation

    Xuhui Zhou, Zhe Su, Sophie Feng, Jiaxu Zhou, Jen-tse Huang, Hsien-Te Kao, Spencer Lynch, Svitlana Volkova, Tongshuang Wu, Anita Woolley, et al. Sotopia-s4: a user-friendly system for flexible, customizable, and large-scale social simulation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational L...