
arXiv: 2605.18890 · v1 · pith:7Y6W6P33new · submitted 2026-05-17 · ⚛️ physics.soc-ph · cs.AI · cs.CY · cs.MA

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

Pith reviewed 2026-05-20 13:35 UTC · model grok-4.3

classification ⚛️ physics.soc-ph · cs.AI · cs.CY · cs.MA
keywords LLM social simulations, robustness audits, generative agents, agent-based modeling, Prisoner's Dilemma, echo chambers, polarization, cooperation

The pith

Scientific claims from LLM social simulations must be limited by the robustness audits performed to support them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative agents allow detailed simulations of social processes such as cooperation and polarization, yet the many design choices required for these agents create pathways for small setup differences to produce large shifts in collective behavior. The paper shows through repeated Prisoner's Dilemma and echo-chamber case studies that changes in persona wording, instruction framing, network homophily, and hub assignment can move key metrics by tens of percentage points, with the size of the shift varying sharply across model families. Because these sensitivities are not uniform, the authors argue that any explanatory claim, intervention test, or policy suggestion drawn from such simulations is only as credible as the specific robustness checks that accompany it. They introduce TRAILS, a taxonomy organized at agent, interaction, and system levels, to make systematic auditing a required step rather than an optional extra.

Core claim

Scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them, since minor perturbations in agent specification, interaction protocols, and environment design can cascade through repeated interactions to alter macro-level outcomes such as cooperation rates or polarization metrics.

What carries the argument

TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a three-level structure that organizes checks at the agent (micro), interaction (meso), and system (macro) layers of simulation design.
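
As a rough illustration of how the three TRAILS levels could be carried into an audit plan, here is a minimal Python sketch. The `Level` and `AuditPlan` names are hypothetical, and the assignment of individual perturbations to levels is an assumption made for this example; the paper's own placement may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The three TRAILS levels named in the paper."""
    AGENT = "micro"
    INTERACTION = "meso"
    SYSTEM = "macro"


@dataclass
class AuditPlan:
    """Collects the perturbations a study commits to testing, grouped by level."""
    checks: dict = field(default_factory=lambda: {level: [] for level in Level})

    def add(self, level: Level, perturbation: str) -> None:
        self.checks[level].append(perturbation)

    def uncovered_levels(self) -> list:
        """Return the levels that have no planned check yet."""
        return [level for level, items in self.checks.items() if not items]


plan = AuditPlan()
# Assumed placement of perturbations, for illustration only.
plan.add(Level.AGENT, "persona format")
plan.add(Level.SYSTEM, "network homophily")
print(plan.uncovered_levels())  # the interaction level has no check yet
```

A structure like this makes a missing level visible before any claim is written up, which is the spirit of treating the audit as a required step.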

If this is right

  • Any statement about social mechanisms such as cooperation or norm formation must be accompanied by evidence that the result holds under reasonable variations in agent persona format and game-instruction wording.
  • Network-based findings on polarization or echo chambers require explicit tests of homophily levels and hub assignment before the results can be attributed to the modeled social process.
  • Robustness is not a general property of LLM agents but must be measured separately for each claim and each model family.
  • Simulations intended to evaluate interventions or guide decisions need documented sensitivity analysis at all three levels of the TRAILS taxonomy.
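
A per-claim check of the kind these points call for could be sketched as follows. This is a minimal example under stated assumptions: it uses a permutation test on the difference in means, not the Mann–Whitney U test reported in the paper's figure captions, and the per-seed cooperation rates are synthetic values invented for illustration. The function names are hypothetical.

```python
import random
from statistics import mean


def gap_pp(a, b):
    """Difference in mean cooperation rate, in percentage points."""
    return 100.0 * (mean(a) - mean(b))


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic per-seed cooperation rates, invented purely for illustration.
condition_a = [0.9, 1.0, 0.8, 1.0, 0.9, 1.0]
condition_b = [0.2, 0.1, 0.3, 0.0, 0.2, 0.1]
print(f"gap: {gap_pp(condition_a, condition_b):.1f} pp")
print(f"p-value: {permutation_p_value(condition_a, condition_b):.4f}")
```

Running the same comparison separately for each model family, rather than pooling, matches the point that robustness has to be measured per claim and per model.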

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensitivity pattern could affect LLM agent use in non-social domains such as market simulations or organizational modeling.
  • Standardized robustness benchmarks across model families would let researchers choose simulation backbones that minimize implementation artifacts.
  • Longer simulation runs or different memory architectures might either amplify or dampen the observed sensitivities, offering a direct next test.

Load-bearing premise

The specific perturbations tested in the case studies are representative of the design choices that researchers typically make when building LLM social simulations.

What would settle it

A broad survey of published LLM social simulation studies that finds reported outcomes remain stable when the same persona, framing, and network perturbations are applied would undermine the claim that robustness audits are generally required.

Figures

Figures reproduced from arXiv: 2605.18890 by Ding Chen, Emilio Ferrara, Jinyi Ye, Lei Cao.

Figure 1: The butterfly effect in LLM social simulations. Two persona prompts that differ only in surface format while preserving the same content produce a 76-percentage-point gap in cooperation rate in a 10-round Prisoner’s Dilemma (gpt-5.2, N = 30 seeds per condition; two-sided Mann–Whitney U, p < 0.001). We argue that LLM social simulations should not support claims stronger than their robustness audits can jus…
Figure 2: Effect of persona format on Prisoner’s Dilemma outcomes. Results are shown for the single-agent setting (left) and two-agent setting (right). Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across persona formats (p < .05, two-sided Mann–Whitney U test); gray cells indicate non-significant comparisons. Bar plots show run-level distributions, with error …
Figure 3: Effects of network homophily and hub assignment on echo-chamber outcomes. Left panels show example input networks with increasing stance homophily, operationalized as higher initial network assortativity. Right panels show example networks with the same degree sequence but different hub assignments, where high-degree nodes are occupied by anti-stance, pro-stance, mixed, or randomly assigned agents. Boxplot…
Figure 4: Repeated Prisoner’s Dilemma case-study design. We hold the underlying game fixed across conditions: agents play a 10-round repeated Prisoner’s Dilemma with the same payoff matrix, using 30 independent simulations per condition. We then vary three prompt dimensions (persona format, game-instruction framing, and memory format) and run each variant in two interaction modes: a single-agent mode, where one LLM ag…
Figure 5: Echo-chamber case-study design. We simulate 100 LLM agents discussing whether advanced AI systems should be regulated on a fixed social network. Each agent has a persona, follower count, and frozen stance toward AI regulation. In each round, activated agents see recent posts from their direct neighbors only and choose one action: POST, REPOST, REPLY, or SILENT. We vary five architectural perturbations: inpu…
Figure 6: Effect of game-instruction framing on Prisoner’s Dilemma outcomes. Results are shown for the single-agent setting (left) and two-agent setting (right). The heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across instruction framings (p < .05, two-sided Mann–Whitney U test), while gray cells indicate non-significant comparisons. Bar plots show run-level d…
Figure 7: Effect of memory representation on Prisoner’s Dilemma outcomes. Results are shown for the single-agent setting (left) and two-agent setting (right). Memory conditions vary the history format (table vs. narrative) and whether summary statistics are included. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate (p < .05, two-sided Mann–Whitney U test), while …
Figure 8: Effects of activation probability, memory window, and recommendation feed size on echo-chamber outcomes. Boxplots show the effects of varying activation probability, memory window, and recommendation feed size while holding the input network fixed. The top row reports final stance assortativity and the bottom row reports weighted same-group edge ratio, both computed on the simulated interaction network. Si…
Figure 9: Persona-format effects in the single-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate for each persona format, opponent policy, and model.
Figure 10: Pairwise persona-format differences in the single-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across persona formats.
Figure 11: Persona-format effects in the two-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate when both LLM agents use the same persona format.
Figure 12: Pairwise persona-format differences in the two-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across persona formats.
Figure 13: Game-instruction framing effects in the single-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate for each instruction framing, opponent policy, and model.
Figure 14: Pairwise game-instruction framing differences in the single-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across instruction framings.
Figure 15: Game-instruction framing effects in the two-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate when both LLM agents receive the same game framing.
Figure 16: Pairwise game-instruction framing differences in the two-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across instruction framings.
Figure 17: Memory-representation effects in the single-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate for each memory condition, opponent policy, and model.
Figure 18: Pairwise memory-representation differences in the single-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across memory conditions.
Figure 19: Memory-representation effects in the two-agent Prisoner’s Dilemma across four models. Bar plots show average payoff and cooperation rate when both LLM agents use the same memory representation.
Figure 20: Pairwise memory-representation differences in the two-agent Prisoner’s Dilemma across four models. Heatmaps report statistically significant pairwise differences in average payoff and cooperation rate across memory conditions.
Figure 21: Cross-model results for initial network homophily. Boxplots compare final stance assortativity and weighted same-group edge ratio across input networks with different initial stance assortativity levels. Results are shown separately for gpt-5.2, claude-haiku-4-5, gemini-2.5-flash, and deepseek-v3.
Figure 22: Cross-model results for hub assignment. Boxplots compare final stance assortativity and weighted same-group edge ratio when the highest-degree nodes are assigned to anti-regulation, pro-regulation, mixed, or random agents. Results are shown separately for each model.
Figure 23: Cross-model results for activation probability. Boxplots compare final stance assortativity and weighted same-group edge ratio under different agent activation probabilities. Results are shown separately for each model.
Figure 24: Cross-model results for memory window. Boxplots compare final stance assortativity and weighted same-group edge ratio under different memory-window lengths. Results are shown separately for each model.
Figure 25: Cross-model results for recommendation feed size. Boxplots compare final stance assortativity and weighted same-group edge ratio when agents are exposed to different numbers of recent neighbor posts. Results are shown separately for each model.
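
The echo-chamber figures report two metrics, stance assortativity and weighted same-group edge ratio. The sketch below shows one plausible way to compute both on a weighted undirected interaction network. It assumes Newman's categorical assortativity coefficient and a simple weight-share definition of the edge ratio; the paper's exact definitions are not given here, and the function names and toy graph are hypothetical.

```python
from collections import defaultdict


def same_group_edge_ratio(edges, stance):
    """Share of total edge weight on edges whose endpoints share a stance."""
    total = sum(w for _, _, w in edges)
    same = sum(w for u, v, w in edges if stance[u] == stance[v])
    return same / total


def stance_assortativity(edges, stance):
    """Newman's categorical assortativity on an undirected weighted graph.

    Undefined (division by zero) when every node has the same stance.
    """
    mix = defaultdict(float)  # mix[(s, t)]: edge-end weight from stance s to stance t
    total = 0.0
    for u, v, w in edges:
        # Count each undirected edge in both directions so the matrix is symmetric.
        mix[(stance[u], stance[v])] += w
        mix[(stance[v], stance[u])] += w
        total += 2 * w
    groups = {s for pair in list(mix) for s in pair}
    trace = sum(mix[(s, s)] for s in groups) / total
    marginals = {s: sum(mix[(s, t)] for t in groups) / total for s in groups}
    chance = sum(a * a for a in marginals.values())
    return (trace - chance) / (1 - chance)


# Toy network: two same-stance pairs joined by one cross-stance edge.
stance = {0: "pro", 1: "pro", 2: "anti", 3: "anti"}
edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 1.0)]  # (node, node, weight)
print(stance_assortativity(edges, stance))
print(same_group_edge_ratio(edges, stance))
```

Computing both metrics on the input network as well as the simulated one helps separate what the agents produced from what the initial homophily or hub assignment already built in.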
Original abstract

The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions.
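
To make the repeated Prisoner's Dilemma setup concrete, here is a minimal harness for a single-agent mode in which one agent plays a fixed opponent policy for 10 rounds. Everything here is a sketch under assumptions: `stub_agent` is a hypothetical stand-in for an LLM call, the payoff values are the conventional textbook ones rather than the paper's (which are not given here), and the policy implementations are my own reading of the policy names shown in the figures.

```python
import random

# Conventional payoffs (T=5, R=3, P=1, S=0); assumed, not taken from the paper.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def tit_for_tat(history, rng):
    # history holds (agent_move, opponent_move) pairs; copy the agent's last move.
    return history[-1][0] if history else "C"


def always_defect(history, rng):
    return "D"


def random_policy(history, rng):
    return rng.choice("CD")


def stub_agent(history, rng):
    """Hypothetical stand-in for an LLM call: cooperate until defected against."""
    return "D" if any(opp == "D" for _, opp in history) else "C"


def play(agent, opponent, rounds=10, seed=0):
    """Return (cooperation rate, average payoff per round) for the agent."""
    rng = random.Random(seed)
    history, payoff = [], 0
    for _ in range(rounds):
        a, o = agent(history, rng), opponent(history, rng)
        payoff += PAYOFF[(a, o)][0]
        history.append((a, o))
    coop_rate = sum(a == "C" for a, _ in history) / rounds
    return coop_rate, payoff / rounds


print(play(stub_agent, tit_for_tat))
print(play(stub_agent, always_defect))
```

In a real audit, the stub would be replaced by a prompted model, and the same `play` loop would be repeated across seeds and prompt variants so that only the perturbed dimension changes between conditions.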

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that scientific claims drawn from LLM social simulations should be no stronger than the robustness audits supporting them. It posits that generative agents introduce architectural choices (agent specification, memory, interaction protocols) that can produce a 'butterfly effect' in which minor perturbations cascade to macro outcomes. This is supported by two case studies—a repeated Prisoner's Dilemma and a social media echo chamber—showing cooperation rate shifts of up to 76 percentage points under persona format and game-instruction changes, plus consistent polarization shifts under network homophily and hub assignment variations, with uneven sensitivity across model families. The paper introduces the TRAILS taxonomy (agent/micro, interaction/meso, system/macro levels) and advocates making robustness a first-order validation requirement.

Significance. If the reported sensitivities hold under fuller controls, the work would usefully caution the growing literature on generative-agent simulations of cooperation, polarization, and norm formation. The empirical demonstrations of large, model-dependent outcome shifts and the concrete TRAILS framework provide a practical starting point for standardizing audits, analogous to robustness checks already expected in traditional agent-based modeling. Credit is due for the reproducible-style case studies and the falsifiable prediction that un-audited claims risk reflecting implementation artifacts.

major comments (2)
  1. [Case studies] Case studies section: the manuscript reports large effect sizes (up to 76 pp cooperation shifts) and 'significant and consistent' polarization changes, yet does not detail the full set of statistical tests, exclusion criteria, or number of runs per condition. Because the central claim rests on these empirical demonstrations of fragility, the absence of these controls is load-bearing for readers' ability to assess whether the observed butterfly effects are robust to reasonable analysis choices.
  2. [Discussion / Implications] Discussion of generalizability: while the tested perturbations (persona format, framing, homophily, hub assignment) produce clear sensitivities, the paper does not include a sampling or citation analysis of how frequently these exact variations appear in published LLM social-simulation studies. This weakens the inference that the demonstrated sensitivities are representative of typical researcher practice rather than specific to the chosen conditions.
minor comments (2)
  1. [TRAILS taxonomy] TRAILS taxonomy: the three-level structure is clearly motivated, but an explicit mapping table linking each level to the specific perturbations used in the two case studies would strengthen the claim that TRAILS directly addresses the observed sensitivities.
  2. [Figures/Tables] Figure and table captions: ensure all model versions, temperature settings, and prompt templates are listed verbatim so that the reported percentage-point shifts can be exactly reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. The comments highlight opportunities to improve transparency in our empirical demonstrations and to better situate the findings within existing literature. We address each major comment below and have incorporated revisions to strengthen the manuscript while preserving its core argument.

Point-by-point responses
  1. Referee: [Case studies] Case studies section: the manuscript reports large effect sizes (up to 76 pp cooperation shifts) and 'significant and consistent' polarization changes, yet does not detail the full set of statistical tests, exclusion criteria, or number of runs per condition. Because the central claim rests on these empirical demonstrations of fragility, the absence of these controls is load-bearing for readers' ability to assess whether the observed butterfly effects are robust to reasonable analysis choices.

    Authors: We agree that additional methodological detail is warranted to allow readers to evaluate the reliability of the reported effect sizes. In the revised manuscript we have added a new subsection to the Methods that specifies the number of independent runs per condition (50 runs for the repeated Prisoner's Dilemma and 30 runs for the echo-chamber simulations), the statistical procedures (two-sample t-tests with Bonferroni correction, Cohen's d effect sizes, and chi-square tests for categorical outcomes), and the absence of any data exclusion. We have also updated the figures to display 95% confidence intervals and p-values. These changes directly support the claim that the observed shifts are not artifacts of analysis choices. revision: yes

  2. Referee: [Discussion / Implications] Discussion of generalizability: while the tested perturbations (persona format, framing, homophily, hub assignment) produce clear sensitivities, the paper does not include a sampling or citation analysis of how frequently these exact variations appear in published LLM social-simulation studies. This weakens the inference that the demonstrated sensitivities are representative of typical researcher practice rather than specific to the chosen conditions.

    Authors: We acknowledge that a systematic citation or sampling analysis would provide stronger evidence of prevalence. Such an analysis, however, would constitute a separate meta-review and lies beyond the scope of the present work, which focuses on controlled demonstrations of fragility. In the revised Discussion we have added targeted citations to recent LLM social-simulation papers that employ comparable persona formats, instruction framings, and network-construction methods, thereby illustrating that the tested perturbations are not idiosyncratic. We have also clarified the language to frame our results as cautionary examples rather than universal claims, reinforcing that robustness must be assessed per study. revision: partial
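
The rebuttal's first response names Bonferroni correction and Cohen's d. For readers unfamiliar with either, a minimal standard-library sketch follows. The function names are hypothetical and the inputs are toy numbers, not values from the paper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Toy inputs, for illustration only.
print(cohens_d([2, 4, 6], [1, 3, 5]))
print(bonferroni([0.01, 0.02, 0.04]))
```

Bonferroni is conservative when many pairwise comparisons are run, so an audit with many perturbation pairs may find fewer significant differences than uncorrected tests would suggest.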

Circularity Check

0 steps flagged

No circularity: central position rests on direct empirical case studies of outcome sensitivity rather than any self-referential derivation or fitted prediction

full rationale

The paper advances a normative claim that scientific conclusions from LLM social simulations must be bounded by robustness audits, supported by two explicit case studies (repeated Prisoner's Dilemma and echo chamber simulation) that measure large shifts in cooperation rates and polarization metrics under controlled perturbations. No equations, fitted parameters, or predictions are presented; the argument does not invoke self-citations for uniqueness theorems, smuggle ansatzes, or rename known results. The derivation chain is therefore self-contained as an empirical demonstration rather than a reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLM social simulations are intended to capture genuine social mechanisms whose robustness should be testable, plus the empirical observation that small implementation choices can cascade.

axioms (1)
  • Domain assumption: LLM-based generative agents can usefully simulate collective social processes such as cooperation and polarization.
    The paper builds its critique on the premise that these simulations are used to explain mechanisms and evaluate interventions.
invented entities (1)
  • TRAILS taxonomy (no independent evidence)
    Purpose: structured framework for auditing robustness at agent, interaction, and system levels.
    Newly introduced in the paper to address the identified validation gap.

pith-pipeline@v0.9.0 · 5834 in / 1244 out tokens · 45423 ms · 2026-05-20T13:35:57.910014+00:00 · methodology


Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 5 internal anchors

  1. [1]

    Cooperation, competition, and maliciousness: Llm-stakeholders interactive negotiation, 2023

    Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. Cooperation, competition, and maliciousness: Llm-stakeholders interactive negotiation, 2023. URLhttps: //arxiv.org/abs/2309.17234

  2. [2]

    Playing repeated games with large language models.Nature Human Behaviour, 9:1134–1143,

    Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models.Nature Human Behaviour, 9:1134–1143,

  3. [3]

    URL https://doi.org/10.1038/s41562-025 -02172-y

    doi: 10.1038/s41562-025-02172-y. URL https://doi.org/10.1038/s41562-025 -02172-y

  4. [4]

    Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang

    Altera.AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y . Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang. Project sid: Many-agent simulations toward ai civilization, 2024. URLhttps://arxiv.org/abs/2411.00114

  5. [5]

    Kozlowski, Bernard Koch, Erik Brynjolfsson, James Evans, and Michael S

    Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C. Kozlowski, Bernard Koch, Erik Brynjolfsson, James Evans, and Michael S. Bernstein. Position: LLM social simulations are a promising research method. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste- Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of...

  6. [6]

    Ai agents as policymakers in simulated epidemics

    Goshi Aoki and Navid Ghaffarzadegan. Ai agents as policymakers in simulated epidemics. arXiv preprint arXiv:2601.04245, 2026. URLhttps://arxiv.org/abs/2601.04245

  7. [7]

    Emergent social conventions and collective bias in llm populations.Science Advances, 11(20):eadu9368, 2025

    Ariel Flint Ashery, Luca Maria Aiello, and Andrea Baronchelli. Emergent social conventions and collective bias in llm populations.Science Advances, 11(20):eadu9368, 2025. URL https://www.science.org/doi/10.1126/sciadv.adu9368

  8. [8]

    Sensitivity to initial conditions in agent- based models

    Francesco Bertolotti, Angela Locoro, and Luca Mari. Sensitivity to initial conditions in agent- based models. InMulti-Agent Systems and Agreement Technologies, volume 12520 ofLecture Notes in Computer Science, pages 501–508. Springer, 2020. doi: 10.1007/978-3-030-66412-1 _32. URLhttps://doi.org/10.1007/978-3-030-66412-1_32

  9. [9]

    Playing games with gpt: What can we learn about a large language model from canonical strategic games?Economics Bulletin, 44(1):25–37, 2024

    Philip Brookins and Jason DeBacker. Playing games with gpt: What can we learn about a large language model from canonical strategic games?Economics Bulletin, 44(1):25–37, 2024. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4493398

  10. [10]

    Chateval: Towards better llm-based evaluators through multi-agent debate,

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate,

  11. [11]

    URLhttps://arxiv.org/abs/2308.07201

  12. [12]

    AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

    Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors, 2023. URLhttps://arxiv.org/abs/2308.10848

  13. [13]

    On the limits of agency in agent-based models

    Ayush Chopra, Shashank Kumar, Nurullah Giray Kuru, Ramesh Raskar, and Arnau Quera- Bofarull. On the limits of agency in agent-based models. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 500–509, 2025. URL https://dl.acm.org/doi/10.5555/3709347.3743565

  14. [15]

    URLhttps://arxiv.org/abs/2311.09665. 11

  15. [16]

    Simulating opinion dynamics with networks of llm-based agents

    Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, and Timothy Rogers. Simulating opinion dynamics with networks of llm-based agents. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 3326–3346, 2024. URL https://aclanthology.org/2024.findin gs-naacl.211/

  16. [17]

    Routledge, 2013

    Jacob Cohen.Statistical power analysis for the behavioral sciences. Routledge, 2013

  17. [18]

    Recognition of behavioural intention in repeated games using machine learning

    Alessandro Di Stefano, Chrisina Jayne, Claudio Angione, and The Anh Han. Recognition of behavioural intention in repeated games using machine learning. InArtificial Life Conference Proceedings, volume 1, page 103. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA, 2023. URL https://direct.mit.edu/isal/proceedings/isal2023/35/103/ 116860?

  18. [19]

    The butterfly effect in artificial intelligence systems: Implications for AI bias and fairness.Machine Learning with Applications, 15:100525, 2024

    Emilio Ferrara. The butterfly effect in artificial intelligence systems: Implications for AI bias and fairness.Machine Learning with Applications, 15:100525, 2024. URL https: //doi.org/10.1016/j.mlwa.2024.100525

  19. [20]

    Agent- based modelling meets generative ai in social network simulations

    Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Or- lando, Diego Russo, Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. Agent- based modelling meets generative ai in social network simulations. InInternational Con- ference on Advances in Social Networks Analysis and Mining, pages 155–170, 2024. URL https://link.spri...

  20. [21]

    Nicoló Fontana, Francesco Pierri, and Luca Maria Aiello. Nicer than humans: How do large language models behave in the prisoner’s dilemma? InProceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 522–535, 2025. URL https: //arxiv.org/abs/2406.13605

  21. [22]

    Evaluating effect size in psychological research: Sense and nonsense.Advances in Methods and Practices in Psychological Science, 2(2):156–168, 2019

    David C Funder and Daniel J Ozer. Evaluating effect size in psychological research: Sense and nonsense.Advances in Methods and Practices in Psychological Science, 2(2):156–168, 2019. URLhttps://journals.sagepub.com/doi/10.1177/2515245919847202

  22. [23]

    Large language models empowered agent-based modeling and simulation: A survey and perspectives.Humanities and Social Sciences Communications, 11(1):1259, 2024

    Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives.Humanities and Social Sciences Communications, 11(1):1259, 2024. doi: 10.1057/s41599-024-03611-3. URLhttps://doi.org/10.1057/s41599-024-03611-3

  23. [24]

    Large language model driven agents for simulating echo chamber formation

    Chenhao Gu, Ling Luo, Zainab Razia Zaidi, and Shanika Karunasekera. Large language model driven agents for simulating echo chamber formation. arXiv preprint arXiv:2502.18138, 2025. URL https://arxiv.org/abs/2502.18138

  24. [25]

    A simple sequentially rejective multiple test procedure

    Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. URL https://www.jstor.org/stable/4615733

  25. [26]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024. URL https://...

  26. [27]

    Large language models as simulated economic agents: What can we learn from homo silicus?

    John J Horton, Apostolos Filippas, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023. URL https://dl.acm.org/doi/10.1145/3670865.3673513

  27. [28]

    Can a society of generative agents simulate human behavior and inform public health policy? a case study on vaccine hesitancy

    Abe Bohan Hou, Hongru Du, Yichen Wang, Jingyu Zhang, Zixiao Wang, Paul Pu Liang, Daniel Khashabi, Lauren Gardner, and Tianxing He. Can a society of generative agents simulate human behavior and inform public health policy? a case study on vaccine hesitancy. arXiv preprint arXiv:2503.09639, 2025. URL https://arxiv.org/abs/2503.09639

  28. [29]

    War and peace (waragent): Large language model-based multi-agent simulation of world wars

    Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars, 2023. URL https://arxiv.org/abs/2311.17227

  29. [30]

    Policysim: An llm-based agent social simulation sandbox for proactive policy optimization

    Renhong Huang, Ning Tang, Jiarong Xu, Yuxuan Cao, Qingqian Tu, Sheng Guo, Bo Zheng, Huiyuan Liu, and Yang Yang. Policysim: An llm-based agent social simulation sandbox for proactive policy optimization. In Proceedings of the ACM Web Conference 2026, pages 4781–4792, 2026. URL https://dl.acm.org/doi/abs/10.1145/3774904.3792555

  30. [31]

    Explicit cooperation shapes human-like multi-agent llm negotiation

    Yanru Jiang and Gülşah Akçakır. Explicit cooperation shapes human-like multi-agent llm negotiation. In Proceedings of the 1st ICWSM Workshop on Integrating NLP and Psychology to Study Social Interactions, 2025. doi: 10.36190/2025.34. URL https://workshop-proceedings.icwsm.org/abstract.php?id=2025_34

  31. [32]

    Validation is the central challenge for generative social simulation: A critical review of llms in agent-based modeling

    Maik Larooij and Petter Törnberg. Validation is the central challenge for generative social simulation: A critical review of llms in agent-based modeling. Artificial Intelligence Review, 59(1):15, 2025. doi: 10.1007/s10462-025-11412-6. URL https://doi.org/10.1007/s10462-025-11412-6

  32. [33]

    Llm generated persona is a promise with a catch

    Ang Li, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng. Llm generated persona is a promise with a catch. arXiv preprint arXiv:2503.16527, 2025. URL https://neurips.cc/virtual/2025/loc/san-diego/poster/121924

  33. [34]

    Econagent: Large language model-empowered agents for simulating macroeconomic activities

    Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: Large language model-empowered agents for simulating macroeconomic activities. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15523–15536. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.829/

  34. [35]

    Spontaneous giving and calculated greed in language models

    Yuxuan Li and Hirokazu Shirado. Spontaneous giving and calculated greed in language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5271–5286, 2025. URL https://aclanthology.org/2025.emnlp-main.267/

  35. [36]

    Mosaic: Modeling social ai for content dissemination and regulation in multi-agent simulations

    Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, and Saadia Gabriel. Mosaic: Modeling social ai for content dissemination and regulation in multi-agent simulations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6390–6417. Association for Computational Linguistics, 2025. doi: 10.18653/...

  36. [37]

    From skepticism to acceptance: Simulating the attitude dynamics toward fake news

    Yuhan Liu, Xiuying Chen, Xiaoqing Zhang, Xing Gao, Ji Zhang, and Rui Yan. From skepticism to acceptance: Simulating the attitude dynamics toward fake news. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 7886–7894. International Joint Conferences on Artificial Intelligence Organization, 2024. doi: 10.249...

  37. [38]

    Strategic behavior of large language models and the role of game structure versus contextual framing

    Nunzio Lorè and Babak Heydari. Strategic behavior of large language models and the role of game structure versus contextual framing. Scientific Reports, 14(1):18490, 2024. URL https://www.nature.com/articles/s41598-024-69032-z

  38. [39]

    From factors to actors: Computational sociology and agent-based modeling

    Michael W. Macy and Robert Willer. From factors to actors: Computational sociology and agent-based modeling. Annual Review of Sociology, 28(1):143–166, 2002. doi: 10.1146/annurev.soc.28.110601.141117. URL https://www.annualreviews.org/content/journals/10.1146/annurev.soc.28.110601.141117

  39. [40]

    Roco: Dialectic multi-robot collaboration with large language models

    Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URL https://arxiv.org/abs/2307.04738

  40. [41]

    Mf-llm: Simulating collective decision dynamics via a mean-field large language model framework

    Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, and Jun Wang. Mf-llm: Simulating collective decision dynamics via a mean-field large language model framework, 2025. URL https://arxiv.org/abs/2504.21582

  41. [42]

    State of what art? a call for multi-prompt llm evaluation

    Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? a call for multi-prompt llm evaluation. Transactions of the Association for Computational Linguistics, 12:933–949, 2024. URL https://aclanthology.org/2024.tacl-1.52/

  42. [43]

    Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation

    Xinyi Mou, Zhongyu Wei, Qi Huang, and Xuanjing Wu. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4789–4809. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-acl.285. URL https://aclantholog...

  43. [44]

    Emergent coordinated behaviors in networked llm agents: Modeling the strategic dynamics of information operations

    Gian Marco Orlando, Jinyi Ye, Valerio La Gatta, Mahdi Saeedi, Vincenzo Moscato, Emilio Ferrara, and Luca Luceri. Emergent coordinated behaviors in networked llm agents: Modeling the strategic dynamics of information operations. In Proceedings of the ACM Web Conference 2026, pages 4805–4816, 2026

  44. [45]

    Validation and verification of agent-based models in the social sciences

    Paul Ormerod and Bridget Rosewell. Validation and verification of agent-based models in the social sciences. In Flaminio Squazzoni, editor, Epistemological Aspects of Computer Simulation in the Social Sciences, volume 5466 of Lecture Notes in Computer Science, pages 130–140. Springer, Berlin, Heidelberg, 2009. doi: 10.1007/978-3-642-01109-2_10. URL https://...

  45. [46]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23. Association for Computing Machinery, 2023. doi: 10.1145/3586183.3606763. URL https:...

  46. [47]

    Large language models sensitivity to the order of options in multiple-choice questions

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico, 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.130. URL https://aclanthology.o...

  47. [48]

    The future is now: Revolutionising decision-making with ai-driven simulations

    PHF Science. The future is now: Revolutionising decision-making with ai-driven simulations. https://www.phfscience.nz/news-publications/the-future-is-now-revolutionising-decision-making-with-ai-driven-simulations/, December 2024. Accessed: 2026-05-02

  48. [49]

    Emergence of human-like polarization among large language model agents

    Jinghua Piao, Zhihong Lu, Chen Gao, Fengli Xu, Qinghua Hu, Fernando P Santos, Yong Li, and James Evans. Emergence of human-like polarization among large language model agents. arXiv preprint arXiv:2501.05171, 2025. URL https://arxiv.org/abs/2501.05171

  49. [50]

    AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

    Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025. URL https:/...

  50. [51]

    Sandboxsocial: A sandbox for social media using multimodal ai agents

    Maximilian Puelma Touzel, Sneheel Sarangi, Gayatri Krishnakumar, Busra Tugce Gurbuz, Austin Welch, Zachary Yang, Andreea Musulan, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Camille Thibault, Reihaneh Rabbany, Jean-François Godbout, Dan Zhao, and Kellin Pelrine. Sandboxsocial: A sandbox for social media using multimodal ai agents. In Proceedings of the Thirty-Fou...

  51. [52]

    Position: Time to close the validation gap in llm social simulations

    Maximilian Puelma Touzel, Sneheel Sarangi, Aurelien Bück-Kaeffer, Zachary Yang, Jean-François Godbout, and Reihaneh Rabbany. Position: Time to close the validation gap in llm social simulations, 2026. URL https://www.complexdatalab.com/stamina/papers/puelmatouzel_CloseEvalGap.pdf. Preprint

  52. [53]

    ChatDev: Communicative Agents for Software Development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186. As...

  53. [54]

    Benchmarking prompt sensitivity in large language models

    Aryan Razavi, Aref Jafari, Alona Fyshe, and Gholamreza Haffari. Benchmarking prompt sensitivity in large language models. arXiv preprint arXiv:2502.06065, 2025. URL https://arxiv.org/abs/2502.06065

  54. [55]

    Simworld: An open-ended realistic simulator for autonomous agents in physical and social worlds

    Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. Simworld: An open-ended realistic s...

  55. [56]

    Bases: Large-scale web search user simulation with large language model based agents

    Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024. URL https://aclanthology.org/2024.findings-emnlp.50/

  56. [57]

    Emergence of social norms in generative agent societies: Principles and architecture

    Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, and Shuyue Hu. Emergence of social norms in generative agent societies: Principles and architecture. In Kate Larson, editor, Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 7895–7903. International Joint Conferences on Artificial Intelligence Organizati...

  57. [58]

    Flocks, herds and schools: A distributed behavioral model

    Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, pages 25–34. Association for Computing Machinery, 1987. doi: 10.1145/37402.37406. URL https://dl.acm.org/doi/10.1145/37402.37406

  58. [59]

    The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance

    Abel Salinas and Fred Morstatter. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4629–4651, 2024. URL https://aclanthology.org/2024.findings-acl.275/

  59. [60]

    Global Sensitivity Analysis: The Primer

    Andrea Saltelli, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and Stefano Tarantola. Global Sensitivity Analysis: The Primer. John Wiley & Sons, Chichester, UK, 2008. ISBN 9780470059975. doi: 10.1002/9780470725184. URL https://doi.org/10.1002/9780470725184

  60. [61]

    Robert G. Sargent. Verification and validation of simulation models. In Proceedings of the 2010 Winter Simulation Conference, pages 166–183. IEEE, 2010. doi: 10.1109/WSC.2010.5679166. URL https://doi.org/10.1109/WSC.2010.5679166

  61. [62]

    Dynamic models of segregation

    Thomas C. Schelling. Dynamic models of segregation. Journal of Mathematical Sociology, 1(2):143–186, 1971. doi: 10.1080/0022250X.1971.9989794. URL https://www.tandfonline.com/doi/abs/10.1080/0022250X.1971.9989794

  62. [63]

    Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.11324

  63. [64]

    Specification curve analysis

    Uri Simonsohn, Joseph P Simmons, and Leif D Nelson. Specification curve analysis. Nature Human Behaviour, 4(11):1208–1214, 2020. URL https://www.nature.com/articles/s41562-020-0912-z

  64. [65]

    Increasing transparency through a multiverse analysis

    Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712, 2016. URL https://pubmed.ncbi.nlm.nih.gov/27694465/

  66. [67]

    Simulating social media using large language models to evaluate alternative news feed algorithms

    Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. Simulating social media using large language models to evaluate alternative news feed algorithms. arXiv preprint arXiv:2310.05984, 2023. URL https://arxiv.org/abs/2310.05984

  67. [68]

    Multi-actor generative artificial intelligence as a game engine

    Alexander Sasha Vezhnevets, Jayd Matyas, Logan Cross, Davide Paglieri, Minsuk Chang, William A. Cunningham, Simon Osindero, William S. Isaac, and Joel Z. Leibo. Multi-actor generative artificial intelligence as a game engine, 2025. URL https://arxiv.org/abs/2507.08892

  68. [69]

    Decoding echo chambers: Llm-powered simulations revealing polarization in social networks

    Chenxi Wang, Zongfang Liu, Dequan Yang, and Xiuying Chen. Decoding echo chambers: Llm-powered simulations revealing polarization in social networks. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3913–3923, 2025. URL https://aclanthology.org/2025.coling-main.264/

  69. [70]

    YuLan-OneSim: Towards the next generation of social simulator with large language models

    Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, and Ji-Rong Wen. YuLan-OneSim: Towards the next generation of social simulator with large language models. In NeurIPS 2025 Workshop on Scientific Methods for Understanding Deep Learning, 2025. URL https://arxiv.org/abs/2505.07581

  70. [71]

    User behavior simulation with large language model-based agents

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. User behavior simulation with large language model-based agents. ACM Transactions on Information Systems, 43(2):1–37, 2025. doi: 10.1145/3708985. URL https://doi.org/10.1145/3708985

  71. [72]

    Humanoid agents: Platform for simulating human-like generative agents

    Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. Humanoid agents: Platform for simulating human-like generative agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 167–176. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-demo.15. URL https://aclanthology...

  72. [73]

    Making models match: Replicating an agent-based model

    Uri Wilensky and William Rand. Making models match: Replicating an agent-based model. Journal of Artificial Societies and Social Simulation, 10(4):2, 2007. URL https://www.jasss.org/10/4/2.html

  73. [74]

    Will systems of llm agents lead to cooperation: An investigation into a social dilemma

    Richard Willis, Yali Du, and Joel Z Leibo. Will systems of llm agents lead to cooperation: An investigation into a social dilemma. In 24th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2025, pages 2786–2788. International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2025. URL https://dl.acm.org/doi/10.55...

  74. [75]

    Empirical validation of agent-based models: Alternatives and prospects

    Paul Windrum, Giorgio Fagiolo, and Alessio Moneta. Empirical validation of agent-based models: Alternatives and prospects. Journal of Artificial Societies and Social Simulation, 10(2):8, 2007. URL https://ideas.repec.org/a/jas/jasssj/2006-40-2.html

  75. [76]

    Twinmarket: A scalable behavioral and social simulation for financial markets

    Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), volume 39 of NeurIPS, 2025. URL https://arxiv.org/abs/2502.01506

  76. [77]

    Oasis: Open agent social interaction simulations with one million agents

    Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, and Jing Shao. Oasis: Open agent social interaction simulations with one million agent...

  77. [78]

    Simulating social network with llm agents: an analysis of information propagation and echo chambers

    Wenzhen Zheng and Xijin Tang. Simulating social network with llm agents: an analysis of information propagation and echo chambers. In International Symposium on Knowledge and Systems Sciences, pages 63–77. Springer, 2024. URL https://link.springer.com/chapter/10.1007/978-981-96-0178-3_5

  78. [79]

    The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

    Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, and Maarten Sap. The pimmur principles: Ensuring validity in collective behavior of llm societies. arXiv preprint arXiv:2509.18052, 2025. URL https://arxiv.org/abs/2509.18052

  79. [80]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854

  80. [81]

    Sotopia: Interactive evaluation for social intelligence in language agents

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Zhengyang Qi, Haofei Yu, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=mM7VurbA4r
