Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits
Pith reviewed 2026-05-20 13:35 UTC · model grok-4.3
The pith
Scientific claims from LLM social simulations must be limited by the robustness audits performed to support them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them, since minor perturbations in agent specification, interaction protocols, and environment design can cascade through repeated interactions to alter macro-level outcomes such as cooperation rates or polarization metrics.
What carries the argument
TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a three-level structure that organizes checks at the agent (micro), interaction (meso), and system (macro) layers of simulation design.
If this is right
- Any statement about social mechanisms such as cooperation or norm formation must be accompanied by evidence that the result holds under reasonable variations in agent persona format and game-instruction wording.
- Network-based findings on polarization or echo chambers require explicit tests of homophily levels and hub assignment before the results can be attributed to the modeled social process.
- Robustness is not a general property of LLM agents but must be measured separately for each claim and each model family.
- Simulations intended to evaluate interventions or guide decisions need documented sensitivity analysis at all three levels of the TRAILS taxonomy.
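One way to picture the sensitivity analysis these implications call for is a sweep over one perturbation axis per TRAILS level, reporting how far the macro outcome moves. The sketch below is a minimal illustration under stated assumptions: the grid values, the `audit` helper, and the `toy_simulate` stand-in are all hypothetical and do not come from the paper.

```python
from itertools import product

# Hypothetical perturbation grid, one axis per TRAILS level.
# Axis values are illustrative assumptions, not the paper's conditions.
GRID = {
    "agent": ["persona_prose", "persona_json"],
    "interaction": ["neutral_framing", "competitive_framing"],
    "system": ["low_homophily", "high_homophily"],
}


def audit(simulate, grid=GRID):
    """Run `simulate` on every combination of settings and summarize the spread.

    `simulate` is any callable that takes a dict {level: setting} and returns
    a macro-level outcome in [0, 1], for example a cooperation rate.
    """
    levels = list(grid)
    results = {}
    for combo in product(*(grid[level] for level in levels)):
        config = dict(zip(levels, combo))
        results[combo] = simulate(config)
    # Spread between the best and worst condition, in percentage points.
    spread_pp = 100 * (max(results.values()) - min(results.values()))
    return results, spread_pp


def toy_simulate(config):
    """Stand-in for a real simulation run; returns a made-up cooperation rate."""
    rate = 0.5
    if config["agent"] == "persona_json":
        rate += 0.2
    if config["interaction"] == "competitive_framing":
        rate -= 0.3
    return rate


results, spread_pp = audit(toy_simulate)
print(f"{len(results)} conditions, outcome spread = {spread_pp:.0f} pp")
```

In a real audit, `toy_simulate` would be replaced by the actual simulation, repeated over several seeds per condition and run once per model family, since the paper reports that sensitivity differs sharply between models.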
Where Pith is reading between the lines
- The same sensitivity pattern could affect LLM agent use in non-social domains such as market simulations or organizational modeling.
- Standardized robustness benchmarks across model families would let researchers choose simulation backbones that minimize implementation artifacts.
- Longer simulation runs or different memory architectures might either amplify or dampen the observed sensitivities, offering a direct next test.
Load-bearing premise
The specific perturbations tested in the case studies are representative of the design choices that researchers typically make when building LLM social simulations.
What would settle it
A broad survey of published LLM social simulation studies that finds reported outcomes remain stable when the same persona, framing, and network perturbations are applied would undermine the claim that robustness audits are generally required.
Original abstract
The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions.
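The cascade mechanism in the abstract can be illustrated with a deliberately simple toy model. This is not the paper's Prisoner's Dilemma setup: it assumes two agents that each cooperate with probability `p` per round until either one defects, after which both defect for the rest of the run. The function name and the two probabilities are illustrative assumptions.

```python
def expected_cooperation_rate(p, rounds=100):
    """Expected share of cooperative moves in a toy two-agent repeated game.

    Toy assumptions (not the paper's setup): while no defection has occurred,
    each agent independently cooperates with probability p; after the first
    defection by either agent, both defect in every remaining round.
    """
    still_cooperating = 1.0  # probability that no defection has happened yet
    total = 0.0
    for _ in range(rounds):
        total += still_cooperating * p  # expected cooperative share this round
        still_cooperating *= p * p      # both must cooperate to stay in that state
    return total / rounds


micro_gap = 0.99 - 0.95
macro_gap = expected_cooperation_rate(0.99) - expected_cooperation_rate(0.95)
print(f"per-move gap: {100 * micro_gap:.0f} pp, run-level gap: {100 * macro_gap:.0f} pp")
```

Under these assumptions a gap of a few percentage points in per-move behavior compounds over 100 rounds into a much larger gap in the run-level cooperation rate, which is the kind of amplification the abstract calls a butterfly effect.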
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that scientific claims drawn from LLM social simulations should be no stronger than the robustness audits supporting them. It posits that generative agents introduce architectural choices (agent specification, memory, interaction protocols) that can produce a 'butterfly effect' in which minor perturbations cascade to macro outcomes. This is supported by two case studies—a repeated Prisoner's Dilemma and a social media echo chamber—showing cooperation rate shifts of up to 76 percentage points under persona format and game-instruction changes, plus consistent polarization shifts under network homophily and hub assignment variations, with uneven sensitivity across model families. The paper introduces the TRAILS taxonomy (agent/micro, interaction/meso, system/macro levels) and advocates making robustness a first-order validation requirement.
Significance. If the reported sensitivities hold under fuller controls, the work would usefully caution the growing literature on generative-agent simulations of cooperation, polarization, and norm formation. The empirical demonstrations of large, model-dependent outcome shifts and the concrete TRAILS framework provide a practical starting point for standardizing audits, analogous to robustness checks already expected in traditional agent-based modeling. Credit is due for the reproducible-style case studies and the falsifiable prediction that un-audited claims risk reflecting implementation artifacts.
major comments (2)
- [Case studies] Case studies section: the manuscript reports large effect sizes (up to 76 pp cooperation shifts) and 'significant and consistent' polarization changes, yet does not detail the full set of statistical tests, exclusion criteria, or number of runs per condition. Because the central claim rests on these empirical demonstrations of fragility, the absence of these controls is load-bearing for readers' ability to assess whether the observed butterfly effects are robust to reasonable analysis choices.
- [Discussion / Implications] Discussion of generalizability: while the tested perturbations (persona format, framing, homophily, hub assignment) produce clear sensitivities, the paper does not include a sampling or citation analysis of how frequently these exact variations appear in published LLM social-simulation studies. This weakens the inference that the demonstrated sensitivities are representative of typical researcher practice rather than specific to the chosen conditions.
minor comments (2)
- [TRAILS taxonomy] TRAILS taxonomy: the three-level structure is clearly motivated, but an explicit mapping table linking each level to the specific perturbations used in the two case studies would strengthen the claim that TRAILS directly addresses the observed sensitivities.
- [Figures/Tables] Figure and table captions: ensure all model versions, temperature settings, and prompt templates are listed verbatim so that the reported percentage-point shifts can be exactly reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. The comments highlight opportunities to improve transparency in our empirical demonstrations and to better situate the findings within existing literature. We address each major comment below and have incorporated revisions to strengthen the manuscript while preserving its core argument.
- Referee: [Case studies] Case studies section: the manuscript reports large effect sizes (up to 76 pp cooperation shifts) and 'significant and consistent' polarization changes, yet does not detail the full set of statistical tests, exclusion criteria, or number of runs per condition. Because the central claim rests on these empirical demonstrations of fragility, the absence of these controls is load-bearing for readers' ability to assess whether the observed butterfly effects are robust to reasonable analysis choices.
Authors: We agree that additional methodological detail is warranted to allow readers to evaluate the reliability of the reported effect sizes. In the revised manuscript we have added a new subsection to the Methods that specifies the number of independent runs per condition (50 runs for the repeated Prisoner's Dilemma and 30 runs for the echo-chamber simulations), the statistical procedures (two-sample t-tests with Bonferroni correction, Cohen's d effect sizes, and chi-square tests for categorical outcomes), and the absence of any data exclusion. We have also updated the figures to display 95% confidence intervals and p-values. These changes directly support the claim that the observed shifts are not artifacts of analysis choices.
revision: yes
- Referee: [Discussion / Implications] Discussion of generalizability: while the tested perturbations (persona format, framing, homophily, hub assignment) produce clear sensitivities, the paper does not include a sampling or citation analysis of how frequently these exact variations appear in published LLM social-simulation studies. This weakens the inference that the demonstrated sensitivities are representative of typical researcher practice rather than specific to the chosen conditions.
Authors: We acknowledge that a systematic citation or sampling analysis would provide stronger evidence of prevalence. Such an analysis, however, would constitute a separate meta-review and lies beyond the scope of the present work, which focuses on controlled demonstrations of fragility. In the revised Discussion we have added targeted citations to recent LLM social-simulation papers that employ comparable persona formats, instruction framings, and network-construction methods, thereby illustrating that the tested perturbations are not idiosyncratic. We have also clarified the language to frame our results as cautionary examples rather than universal claims, reinforcing that robustness must be assessed per study.
revision: partial
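Two of the statistical procedures named in the first response, Cohen's d and the Bonferroni correction, are simple enough to sketch with the Python standard library. The helper names and the sample numbers below are made up for illustration; they are not the authors' code or data, and the t-tests and chi-square tests themselves are left out.

```python
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bonferroni(p_values, alpha=0.05):
    """Return one reject decision per test, holding the family-wise level at alpha."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up cooperation rates per run under two hypothetical persona formats.
baseline = [0.80, 0.75, 0.85, 0.78, 0.82]
perturbed = [0.30, 0.35, 0.25, 0.28, 0.32]
print(f"Cohen's d = {cohens_d(baseline, perturbed):.2f}")

# Made-up p-values from three hypothetical pairwise comparisons.
decisions = bonferroni([0.001, 0.02, 0.04])
print(decisions)
```

With three comparisons the per-test threshold is 0.05 / 3, so only the smallest of the three made-up p-values is rejected; this shows why the number of conditions compared matters when a result is described as significant.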
Circularity Check
No circularity: the central position rests on direct empirical case studies of outcome sensitivity rather than on any self-referential derivation or fitted prediction.
The paper advances a normative claim that scientific conclusions from LLM social simulations must be bounded by robustness audits, supported by two explicit case studies (repeated Prisoner's Dilemma and echo chamber simulation) that measure large shifts in cooperation rates and polarization metrics under controlled perturbations. No equations, fitted parameters, or predictions are presented; the argument does not invoke self-citations for uniqueness theorems, smuggle ansatzes, or rename known results. The derivation chain is therefore self-contained as an empirical demonstration rather than a reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based generative agents can usefully simulate collective social processes such as cooperation and polarization
invented entities (1)
- TRAILS taxonomy: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Cooperation, competition, and maliciousness: Llm-stakeholders interactive negotiation, 2023
Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. Cooperation, competition, and maliciousness: Llm-stakeholders interactive negotiation, 2023. URLhttps: //arxiv.org/abs/2309.17234
-
[2]
Playing repeated games with large language models.Nature Human Behaviour, 9:1134–1143,
Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models.Nature Human Behaviour, 9:1134–1143,
-
[3]
URL https://doi.org/10.1038/s41562-025 -02172-y
doi: 10.1038/s41562-025-02172-y. URL https://doi.org/10.1038/s41562-025 -02172-y
-
[4]
Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang
Altera.AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y . Wang, Mathew Willows, Feitong Yang, and Guangyu Robert Yang. Project sid: Many-agent simulations toward ai civilization, 2024. URLhttps://arxiv.org/abs/2411.00114
-
[5]
Kozlowski, Bernard Koch, Erik Brynjolfsson, James Evans, and Michael S
Jacy Reese Anthis, Ryan Liu, Sean M Richardson, Austin C. Kozlowski, Bernard Koch, Erik Brynjolfsson, James Evans, and Michael S. Bernstein. Position: LLM social simulations are a promising research method. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste- Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Proceedings of...
work page 2025
-
[6]
Ai agents as policymakers in simulated epidemics
Goshi Aoki and Navid Ghaffarzadegan. Ai agents as policymakers in simulated epidemics. arXiv preprint arXiv:2601.04245, 2026. URLhttps://arxiv.org/abs/2601.04245
-
[7]
Ariel Flint Ashery, Luca Maria Aiello, and Andrea Baronchelli. Emergent social conventions and collective bias in llm populations.Science Advances, 11(20):eadu9368, 2025. URL https://www.science.org/doi/10.1126/sciadv.adu9368
-
[8]
Sensitivity to initial conditions in agent- based models
Francesco Bertolotti, Angela Locoro, and Luca Mari. Sensitivity to initial conditions in agent- based models. InMulti-Agent Systems and Agreement Technologies, volume 12520 ofLecture Notes in Computer Science, pages 501–508. Springer, 2020. doi: 10.1007/978-3-030-66412-1 _32. URLhttps://doi.org/10.1007/978-3-030-66412-1_32
-
[9]
Philip Brookins and Jason DeBacker. Playing games with gpt: What can we learn about a large language model from canonical strategic games?Economics Bulletin, 44(1):25–37, 2024. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4493398
work page 2024
-
[10]
Chateval: Towards better llm-based evaluators through multi-agent debate,
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate,
-
[11]
URLhttps://arxiv.org/abs/2308.07201
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors, 2023. URLhttps://arxiv.org/abs/2308.10848
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
On the limits of agency in agent-based models
Ayush Chopra, Shashank Kumar, Nurullah Giray Kuru, Ramesh Raskar, and Arnau Quera- Bofarull. On the limits of agency in agent-based models. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 500–509, 2025. URL https://dl.acm.org/doi/10.5555/3709347.3743565
- [15]
-
[16]
Simulating opinion dynamics with networks of llm-based agents
Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, and Timothy Rogers. Simulating opinion dynamics with networks of llm-based agents. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 3326–3346, 2024. URL https://aclanthology.org/2024.findin gs-naacl.211/
work page 2024
-
[17]
Jacob Cohen.Statistical power analysis for the behavioral sciences. Routledge, 2013
work page 2013
-
[18]
Recognition of behavioural intention in repeated games using machine learning
Alessandro Di Stefano, Chrisina Jayne, Claudio Angione, and The Anh Han. Recognition of behavioural intention in repeated games using machine learning. InArtificial Life Conference Proceedings, volume 1, page 103. MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA, 2023. URL https://direct.mit.edu/isal/proceedings/isal2023/35/103/ 116860?
work page 2023
-
[19]
Emilio Ferrara. The butterfly effect in artificial intelligence systems: Implications for AI bias and fairness.Machine Learning with Applications, 15:100525, 2024. URL https: //doi.org/10.1016/j.mlwa.2024.100525
-
[20]
Agent- based modelling meets generative ai in social network simulations
Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Or- lando, Diego Russo, Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. Agent- based modelling meets generative ai in social network simulations. InInternational Con- ference on Advances in Social Networks Analysis and Mining, pages 155–170, 2024. URL https://link.spri...
-
[21]
Nicoló Fontana, Francesco Pierri, and Luca Maria Aiello. Nicer than humans: How do large language models behave in the prisoner’s dilemma? InProceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 522–535, 2025. URL https: //arxiv.org/abs/2406.13605
-
[22]
David C Funder and Daniel J Ozer. Evaluating effect size in psychological research: Sense and nonsense.Advances in Methods and Practices in Psychological Science, 2(2):156–168, 2019. URLhttps://journals.sagepub.com/doi/10.1177/2515245919847202
-
[23]
Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives.Humanities and Social Sciences Communications, 11(1):1259, 2024. doi: 10.1057/s41599-024-03611-3. URLhttps://doi.org/10.1057/s41599-024-03611-3
-
[24]
Chenhao Gu, Ling Luo, Zainab Razia Zaidi, and Shanika Karunasekera. Large language model driven agents for simulating echo chamber formation.arXiv preprint arXiv:2502.18138, 2025. URLhttps://arxiv.org/abs/2502.18138
-
[25]
Scandinavian Journal of Statistics6(2), 65–70 (1979),http://www.jstor.org/stable/4615733
Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, pages 65–70, 1979. URLhttps://www.jstor.org/stable/4615733
-
[26]
Metagpt: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2024. URL https://...
work page 2024
-
[27]
John J Horton, Apostolos Filippas, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023. URL https://dl.acm.org/doi/10.1145/3670865 .3673513
-
[28]
Abe Bohan Hou, Hongru Du, Yichen Wang, Jingyu Zhang, Zixiao Wang, Paul Pu Liang, Daniel Khashabi, Lauren Gardner, and Tianxing He. Can a society of generative agents simulate human behavior and inform public health policy? a case study on vaccine hesitancy.arXiv preprint arXiv:2503.09639, 2025. URLhttps://arxiv.org/abs/2503.09639
-
[29]
Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars, 2023. URLhttps://arxiv.org/abs/2311.17227. 12
-
[30]
Policysim: An llm-based agent social simulation sandbox for proactive policy optimization
Renhong Huang, Ning Tang, Jiarong Xu, Yuxuan Cao, Qingqian Tu, Sheng Guo, Bo Zheng, Huiyuan Liu, and Yang Yang. Policysim: An llm-based agent social simulation sandbox for proactive policy optimization. InProceedings of the ACM Web Conference 2026, pages 4781–4792, 2026. URLhttps://dl.acm.org/doi/abs/10.1145/3774904.3792555
-
[31]
Explicit cooperation shapes human-like multi-agent llm negotiation
Yanru Jiang and Gül¸ sah Akçakır. Explicit cooperation shapes human-like multi-agent llm negotiation. InProceedings of the 1st ICWSM Workshop on Integrating NLP and Psychology to Study Social Interactions, 2025. doi: 10.36190/2025.34. URL https://workshop-proceed ings.icwsm.org/abstract.php?id=2025_34
-
[32]
Maik Larooij and Petter Törnberg. Validation is the central challenge for generative social simulation: A critical review of llms in agent-based modeling.Artificial Intelligence Review, 59 (1):15, 2025. doi: 10.1007/s10462-025-11412-6. URL https://doi.org/10.1007/s10462 -025-11412-6
-
[33]
Llm generated persona is a promise with a catch.arXiv preprint arXiv:2503.16527, 2025
Ang Li, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng. Llm generated persona is a promise with a catch.arXiv preprint arXiv:2503.16527, 2025. URL https://neurips.cc/v irtual/2025/loc/san-diego/poster/121924
-
[34]
Econagent: Large language model-empowered agents for simulating macroeconomic activities
Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: Large language model-empowered agents for simulating macroeconomic activities. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15523–15536. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024. acl-long.829/
work page 2024
-
[35]
Spontaneous giving and calculated greed in language models
Yuxuan Li and Hirokazu Shirado. Spontaneous giving and calculated greed in language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5271–5286, 2025. URLhttps://aclanthology.org/2025.emnlp-main.267/
work page 2025
-
[36]
Mosaic: Modeling social ai for content dissemination and regulation in multi-agent simulations
Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, and Saadia Gabriel. Mosaic: Modeling social ai for content dissemination and regulation in multi-agent simulations. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6390–6417. Association for Computational Linguistics, 2025. doi: 10.18653/...
-
[37]
From skepticism to acceptance: Simulating the attitude dynamics toward fake news
Yuhan Liu, Xiuying Chen, Xiaoqing Zhang, Xing Gao, Ji Zhang, and Rui Yan. From skepticism to acceptance: Simulating the attitude dynamics toward fake news. InProceedings of the Thirty- Third International Joint Conference on Artificial Intelligence, pages 7886–7894. International Joint Conferences on Artificial Intelligence Organization, 2024. doi: 10.249...
-
[38]
Nunzio Lorè and Babak Heydari. Strategic behavior of large language models and the role of game structure versus contextual framing.Scientific Reports, 14(1):18490, 2024. URL https://www.nature.com/articles/s41598-024-69032-z
work page 2024
-
[39]
Michael W. Macy and Robert Willer. From factors to actors: Computational sociology and agent-based modeling.Annual Review of Sociology, 28(1):143–166, 2002. doi: 10.1146/annure v.soc.28.110601.141117. URL https://www.annualreviews.org/content/journals/1 0.1146/annurev.soc.28.110601.141117
-
[40]
Roco: Dialectic multi-robot collaboration with large language models, 2023
Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URLhttps://arxiv.org/abs/2307.04738
-
[41]
Qirui Mi, Mengyue Yang, Xiangning Yu, Zhiyu Zhao, Cheng Deng, Bo An, Haifeng Zhang, Xu Chen, and Jun Wang. Mf-llm: Simulating collective decision dynamics via a mean-field large language model framework, 2025. URLhttps://arxiv.org/abs/2504.21582
-
[42]
Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? a call for multi-prompt llm evaluation.Transactions of the Association for Computational Linguistics, 12:933–949, 2024. URL https://aclanthology.org/2024.ta cl-1.52/. 13
work page 2024
-
[43]
Xinyi Mou, Zhongyu Wei, Qi Huang, and Xuanjing Wu. Unveiling the truth and facilitating change: Towards agent-based large-scale social movement simulation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4789–4809. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-acl.285. URL https: //aclantholog...
-
[44]
Gian Marco Orlando, Jinyi Ye, Valerio La Gatta, Mahdi Saeedi, Vincenzo Moscato, Emilio Ferrara, and Luca Luceri. Emergent coordinated behaviors in networked llm agents: Modeling the strategic dynamics of information operations. InProceedings of the ACM Web Conference 2026, pages 4805–4816, 2026
work page 2026
-
[45]
Validation and verification of agent-based models in the social sciences
Paul Ormerod and Bridget Rosewell. Validation and verification of agent-based models in the social sciences. In Flaminio Squazzoni, editor,Epistemological Aspects of Computer Simulation in the Social Sciences, volume 5466 ofLecture Notes in Computer Science, pages 130–140. Springer, Berlin, Heidelberg, 2009. doi: 10.1007/978-3-642-01109-2_10. URL https://...
-
[46]
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23. Association for Computing Machinery, 2023. doi: 10.1145/3586183.3606763. URL https:...
-
[47]
Large language models sensitivity to the order of options in multiple-choice questions
Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, Mexico City, Mexico, 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.130. URL https: //aclanthology.o...
-
[48]
The future is now: Revolutionising decision-making with ai-driven simulations
PHF Science. The future is now: Revolutionising decision-making with ai-driven simulations. https://www.phfscience.nz/news-publications/the-future-is-now-revolutio nising-decision-making-with-ai-driven-simulations/ , December 2024. Accessed: 2026-05-02
work page 2024
-
[49]
Emergence of human-like polarization among large language model agents
Jinghua Piao, Zhihong Lu, Chen Gao, Fengli Xu, Qinghua Hu, Fernando P Santos, Yong Li, and James Evans. Emergence of human-like polarization among large language model agents. arXiv preprint arXiv:2501.05171, 2025. URLhttps://arxiv.org/abs/2501.05171
-
[50]
Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society.arXiv preprint arXiv:2502.08691, 2025. URL https:/...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Sandboxsocial: A sandbox for social media using multimodal ai agents
Maximilian Puelma Touzel, Sneheel Sarangi, Gayatri Krishnakumar, Busra Tugce Gurbuz, Austin Welch, Zachary Yang, Andreea Musulan, Hao Yu, Ethan Kosak-Hine, Tom Gibbs, Camille Thibault, Reihaneh Rabbany, Jean-François Godbout, Dan Zhao, and Kellin Pelrine. Sandboxsocial: A sandbox for social media using multimodal ai agents. InProceedings of the Thirty-Fou...
-
[52]
Position: Time to close the validation gap in llm social simulations, 2026
Maximilian Puelma Touzel, Sneheel Sarangi, Aurelien Bück-Kaeffer, Zachary Yang, Jean- François Godbout, and Reihaneh Rabbany. Position: Time to close the validation gap in llm social simulations, 2026. URL https://www.complexdatalab.com/stamina/papers/pu elmatouzel_CloseEvalGap.pdf. Preprint
work page 2026
-
[53]
C hat D ev: Communicative Agents for Software Development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186. As...
-
[54]
Benchmarking prompt sensitivity in large language models.arXiv preprint arXiv:2502.06065, 2025
Aryan Razavi, Aref Jafari, Alona Fyshe, and Gholamreza Haffari. Benchmarking prompt sensitivity in large language models.arXiv preprint arXiv:2502.06065, 2025. URL https: //arxiv.org/abs/2502.06065
-
[55]
Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, and Lianhui Qin. Simworld: An open- ended realistic s...
-
[56]
Bases: Large-scale web search user simulation with large language model based agents
Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024. URLhttps://aclanthology.org/2024.findings-emnlp.50/
work page 2024
-
[57]
Emergence of social norms in generative agent societies: Principles and architecture
Siyue Ren, Zhiyao Cui, Ruiqi Song, Zhen Wang, and Shuyue Hu. Emergence of social norms in generative agent societies: Principles and architecture. In Kate Larson, editor,Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 7895–7903. International Joint Conferences on Artificial Intelligence Organizati...
-
[58]
Craig W. Reynolds. Flocks, herds and schools: A distributed behavioral model. InProceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, pages 25–34. Association for Computing Machinery, 1987. doi: 10.1145/37402.37406. URLhttps://dl.acm.org/doi/10.1145/37402.37406
-
[59]
Abel Salinas and Fred Morstatter. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4629–4651, 2024. URL https://aclantholo gy.org/2024.findings-acl.275/
work page 2024
-
[60]
John Wiley & Sons, Chichester, UK, 2008
Andrea Saltelli, Marco Ratto, Terry Andres, Francesca Campolongo, Jessica Cariboni, Debora Gatelli, Michaela Saisana, and Stefano Tarantola.Global Sensitivity Analysis: The Primer. John Wiley & Sons, Chichester, UK, 2008. ISBN 9780470059975. doi: 10.1002/9780470725184. URLhttps://doi.org/10.1002/9780470725184
-
[61]
Robert G. Sargent. Verification and validation of simulation models. InProceedings of the 2010 Winter Simulation Conference, pages 166–183. IEEE, 2010. doi: 10.1109/WSC.2010.5679166. URLhttps://doi.org/10.1109/WSC.2010.5679166
-
[62]
Thomas C. Schelling. Dynamic models of segregation.Journal of Mathematical Sociology, 1 (2):143–186, 1971. doi: 10.1080/0022250X.1971.9989794. URL https://www.tandfonlin e.com/doi/abs/10.1080/0022250X.1971.9989794
-
[63]
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://arxiv.org/abs/2310.11324
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
Uri Simonsohn, Joseph P Simmons, and Leif D Nelson. Specification curve analysis. Nature Human Behaviour, 4(11):1208–1214, 2020. URL https://www.nature.com/articles/s41562-020-0912-z
-
[65]
Sara Steegen, Francis Tuerlinckx, Andrew Gelman, and Wolf Vanpaemel. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5):702–712,
- [66]
-
[67]
Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. Simulating social media using large language models to evaluate alternative news feed algorithms. arXiv preprint arXiv:2310.05984, 2023. URL https://arxiv.org/abs/2310.05984.
-
[68]
Alexander Sasha Vezhnevets, Jayd Matyas, Logan Cross, Davide Paglieri, Minsuk Chang, William A. Cunningham, Simon Osindero, William S. Isaac, and Joel Z. Leibo. Multi-actor generative artificial intelligence as a game engine, 2025. URL https://arxiv.org/abs/2507.08892
-
[69]
Chenxi Wang, Zongfang Liu, Dequan Yang, and Xiuying Chen. Decoding echo chambers: LLM-powered simulations revealing polarization in social networks. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3913–3923, 2025. URL https://aclanthology.org/2025.coling-main.264/
-
[70]
Lei Wang, Heyang Gao, Xiaohe Bo, Xu Chen, and Ji-Rong Wen. YuLan-OneSim: Towards the next generation of social simulator with large language models. In NeurIPS 2025 Workshop on Scientific Methods for Understanding Deep Learning, 2025. URL https://arxiv.org/abs/2505.07581
-
[71]
Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji-Rong Wen. User behavior simulation with large language model-based agents. ACM Transactions on Information Systems, 43(2):1–37, 2025. doi: 10.1145/3708985. URL https://doi.org/10.1145/3708985
-
[72]
Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. Humanoid agents: Platform for simulating human-like generative agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 167–176. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-demo.15. URL https://aclanthology...
-
[73]
Uri Wilensky and William Rand. Making models match: Replicating an agent-based model. Journal of Artificial Societies and Social Simulation, 10(4):2, 2007. URL https://www.jasss.org/10/4/2.html
-
[74]
Richard Willis, Yali Du, and Joel Z Leibo. Will systems of LLM agents lead to cooperation: An investigation into a social dilemma. In 24th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2025, pages 2786–2788. International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2025. URL https://dl.acm.org/doi/10.55...
-
[75]
Paul Windrum, Giorgio Fagiolo, and Alessio Moneta. Empirical validation of agent-based models: Alternatives and prospects. Journal of Artificial Societies and Social Simulation, 10(2):8, 2007. URL https://ideas.repec.org/a/jas/jasssj/2006-40-2.html
-
[76]
Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, and Benyou Wang. Twinmarket: A scalable behavioral and social simulation for financial markets. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), volume 39 of NeurIPS, 2025. URL https://arxiv.org/abs/2502.01506
-
[77]
Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, and Jing Shao. Oasis: Open agent social interaction simulations with one million agent...
-
[78]
Wenzhen Zheng and Xijin Tang. Simulating social network with LLM agents: an analysis of information propagation and echo chambers. In International Symposium on Knowledge and Systems Sciences, pages 63–77. Springer, 2024. URL https://link.springer.com/chapter/10.1007/978-981-96-0178-3_5
-
[79]
Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, and Maarten Sap. The PIMMUR principles: Ensuring validity in collective behavior of LLM societies. arXiv preprint arXiv:2509.18052, 2025. URL https://arxiv.org/abs/2509.18052.
-
[80]
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854
-
[81]
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Zhengyang Qi, Haofei Yu, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=mM7VurbA4r