pith · machine review for the scientific record

arxiv: 2605.00197 · v1 · submitted 2026-04-30 · 💻 cs.MA · cs.AI

Recognition: unknown

The Silicon Society Cookbook: Design Space of LLM-based Social Simulations

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:43 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords: LLM social simulation · agent-based modeling · design space analysis · base model impact · survey proxy · silicon society · parameter interactions

The pith

The choice of base LLM dominates outcomes in LLM-based social simulations, while design parameters interact in non-additive ways.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps how different choices in building simulated social networks with large language models affect what the agents end up believing and how they interact. It varies the base model and the way agents are linked, then uses survey answers as a stand-in for measuring opinions. The results show the design space does not behave like a simple grid: some factors add up cleanly, but others combine in more tangled ways, and swapping the underlying LLM produces the largest shifts. This matters because people are already running such simulations outside research labs, so clearer guidance on which settings matter most can reduce wasted effort and improve how believable the outputs are.

Core claim

Using surveys as a proxy for agent opinions, our findings suggest that the geometry of the design space is non-trivial, with some parameters behaving in additive ways while others display more complex interactions. In particular, the choice of the base LLM is the most important variable impacting the simulation outcomes.

What carries the argument

Systematic variation of base LLM and network-connection parameters, measured through repeated survey responses collected from the agents.
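The sweep described here can be pictured as a full factorial loop over design choices, with survey administration as the only readout. The following is an editorial sketch, not the authors' code: the model names are placeholders, and `administer_survey` is a toy stand-in for actually prompting the agents.

```python
import itertools
import statistics

# Editorial sketch (not the authors' code) of a full factorial sweep:
# each design point pairs a base model with network-connection settings,
# and outcomes are read off via repeated surveys rather than full logs.
BASE_MODELS = ["model-a", "model-b", "model-c"]   # hypothetical names
TOPOLOGIES = ["erdos-renyi", "scale-free"]
AVG_DEGREES = [4, 8]

def administer_survey(model, topology, degree, repeats=5):
    """Toy stand-in for polling each agent with a fixed survey battery.
    A real harness would prompt the LLM agents; this returns a
    deterministic dummy score so the sweep is runnable end to end."""
    score = (hash((model, topology, degree)) % 100) / 100.0
    return [score] * repeats

def sweep():
    """Mean survey score for every cell of the factorial grid."""
    return {
        (model, topo, deg): statistics.mean(administer_survey(model, topo, deg))
        for model, topo, deg in itertools.product(BASE_MODELS, TOPOLOGIES, AVG_DEGREES)
    }
```

The grid here has 3 × 2 × 2 = 12 cells; the paper's point is that variation along the first axis (base model) dominates the other two.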

If this is right

  • Researchers can obtain most of the outcome variation by changing only the base model rather than exhaustively tuning every network detail.
  • Some parameter pairs can be adjusted independently because their effects add; others must be co-tuned because they interact.
  • Validation efforts for realism should prioritize testing across multiple base LLMs before claiming general results.
  • Existing LLM social simulations may need re-evaluation if their reported behaviors are tied to a single model choice.
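The additive-versus-interacting distinction in the bullets above is the standard 2×2 factorial interaction contrast; a minimal sketch with invented outcome numbers (not the paper's data):

```python
def interaction_contrast(y00, y01, y10, y11):
    """2x2 factorial interaction: how much the effect of factor B changes
    when factor A flips. Zero means the two factors combine additively."""
    return (y11 - y10) - (y01 - y00)

# Invented outcomes: factor A = base model, factor B = connectivity.
additive = interaction_contrast(0.40, 0.50, 0.60, 0.70)    # 0.0: effects add
entangled = interaction_contrast(0.40, 0.50, 0.60, 0.90)   # 0.2: must co-tune
```

A contrast near zero licenses tuning the two parameters independently; a large one means a grid over their joint settings is unavoidable.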

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dominance of base-LLM choice suggests that progress in general-purpose models will automatically improve simulation quality more than refinements in network topology.
  • Builders of large-scale social sims could develop lightweight model-selection protocols that test a few candidate LLMs on small survey batteries before full deployment.
  • The non-additive interactions imply that open-source simulation toolkits should include automated design-space search rather than simple grid sweeps.
  • If the survey proxy holds only for certain topics, the same framework could be extended to measure other outputs such as polarization or information spread.
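The model-selection protocol floated above could be as simple as scoring each candidate model's survey-answer distribution against a reference and keeping the closest match. A hypothetical sketch (the model names, distributions, and the L1 criterion are all assumptions, not from the paper):

```python
def l1_distance(p, q):
    """Summed absolute difference between two answer distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def select_model(candidates, reference):
    """Pick the candidate whose survey answers sit closest to a reference
    answer distribution. candidates: {name: distribution over options}."""
    return min(candidates, key=lambda m: l1_distance(candidates[m], reference))

# Invented reference and candidate distributions over three survey options.
reference = [0.2, 0.5, 0.3]
candidates = {
    "model-a": [0.1, 0.6, 0.3],   # distance 0.2
    "model-b": [0.4, 0.3, 0.3],   # distance 0.4
}
best = select_model(candidates, reference)
```

Because base-model choice dominates outcomes, even a cheap screen like this before full deployment should recover most of the attainable fidelity.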

Load-bearing premise

Survey answers given by the LLM agents faithfully stand in for the opinions and interaction patterns that would appear in the full running simulation.

What would settle it

Re-running the identical design sweeps but replacing survey questions with direct logs of agent-to-agent messages or emergent group behaviors and finding that the ranking of which parameter matters most reverses or flattens.
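The proposed test reduces to comparing factor rankings under the two readouts. A toy sketch with invented effect sizes, in which the ranking does flip:

```python
def rank_factors(effects):
    """Order design factors by descending effect size (ties by name)."""
    return sorted(effects, key=lambda k: (-effects[k], k))

# Invented effect sizes for illustration only.
survey_ranking = rank_factors({"base_llm": 0.85, "topology": 0.10, "degree": 0.05})
log_ranking = rank_factors({"base_llm": 0.30, "topology": 0.45, "degree": 0.25})

# If the orderings disagree, the survey proxy's ranking claim is undermined.
proxy_ranking_holds = survey_ranking == log_ranking   # False in this toy case
```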

read the original abstract

Studies attempting to simulate human behavior with Silicon Societies grow in numbers while LLM-only social networks have started appearing outside of controlled settings. However, the design space of these networks remains under-studied, which contributes to a gap in validating model realism. To enable future works to make more informed design decisions, we perform a systematic analysis of the consequences and interactions of key design choices in simulated social networks, including the choice of base model used to model individual agents, and how they are connected to each other. Using surveys as a proxy for agent opinions, our findings suggest that the geometry of the design space is non-trivial, with some parameters behaving in additive ways while others display more complex interactions. In particular, the choice of the base LLM is the most important variable impacting the simulation outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic analysis of the design space for LLM-based social simulations (termed 'Silicon Societies'), focusing on parameters such as the choice of base LLM and agent connectivity structures. Using survey responses collected from LLM agents as a proxy for opinions and interaction dynamics, the authors conclude that the design space geometry is non-trivial, with some parameters exhibiting additive effects and others more complex interactions, and that the base LLM is the dominant variable influencing simulation outcomes.

Significance. If the survey-proxy assumption holds and is validated against full simulation runs, the work would offer practical guidance for designing more realistic and reproducible LLM social simulations, addressing a noted gap in model validation. It could help future studies avoid arbitrary design choices and improve fidelity to human social networks, particularly by emphasizing base-model selection.

major comments (2)
  1. [Abstract and Results] The central claims about non-trivial design-space geometry and base-LLM dominance rest entirely on treating survey responses as a faithful proxy for agent opinions and emergent network dynamics. No quantitative validation (e.g., correlation coefficients, ablation studies, or direct comparison of survey metrics to full simulation outcomes such as opinion convergence, polarization, or network structure) is reported, leaving open the possibility that the proxy diverges from actual interaction behaviors due to missing conversational context or non-linear emergence.
  2. [Methodology and Results] The abstract states clear directional findings yet supplies no quantitative results, error bars, exclusion criteria, or statistical tests for the survey comparisons. This absence makes it impossible to assess the magnitude or reliability of the reported additive vs. complex interactions or the ranking of variable importance.
minor comments (2)
  1. [Methodology] The manuscript should include explicit details on survey question design, prompting regimes, and how responses are aggregated to serve as proxies, to allow replication and assessment of the proxy's validity.
  2. [Results] Figures or tables summarizing parameter interactions would benefit from clearer labeling of additive vs. non-additive effects and inclusion of confidence intervals.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed comments, which have prompted us to strengthen the presentation of our methodology and results. We address each major comment point by point below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Results] The central claims about non-trivial design-space geometry and base-LLM dominance rest entirely on treating survey responses as a faithful proxy for agent opinions and emergent network dynamics. No quantitative validation (e.g., correlation coefficients, ablation studies, or direct comparison of survey metrics to full simulation outcomes such as opinion convergence, polarization, or network structure) is reported, leaving open the possibility that the proxy diverges from actual interaction behaviors due to missing conversational context or non-linear emergence.

    Authors: We acknowledge that the survey-based proxy is central to our analysis and that direct quantitative validation against full simulation runs was not performed. This design choice enabled a broad, systematic sweep of the design space at feasible computational cost; full multi-turn simulations for every parameter combination would have been prohibitive. In the revised manuscript we have added an expanded justification for the proxy (drawing on prior LLM-agent survey literature), a dedicated limitations subsection discussing risks of divergence due to missing conversational context, and preliminary correlation checks on a small held-out set of full simulations. We have not, however, been able to conduct exhaustive ablation studies across the entire design space. revision: partial

  2. Referee: [Methodology and Results] The abstract states clear directional findings yet supplies no quantitative results, error bars, exclusion criteria, or statistical tests for the survey comparisons. This absence makes it impossible to assess the magnitude or reliability of the reported additive vs. complex interactions or the ranking of variable importance.

    Authors: We agree that the original abstract and results presentation were too qualitative. The revised manuscript now includes quantitative metrics (e.g., variance explained by each factor), error bars derived from repeated survey administrations, explicit exclusion criteria for low-quality responses, and statistical tests (ANOVA and post-hoc comparisons) for assessing variable importance and interaction effects. The abstract has been updated to report the dominant role of base-LLM choice together with the key quantitative finding on variance explained. revision: yes

standing simulated objections not resolved
  • Comprehensive quantitative validation of the survey proxy via full simulation runs and direct comparison to emergent metrics (opinion convergence, polarization, network structure) across all design-parameter combinations, which would require computational resources substantially beyond the scope of the present study.
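The rebuttal's "variance explained by each factor" presumably refers to something like eta-squared from a one-way ANOVA decomposition; a minimal sketch with invented survey scores (not the paper's data):

```python
import statistics

def eta_squared(groups):
    """Share of total variance explained by a factor (eta^2):
    between-group sum of squares over total sum of squares.
    groups: one list of outcome scores per factor level."""
    all_vals = [v for g in groups for v in g]
    grand = statistics.mean(all_vals)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    return ss_between / ss_total

# Invented survey scores grouped by base model: large gaps between models
# relative to within-model spread give eta^2 near 1, the shape of result
# that would support "base-LLM choice dominates".
by_model = [[0.30, 0.32, 0.31], [0.60, 0.61, 0.59], [0.80, 0.82, 0.81]]
```

Reporting this statistic per factor, alongside the interaction terms, is what would make the claimed importance ranking checkable.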

Circularity Check

0 steps flagged

No significant circularity; empirical analysis with no self-referential derivations

full rationale

The paper conducts an empirical study of LLM social simulation design choices, reporting observed patterns in survey responses used as a proxy for agent opinions. No equations, fitted parameters, predictions derived from subsets of data, or mathematical derivations are present. The central claims about design space geometry and variable importance follow directly from the survey data comparisons rather than reducing to self-definitions, self-citations, or ansatzes by construction. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical design-space study; no mathematical axioms, free parameters, or new postulated entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5529 in / 998 out tokens · 37147 ms · 2026-05-09T19:43:08.443354+00:00 · methodology

discussion (0)

