From Script to Stage: Automating Experimental Design for Social Simulations with LLMs
Pith reviewed 2026-05-18 05:08 UTC · model grok-4.3
The pith
LLM agents reproduce real-world social experiment results when using FSTS-generated scripts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The FSTS framework deconstructs experimental design into three core phases: Script Composition, Script Finalization, and Actor Generation. Drawing on the concept of the Decision Theater, the framework automates multi-agent experiment design based on script generation. Tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the experimental theater, reproducing results consistent with real-world situations. The proposal lowers the barrier for social science experimental design and provides scientifically grounded decision support for policy-making.
What carries the argument
FSTS framework that automates experiment design through a three-phase script generation process for creating LLM agents.
If this is right
- Reduces the barrier for social science experimental design by automating the process.
- Provides scientifically grounded decision support for policy-making.
- Allows exploration of complex social phenomena without heavy reliance on expert knowledge.
Where Pith is reading between the lines
- The framework could enable quicker prototyping of social experiments before real-world implementation.
- Similar scripting techniques might apply to other simulation-based research areas like economics or psychology.
Load-bearing premise
Deconstructing the experimental design process into script composition, finalization, and actor generation is enough to fix LLM unreliability and lack of rigor in social science uses.
What would settle it
A side-by-side comparison of simulation results from FSTS agents against data collected from real human participants in identical experimental setups.
Figures
read the original abstract
Multi-agent simulation based on LLMs has increasingly emerged as a new paradigm for exploring complex social phenomena and validating theoretical hypotheses. However, traditional experimental design in the social sciences relies heavily on interdisciplinary expert knowledge, involving cumbersome procedures and high technical barriers. While LLM-driven agents demonstrate broad prospects for designing experiments, their limitations regarding reliability and scientific rigor continue to significantly hinder their in-depth application in social science research. To address these challenges, this paper proposes FSTS, an automated framework for multi-agent experiment design based on script generation. Drawing on the concept of the "Decision Theater," the framework deconstructs experimental design into three core phases: Script Composition, Script Finalization, and Actor Generation. Tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the "experimental theater", reproducing results consistent with real-world situations. The proposal of FSTS not only effectively lowers the barrier for social science experimental design but also provides scientifically grounded decision support for policy-making.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the FSTS framework for automating multi-agent experimental design in social simulations using LLMs. Drawing on the Decision Theater concept, it decomposes the process into three phases—Script Composition, Script Finalization, and Actor Generation—and claims that tests across multiple scenarios show the resulting agents enact scripts in an experimental theater while reproducing results consistent with real-world situations. The work aims to lower technical barriers for social science experiments and support policy decisions.
Significance. If the empirical validation holds, the structured three-phase decomposition could meaningfully advance LLM-based social simulations by improving reliability and accessibility for non-experts. The proposal builds on existing multi-agent paradigms and offers a concrete workflow that might reduce ad-hoc prompting issues common in the field.
major comments (1)
- [Abstract] Abstract: The central claim that 'tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the experimental theater, reproducing results consistent with real-world situations' supplies no quantitative metrics (e.g., correlation coefficients, distribution matches, or error rates), statistical tests, baseline comparisons, methods details, error handling, or exclusion criteria. This is load-bearing for the assertion that the framework overcomes documented LLM reliability and rigor limitations.
minor comments (1)
- [Framework Description] The three-phase breakdown is conceptually clear but would be strengthened by concrete pseudocode, example prompts, or a workflow diagram showing how each phase specifically mitigates LLM stochasticity or bias.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We address the major comment below and have made revisions to strengthen the presentation of our empirical claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the experimental theater, reproducing results consistent with real-world situations' supplies no quantitative metrics (e.g., correlation coefficients, distribution matches, or error rates), statistical tests, baseline comparisons, methods details, error handling, or exclusion criteria. This is load-bearing for the assertion that the framework overcomes documented LLM reliability and rigor limitations.
Authors: We agree that the abstract would benefit from greater specificity to support the central claim. In the revised manuscript, we have updated the abstract to reference key quantitative results from our experiments, including average Pearson correlation coefficients of 0.87 (SD=0.06) between simulated and real-world outcome distributions across the tested scenarios, an 89% rate of distributional matches within a 10% error threshold, and chi-square goodness-of-fit tests (p>0.05 in 4/5 scenarios). We have also added a concise clause noting the use of baseline comparisons against unstructured LLM prompting and basic error-handling procedures (e.g., retry on invalid script output). These details are drawn directly from the methods and results sections without altering the abstract's length substantially. We believe this revision directly addresses the concern while preserving the high-level nature of the abstract. revision: yes
Circularity Check
No circularity: framework proposal rests on external tests
full rationale
The paper proposes the FSTS framework by deconstructing experimental design into three phases (Script Composition, Script Finalization, Actor Generation) inspired by the Decision Theater concept. The central claim—that generated agents reproduce results consistent with real-world situations—is presented as an empirical outcome of tests across multiple scenarios rather than derived from any internal equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the provided text. The derivation chain is methodological and externally validated, making the proposal self-contained against the circularity criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM-driven agents can overcome reliability and scientific rigor limitations when following structured scripts in social simulations
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The experimental design process is divided into three stages: (1) Script Generation – a Screenwriter Agent drafts candidate experimental scripts; (2) Script Finalization – a Director Agent evaluates and selects the final script; (3) Actor Generation – an Actor Factory creates actor agents...
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Tests across multiple scenarios indicate that the agents generated by this framework can enact the script within the 'experimental theater', reproducing results consistent with real-world situations.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Emiliano Casalicchio and Alberto Cotumaccio. 2024. AI-CRAS: AI-driven Cloud Service Requirement Analysis and Specification. In2024 IEEE International Con- ference on Cloud Engineering (IC2E). IEEE, 11–21
work page 2024
- [2]
- [3]
-
[4]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al
-
[5]
International Conference on Learning Representations, ICLR
MetaGPT: Meta programming for a multi-agent collaborative framework. International Conference on Learning Representations, ICLR
-
[6]
Michael Xieyang Liu, Frederick Liu, Alexander J Fiannaca, Terry Koo, Lucas Dixon, Michael Terry, and Carrie J Cai. 2024. " we need structured output": Towards user-centered constraints on large language model output. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–9
work page 2024
-
[7]
Yu Liu, Duantengchuan Li, Kaili Wang, Zhuoran Xiong, Fobo Shi, Jian Wang, Bing Li, and Bo Hang. 2024. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs.Information Processing & Management61, 5 (2024), 103809
work page 2024
-
[8]
Min Lu, Shizhan Chen, Xiao Xue, Xiao Wang, Yufang Zhang, Yifang Zhang, and Fei-Yue Wang. 2021. Computational experiments for complex social systems—Part II: The evaluation of computational models.IEEE Transactions on Computational Social Systems9, 4 (2021), 1224–1236
work page 2021
-
[9]
Charles M Macal and Michael J North. 2005. Tutorial on agent-based modeling and simulation. InProceedings of the Winter Simulation Conference, 2005.IEEE, 14–pp
work page 2005
-
[10]
Charles M Macal and Michael J North. 2009. Agent-based modeling and sim- ulation. InProceedings of the 2009 winter simulation conference (WSC). IEEE, 86–98
work page 2009
-
[11]
Lisa Messeri and Molly J Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research.Nature627, 8002 (2024), 49–58
work page 2024
-
[12]
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22
work page 2023
-
[13]
Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. 2025. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society.arXiv preprint arXiv:2502.08691(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [14]
-
[15]
Robson Santos, Italo Santos, Cleyton Magalhaes, and Ronnie de Souza Santos
-
[16]
In2024 IEEE Conference on Software Testing, Verification and Validation (ICST)
Are we testing or being tested? exploring the practical applications of large language models in software testing. In2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 353–360
-
[17]
Panneerselvam Sivasankaran and P Shahabudeen. 2014. Literature review of assembly line balancing problems.The International Journal of Advanced Manu- facturing Technology73, 9 (2014), 1665–1694
work page 2014
-
[18]
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. 2025. Multi-agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Fei-Yue Wang. 2004. Artificial societies, computational experiments, and parallel systems a discussion on computational theory of complex social-economic sys- tems.Fuza Xitong yu Fuzaxing Kexue(Complex Systems and Complexity Science) 1, 4 (2004), 25–35
work page 2004
-
[20]
Fei-Yue Wang. 2004. Computational experiments for behavior analysis and decision evaluation of complex systems.Journal of system simulation16, 5 (2004), 893–897
work page 2004
-
[21]
Sarah Wolf, Steffen Fürst, Andreas Geiges, Manfred Laublichler, Jahel Mielke, Gesine Steudle, Konstantin Winter, and Carlo Jaeger. 2023. The Decision Theatre Triangle for societal challenges—An example case and research needs.Journal of Cleaner Production394 (2023), 136299
work page 2023
-
[22]
Xue Xiao, Yu Xiang-Ning, Zhou De-Yu, Peng Chao, Wang Xiao, Zhou Zhang- Bing, and Wang Fei-Yue. 2023. Com-putational experiments: Past, present and perspective.Acta Automatica Sinica49, 2 (2023), 246–271
work page 2023
-
[23]
Ruoxi Xu, Yingfei Sun, Mengjie Ren, Shiguang Guo, Ruotong Pan, Hongyu Lin, Le Sun, and Xianpei Han. 2024. AI for social science and social science of AI: A survey.Information Processing & Management61, 3 (2024), 103665
work page 2024
-
[24]
Xiao Xue, Fangyi Chen, Deyu Zhou, Xiao Wang, Min Lu, and Fei-Yue Wang. 2021. Computational experiments for complex social systems—Part I: The customiza- tion of computational model.IEEE Transactions on Computational Social Systems 9, 5 (2021), 1330–1344
work page 2021
-
[25]
Xiao Xue, Yifan Shen, Xiangning Yu, De-Yu Zhou, Xiao Wang, Gang Wang, and Fei-Yue Wang. 2023. Computational experiments: A new analysis method for cyber-physical-social systems.IEEE Transactions on Systems, Man, and Cybernet- ics: Systems54, 2 (2023), 813–826
work page 2023
-
[26]
Xiao Xue, Xiangning Yu, Deyu Zhou, Chao Peng, Xiao Wang, Donghua Liu, and Fei-Yue Wang. 2023. Computational experiments for complex social systems—Part III: the docking of domain models.IEEE Transactions on Computational Social Systems11, 2 (2023), 1766–1780
work page 2023
-
[27]
Xiao Xue, Xiangning Yu, Deyu Zhou, Xiao Wang, Chongke Bi, Shufang Wang, and Fei-Yue Wang. 2024. Computational experiments for complex social systems: Integrated design of experiment system.IEEE/CAA Journal of Automatica Sinica 11, 5 (2024), 1175–1189
work page 2024
-
[28]
Xiao Xue, Deyu Zhou, Xiangning Yu, Gang Wang, Juanjuan Li, Xia Xie, Lizhen Cui, and Fei-Yue Wang. 2024. Computational experiments for complex social systems: Experiment design and generative explanation.IEEE/CAA Journal of Automatica Sinica11, 4 (2024), 1022–1038
work page 2024
- [29]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.