Evaluating LLMs as Human Surrogates in Controlled Experiments
Pith reviewed 2026-05-15 14:37 UTC · model grok-4.3
The pith
Off-the-shelf LLMs reproduce the direction of human belief updates in experiments but differ in effect magnitudes and moderation patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting each human observation into a structured prompt and generating a single 0-10 outcome variable without task-specific training, off-the-shelf LLMs reproduce several directional effects observed in humans from a canonical survey experiment on accuracy perception, but effect magnitudes and moderation patterns vary across models.
What carries the argument
Converting individual human observations into structured prompts so LLMs generate matching 0-10 responses, then applying identical statistical analyses to both human and synthetic datasets.
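A minimal sketch of that conversion step, assuming hypothetical field names (`age`, `condition`, `stimulus`) and question wording; the paper's actual prompt template and variables are not given on this page. It renders one observation as a structured prompt and parses a 0-10 rating from a free-text model reply.

```python
import re
from typing import Optional


def build_prompt(obs: dict) -> str:
    """Render one human observation as a structured prompt.

    The keys 'age', 'condition', and 'stimulus' are illustrative
    assumptions, not the paper's variable names.
    """
    lines = [
        "Participant profile:",
        f"- Age: {obs['age']}",
        f"- Condition: {obs['condition']}",
        "",
        "Statement:",
        obs["stimulus"],
        "",
        "On a scale from 0 (not at all accurate) to 10 (completely accurate), "
        "how accurate is this statement? Answer with a single integer.",
    ]
    return "\n".join(lines)


def parse_rating(reply: str) -> Optional[int]:
    """Return the first standalone integer in 0..10, or None if there is none."""
    for token in re.findall(r"\b\d+\b", reply):
        value = int(token)
        if 0 <= value <= 10:
            return value
    return None
```

Returning `None` for unparseable replies keeps missing synthetic responses explicit, so they can be counted or dropped before the shared analysis rather than silently coded as a number.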
If this is right
- LLM responses can support inferences about aggregate belief-updating patterns under controlled experimental conditions.
- LLM-generated data does not reliably reproduce the size of human effects or the patterns by which other variables moderate them.
- Off-the-shelf LLMs can function as behavioral surrogates only within the limits shown by direct statistical comparison.
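The distinction between matching direction, matching magnitude, and matching moderation can be made concrete with a small sketch. It assumes a binary treatment indicator and a binary moderator, uses a plain difference in means rather than whatever models the paper fits, and all row keys are hypothetical.

```python
from statistics import mean


def treatment_effect(rows):
    """Difference in mean rating: treated minus control."""
    treated = [r["rating"] for r in rows if r["treated"]]
    control = [r["rating"] for r in rows if not r["treated"]]
    return mean(treated) - mean(control)


def moderation_gap(rows, moderator):
    """Treatment effect where the moderator is true minus where it is false."""
    high = [r for r in rows if r[moderator]]
    low = [r for r in rows if not r[moderator]]
    return treatment_effect(high) - treatment_effect(low)


def compare_effects(human_rows, llm_rows):
    """Summarize whether the LLM effect matches the human one in sign and size."""
    h = treatment_effect(human_rows)
    m = treatment_effect(llm_rows)
    return {
        "human_effect": h,
        "llm_effect": m,
        "same_direction": h * m > 0,
        # Ratio of 1.0 means equal magnitude; undefined if the human effect is zero.
        "magnitude_ratio": m / h if h != 0 else None,
    }
```

Under this framing, a model can score `same_direction` as true while its `magnitude_ratio` is far from 1 or its `moderation_gap` has the opposite sign, which is the pattern the bullets above describe.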
Where Pith is reading between the lines
- Different LLMs may need separate validation before use as surrogates, since performance varies by model.
- Researchers could run LLM versions of an experiment first to screen designs before recruiting human participants.
- Prompt engineering or fine-tuning might narrow the gap to human effect sizes, though the paper does not test this.
Load-bearing premise
Converting each human observation into a structured prompt preserves the original experimental conditions without introducing systematic differences in how the model interprets the task.
What would settle it
An off-the-shelf LLM producing effect magnitudes and moderation patterns that are statistically indistinguishable from the human data in repeated runs of the same accuracy-perception survey would falsify the claim that models do not consistently match human-scale effects.
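One way to probe this is a two-sample z-test on the gap between the human and LLM treatment effects. The sketch below is an illustration, not the paper's analysis: it assumes a simple difference in means, a normal approximation, and independence between the human and synthetic samples (questionable here, since each prompt is built from a human observation). A non-significant gap does not by itself establish that the effects are indistinguishable; an equivalence test with a pre-specified margin would be needed for that.

```python
from math import erf, sqrt
from statistics import mean, variance


def effect_and_se(treated, control):
    """Difference in means and its standard error (unequal variances)."""
    estimate = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return estimate, se


def effect_gap_z(human, llm):
    """z statistic and two-sided p-value for (LLM effect - human effect).

    `human` and `llm` are (treated_ratings, control_ratings) tuples.
    """
    h_est, h_se = effect_and_se(*human)
    m_est, m_se = effect_and_se(*llm)
    z = (m_est - h_est) / sqrt(h_se ** 2 + m_se ** 2)
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value
```

Running this across repeated LLM generations, as the paragraph above suggests, would show whether any gap is stable or an artifact of sampling noise in a single run.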
Original abstract
Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0-10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates off-the-shelf LLMs as human surrogates by converting each observation from a canonical survey experiment on accuracy perception into a structured prompt, generating 0-10 ratings without task-specific training, and applying identical statistical analyses to compare LLM and human responses. It reports that LLMs reproduce several directional effects observed in humans but that effect magnitudes and moderation patterns vary across models, concluding that LLMs capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects.
Significance. If the central comparisons hold after addressing validation gaps, the work is significant because it supplies empirical head-to-head evidence that clarifies the boundary conditions for using LLM-generated data as behavioral surrogates, which can inform cost-effective scaling of experiments while highlighting risks of mismatched effect sizes.
major comments (1)
- [Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.
minor comments (1)
- [Abstract] Abstract: Sample sizes, specific model names, exact statistical outputs (effect sizes, confidence intervals), and power calculations are omitted, which reduces immediate verifiability of the reported directional matches and magnitude differences.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We have carefully considered the major comment regarding the methods and provide our response below, along with plans for revision.
Point-by-point responses
- Referee: [Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.
  Authors: We agree with the referee that the prompt translation step is central to our claims and that additional validation would strengthen the manuscript. The prompts were designed to replicate the exact survey questions and experimental conditions presented to human participants, with no additional task-specific instructions. However, we acknowledge the absence of quantitative validations such as ablations or sensitivity checks in the submitted version. In the revised manuscript, we will add a new subsection to the Methods detailing the prompt construction process, include examples in an appendix, and report sensitivity analyses varying key prompt elements (e.g., formatting and role descriptions) to demonstrate that the main findings are robust. We believe these additions will address the concern without altering the core conclusions.
  Revision: yes
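A sensitivity analysis of the kind proposed in the rebuttal could be organized as a grid over prompt elements. The sketch below is a hypothetical layout, not the authors' procedure: the role descriptions, format templates, and function names are all assumptions. It builds every role-by-format prompt variant and summarizes how far an estimated effect moves across them.

```python
from itertools import product

# Illustrative prompt elements; the real variants would come from the paper's appendix.
ROLE_DESCRIPTIONS = {
    "none": "",
    "participant": "You are a participant in a survey.",
}
FORMATS = {
    "plain": lambda question, statement: f"{statement}\n{question}",
    "labeled": lambda question, statement: f"Statement: {statement}\nQuestion: {question}",
}


def prompt_variants(question, statement):
    """Return {(role_key, format_key): prompt} for every combination."""
    variants = {}
    for (role_key, role), (fmt_key, fmt) in product(
        ROLE_DESCRIPTIONS.items(), FORMATS.items()
    ):
        body = fmt(question, statement)
        variants[(role_key, fmt_key)] = f"{role}\n{body}".strip()
    return variants


def effect_range(effects_by_variant):
    """Smallest and largest estimated effect across prompt variants."""
    values = list(effects_by_variant.values())
    return min(values), max(values)
```

If the range returned by `effect_range` stays on one side of zero and is narrow relative to the human effect, the directional finding would look robust to prompt structure; a range that crosses zero would support the referee's concern.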
Circularity Check
No circularity: empirical head-to-head comparison
The paper conducts a direct empirical evaluation by converting human survey observations into structured prompts for off-the-shelf LLMs, generating responses, and applying identical statistical analyses to compare aggregate patterns against the original human data. No equations, derivations, fitted parameters, or self-referential logic are present in the described method or claims. The central result—that LLMs reproduce directional effects but vary in magnitudes—is obtained by external benchmarking against human data rather than by construction from any internal fit or prior self-citation. This is a standard controlled comparison with no load-bearing reductions to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Converting each human observation into a structured prompt accurately represents the original experimental task for the LLM without introducing model-specific interpretation biases.