
arXiv: 2604.15329 · v1 · submitted 2026-03-08 · 💻 cs.HC · cs.AI · cs.CL

Evaluating LLMs as Human Surrogates in Controlled Experiments

Pith reviewed 2026-05-15 14:37 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CL
keywords large language models · human surrogates · behavioral experiments · survey experiments · belief updating · accuracy perception · effect size comparison

The pith

Off-the-shelf LLMs reproduce the direction of human belief updates in experiments but differ in effect magnitudes and moderation patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper checks whether large language models can replace human participants in behavioral experiments by feeding the same survey questions to both groups. Human answers from a study on accuracy perception are turned into structured prompts, and several LLMs generate single 0-10 scores without any extra training. The exact same statistical tests are then run on the model outputs and the original human data. LLMs get the overall direction of changes right in several cases, yet the size of those changes and how other factors moderate them often fail to line up with human results. The work therefore shows the conditions under which ready-made LLMs can serve as behavioral stand-ins.

Core claim

By converting each human observation into a structured prompt and generating a single 0-10 outcome variable without task-specific training, off-the-shelf LLMs reproduce several directional effects observed in humans from a canonical survey experiment on accuracy perception, but effect magnitudes and moderation patterns vary across models.
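
The conversion step described above can be sketched as follows. This is a minimal illustration, not the paper's template: the field names (`affiliation`, `headline`, `label`), the prompt wording, and both function names are assumptions made for the example, and no model call is included.

```python
import re


def build_prompt(obs):
    """Render one human observation as a structured prompt.

    The keys used here are hypothetical; the paper's actual prompt
    template is not reproduced on this page.
    """
    return (
        f"Participant profile: political affiliation = {obs['affiliation']}\n"
        f"Headline: {obs['headline']}\n"
        f"Label shown: {obs['label']}\n"
        "Question: How accurate is this headline? "
        "Answer with a single integer from 0 to 10."
    )


def parse_rating(reply):
    """Return the first standalone integer in 0..10 found in a reply, else None.

    Rejects digits that are part of a longer number, a decimal, or a
    negative value, so out-of-range replies are dropped rather than clipped.
    """
    match = re.search(r"(?<![\w.\-])(10|[0-9])(?!\w|\.\d)", reply)
    return int(match.group(1)) if match else None
```

Returning `None` for unparseable replies keeps invalid generations visible as missing data instead of silently coercing them onto the scale.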

What carries the argument

Converting individual human observations into structured prompts so LLMs generate matching 0-10 responses, then applying identical statistical analyses to both human and synthetic datasets.
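
To make "identical statistical analyses" concrete, the sketch below runs one analysis function on both datasets and separates direction from magnitude. It is a simplified stand-in: a difference in means replaces the paper's ANOVA models, and the row layout (`rating`, `exposed`) is a hypothetical schema.

```python
from statistics import mean


def exposure_effect(rows):
    """Mean rating under exposure minus mean rating under control."""
    exposed = [r["rating"] for r in rows if r["exposed"]]
    control = [r["rating"] for r in rows if not r["exposed"]]
    return mean(exposed) - mean(control)


def compare(human_rows, model_rows):
    """Apply the same analysis to both datasets and compare the results."""
    h = exposure_effect(human_rows)
    m = exposure_effect(model_rows)
    return {
        "human_effect": h,
        "model_effect": m,
        # Direction matches only if both effects are nonzero with the same sign.
        "same_direction": h * m > 0,
        # Ratio below 1 means the model understates the human effect.
        "magnitude_ratio": m / h if h != 0 else None,
    }
```

A model can pass the direction check while the magnitude ratio is far from 1, which is the pattern the paper reports.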

If this is right

  • LLM responses can support inferences about aggregate belief-updating patterns under controlled experimental conditions.
  • LLM-generated data does not reliably reproduce the size of human effects or the patterns by which other variables moderate them.
  • Off-the-shelf LLMs can function as behavioral surrogates only within the limits shown by direct statistical comparison.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Different LLMs may need separate validation before use as surrogates, since performance varies by model.
  • Researchers could run LLM versions of an experiment first to screen designs before recruiting human participants.
  • Prompt engineering or fine-tuning might narrow the gap to human effect sizes, though the paper does not test this.

Load-bearing premise

Converting each human observation into a structured prompt preserves the original experimental conditions without introducing systematic differences in how the model interprets the task.

What would settle it

An off-the-shelf LLM producing effect magnitudes and moderation patterns that are statistically indistinguishable from the human data in repeated runs of the same accuracy-perception survey would falsify the claim that models do not consistently match human-scale effects.

Figures

Figures reproduced from arXiv: 2604.15329 by Adnan Hoq, Tim Weninger.

Figure 1. H2: Exposure effects across humans and LLM surrogates. Bars show partial $\eta_p^2$ from two-way ANOVA models for (left) the main effect of AI credibility exposure and (right) the exposure × political affiliation interaction; asterisks denote F-test significance (* p<.05, ** p<.01, *** p<.001). Humans exhibit a small exposure effect ($\eta_p^2$ ≈ .003) with modest moderation (≈ .002). GPT 5.2 matches the human-sc…
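
For readers unfamiliar with the effect size in Figure 1, partial eta squared is conventionally defined as SS_effect / (SS_effect + SS_error). The sketch below computes it by hand for a two-way design. It assumes a balanced, fully crossed design (equal n in every cell), which may not hold for the paper's data; the function name and the toy layout of `cells` are illustrative, and the paper's own ANOVA software and sums-of-squares type are not stated on this page.

```python
from statistics import mean


def partial_eta_squared(cells):
    """Partial eta squared for a balanced two-way design.

    `cells` maps (exposure_level, affiliation_level) -> list of ratings.
    Every combination of levels must be present with the same n.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("balanced design required")

    grand = mean(y for v in cells.values() for y in v)
    a_means = {a: mean(y for b in b_levels for y in cells[(a, b)]) for a in a_levels}
    b_means = {b: mean(y for a in a_levels for y in cells[(a, b)]) for b in b_levels}

    # Sums of squares for main effects, interaction, and within-cell error.
    ss_a = n * len(b_levels) * sum((m - grand) ** 2 for m in a_means.values())
    ss_b = n * len(a_levels) * sum((m - grand) ** 2 for m in b_means.values())
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_ab = ss_cells - ss_a - ss_b
    ss_err = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)

    return {
        "exposure": ss_a / (ss_a + ss_err),
        "interaction": ss_ab / (ss_ab + ss_err),
    }
```

Because the same function can be applied to human and model ratings, the two bars per panel in Figure 1 correspond to the `exposure` and `interaction` entries computed once per data source.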
Figure 2. H3: Headline-aware belief updating correspondence. Points compare human ∆ to model ∆ across headline–affiliation–label cells (n = 63). The dashed line is y = x; orange lines are model fits. All models capture the directional structure of belief updating (GPT 5.2 r = .861, Gemma .851, Llama .716), but differ in magnitude calibration: GPT 5.2 is closest to human scale, Gemma compresses shifts, and Llama show…
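
The two quantities Figure 2 distinguishes, directional correspondence and magnitude calibration, can be sketched as a Pearson correlation and a least-squares slope of model ∆ on human ∆. This is an illustrative reading of the figure, not the authors' fitting code; the function name is an assumption.

```python
from math import sqrt


def pearson_and_slope(human_delta, model_delta):
    """Return (r, slope) for model shifts regressed on human shifts.

    r close to 1 indicates shared directional structure; a slope below 1
    indicates compressed shifts and a slope above 1 exaggerated shifts,
    relative to the y = x line.
    """
    n = len(human_delta)
    if n != len(model_delta) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(human_delta) / n
    my = sum(model_delta) / n
    sxx = sum((x - mx) ** 2 for x in human_delta)
    syy = sum((y - my) ** 2 for y in model_delta)
    sxy = sum((x - mx) * (y - my) for x, y in zip(human_delta, model_delta))
    return sxy / sqrt(sxx * syy), sxy / sxx
```

A model can therefore score a high r while still sitting well off the dashed y = x line, which is why the caption reports correlation and calibration separately.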
Original abstract

Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0--10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper evaluates off-the-shelf LLMs as human surrogates by converting each observation from a canonical survey experiment on accuracy perception into a structured prompt, generating 0-10 ratings without task-specific training, and applying identical statistical analyses to compare LLM and human responses. It reports that LLMs reproduce several directional effects observed in humans but that effect magnitudes and moderation patterns vary across models, concluding that LLMs capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects.

Significance. If the central comparisons hold after addressing validation gaps, the work is significant because it supplies empirical head-to-head evidence that clarifies the boundary conditions for using LLM-generated data as behavioral surrogates, which can inform cost-effective scaling of experiments while highlighting risks of mismatched effect sizes.

major comments (1)
  1. [Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.
minor comments (1)
  1. [Abstract] Abstract: Sample sizes, specific model names, exact statistical outputs (effect sizes, confidence intervals), and power calculations are omitted, which reduces immediate verifiability of the reported directional matches and magnitude differences.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We have carefully considered the major comment regarding the methods and provide our response below, along with plans for revision.

Point-by-point responses
  1. Referee: [Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.

    Authors: We agree with the referee that the prompt translation step is central to our claims and that additional validation would strengthen the manuscript. The prompts were designed to replicate the exact survey questions and experimental conditions presented to human participants, with no additional task-specific instructions. However, we acknowledge the absence of quantitative validations such as ablations or sensitivity checks in the submitted version. In the revised manuscript, we will add a new subsection to the Methods detailing the prompt construction process, include examples in an appendix, and report sensitivity analyses varying key prompt elements (e.g., formatting and role descriptions) to demonstrate that the main findings are robust. We believe these additions will address the concern without altering the core conclusions.

    revision: yes
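
A sensitivity analysis of the kind promised in the rebuttal could start from a grid of prompt variants like the one sketched here. The two factors (formatting and role description) come from the rebuttal text; the specific templates, role strings, and names below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical variants of the two prompt elements named in the rebuttal.
FORMATS = {
    "plain": "{role}\n{question}",
    "tagged": "[ROLE] {role}\n[QUESTION] {question}",
}
ROLES = {
    "none": "",
    "participant": "You are a survey participant.",
}


def prompt_variants(question):
    """Build one prompt per (format, role) combination."""
    variants = {}
    for (fmt_name, template), (role_name, role) in product(FORMATS.items(), ROLES.items()):
        text = template.format(role=role, question=question).strip()
        variants[(fmt_name, role_name)] = text
    return variants
```

Each variant would then be run through the same generation and analysis pipeline, and the main findings count as robust only if the effect estimates stay stable across the grid.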

Circularity Check

0 steps flagged

No circularity: empirical head-to-head comparison


The paper conducts a direct empirical evaluation by converting human survey observations into structured prompts for off-the-shelf LLMs, generating responses, and applying identical statistical analyses to compare aggregate patterns against the original human data. No equations, derivations, fitted parameters, or self-referential logic are present in the described method or claims. The central result—that LLMs reproduce directional effects but vary in magnitudes—is obtained by external benchmarking against human data rather than by construction from any internal fit or prior self-citation. This is a standard controlled comparison with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM responses generated from human-derived prompts are comparable to human responses under the same statistical tests.

axioms (1)
  • domain assumption: Converting each human observation into a structured prompt accurately represents the original experimental task for the LLM without introducing model-specific interpretation biases.
    Invoked in the method of prompt construction from human data.

pith-pipeline@v0.9.0 · 5427 in / 1114 out tokens · 57702 ms · 2026-05-15T14:37:37.278619+00:00 · methodology

