Evaluating LLMs as Human Surrogates in Controlled Experiments

Adnan Hoq; Tim Weninger

arXiv: 2604.15329 · v1 · submitted 2026-03-08 · cs.HC · cs.AI · cs.CL

Pith reviewed 2026-05-15 14:37 UTC · model grok-4.3

keywords: large language models · human surrogates · behavioral experiments · survey experiments · belief updating · accuracy perception · effect size comparison

The pith

Off-the-shelf LLMs reproduce the direction of human belief updates in experiments but differ in effect magnitudes and moderation patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper checks whether large language models can replace human participants in behavioral experiments by feeding the same survey questions to both groups. Human answers from a study on accuracy perception are turned into structured prompts, and several LLMs generate single 0-10 scores without any extra training. The exact same statistical tests are then run on the model outputs and the original human data. LLMs get the overall direction of changes right in several cases, yet the size of those changes and how other factors moderate them often fail to line up with human results. The work therefore shows the conditions under which ready-made LLMs can serve as behavioral stand-ins.

Core claim

By converting each human observation into a structured prompt and generating a single 0-10 outcome variable without task-specific training, off-the-shelf LLMs reproduce several directional effects observed in humans from a canonical survey experiment on accuracy perception, but effect magnitudes and moderation patterns vary across models.

What carries the argument

Converting individual human observations into structured prompts so LLMs generate matching 0-10 responses, then applying identical statistical analyses to both human and synthetic datasets.
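The observation-to-prompt step can be sketched in Python as follows. This is an illustration only: the field names, the prompt wording, and the helper names `build_prompt` and `parse_rating` are assumptions, since this summary does not give the paper's actual prompt template or data schema.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    # Hypothetical fields; the paper's real schema is not shown in this summary.
    headline: str
    label: str        # e.g. "AI-generated" or "none"
    affiliation: str  # participant's self-reported political affiliation


def build_prompt(obs: Observation) -> str:
    """Render one human observation as a structured prompt (illustrative wording)."""
    return (
        f"Participant political affiliation: {obs.affiliation}\n"
        f"Headline: {obs.headline}\n"
        f"Label shown: {obs.label}\n"
        "On a scale from 0 to 10, how accurate is this headline? "
        "Answer with a single integer."
    )


def parse_rating(reply: str) -> Optional[int]:
    """Return the first integer between 0 and 10 found in a reply, else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 0 <= value <= 10:
            return value
    return None
```

Keeping one prompt per human observation means the synthetic dataset has the same rows and condition labels as the human one, so the same analysis script can run on both without modification.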

If this is right

LLM responses can support inferences about aggregate belief-updating patterns under controlled experimental conditions.
LLM-generated data does not reliably reproduce the size of human effects or the patterns by which other variables moderate them.
Off-the-shelf LLMs can function as behavioral surrogates only within the limits shown by direct statistical comparison.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Different LLMs may need separate validation before use as surrogates, since performance varies by model.
Researchers could run LLM versions of an experiment first to screen designs before recruiting human participants.
Prompt engineering or fine-tuning might narrow the gap to human effect sizes, though the paper does not test this.

Load-bearing premise

Converting each human observation into a structured prompt preserves the original experimental conditions without introducing systematic differences in how the model interprets the task.

What would settle it

An off-the-shelf LLM producing effect magnitudes and moderation patterns that are statistically indistinguishable from the human data in repeated runs of the same accuracy-perception survey would falsify the claim that models do not consistently match human-scale effects.

Figures

Figures reproduced from arXiv: 2604.15329 by Adnan Hoq, Tim Weninger.

**Figure 1.** H2: Exposure effects across humans and LLM surrogates. Bars show partial eta squared (ηp²) from two-way ANOVA models for (left) the main effect of AI credibility exposure and (right) the exposure × political affiliation interaction; asterisks denote F-test significance (* p<.05, ** p<.01, *** p<.001). Humans exhibit a small exposure effect (ηp² ≈ .003) with modest moderation (≈ .002). GPT 5.2 matches the human-sc…
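For readers unfamiliar with the effect-size measure in this caption, partial eta squared is the standard ratio of an effect's sum of squares to that sum plus the error sum of squares, and it can also be recovered from a reported F statistic and its degrees of freedom. The Python sketch below shows both forms; the function names are illustrative and the numbers in the usage note are toy values, not results from the paper.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared from sums of squares: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


def partial_eta_squared_from_f(f_stat: float, df_effect: int, df_error: int) -> float:
    """Equivalent form using the F statistic: F*df1 / (F*df1 + df2)."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)
```

As a toy check, an effect with sum of squares 3 against an error sum of squares of 997 gives 0.003 under the first form, and the matching F of 3 on (1, 997) degrees of freedom gives the same value under the second.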

**Figure 2.** H3: Headline-aware belief updating correspondence. Points compare human ∆ to model ∆ across headline-affiliation-label cells (n = 63). The dashed line is y = x; orange lines are model fits. All models capture the directional structure of belief updating (GPT 5.2 r = .861, Gemma .851, Llama .716), but differ in magnitude calibration: GPT 5.2 is closest to human scale, Gemma compresses shifts, and Llama show…
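The distinction this caption draws between directional agreement and magnitude calibration can be illustrated with two statistics over the per-cell shifts: the Pearson correlation (does the model move in the same direction as humans?) and the slope of a least-squares fit of model ∆ on human ∆ (does it move by the same amount?). The Python sketch below uses only the standard library; the function names are illustrative, and whether the paper's orange fit lines are ordinary least-squares fits is an assumption.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the least-squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx
```

A model whose shifts are exactly half the human shifts in every cell would show a correlation of 1 with a slope of 0.5: perfect directional structure, compressed magnitude. A slope near 1 with points near the y = x line is what human-scale calibration would look like.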

Original abstract

Large language models (LLMs) are increasingly used to simulate human responses in behavioral research, yet it remains unclear when LLM-generated data support the same experimental inferences as human data. We evaluate this by directly comparing off-the-shelf LLM-generated responses with human responses from a canonical survey experiment on accuracy perception. Each human observation is converted into a structured prompt, and models generate a single 0-10 outcome variable without task-specific training; identical statistical analyses are applied to human and synthetic responses. We find that LLMs reproduce several directional effects observed in humans, but effect magnitudes and moderation patterns vary across models. Off-the-shelf LLMs therefore capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects, clarifying when LLM-generated data can function as behavioral surrogates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a direct head-to-head on one canonical experiment by feeding individual human observations into LLM prompts and running identical stats, showing directional matches but inconsistent magnitudes.

The core result is straightforward: off-the-shelf LLMs pick up some of the same directional belief-updating patterns that humans show in an accuracy-perception survey, yet the effect sizes and moderation patterns do not line up reliably across models. That distinction is useful for anyone thinking about substituting synthetic data for human subjects in controlled settings.

What the work actually contributes is the protocol itself: converting each real observation into a structured prompt, collecting a single 0-10 rating with no fine-tuning, and then applying the exact same statistical tests to both datasets. Prior simulation studies usually stop at aggregate comparisons or population-level prompting; this one keeps the mapping one-to-one, which makes the mismatches easier to interpret. The abstract is clear on the takeaway and avoids overclaiming, which is a plus.

The main limitation is the missing operational detail. Sample sizes, exact model versions, power calculations, and the numerical outputs are not reported in the summary, so it is difficult to gauge how large the gaps really are or whether they would survive a different prompt format. The stress-test point about prompt translation is fair: without an ablation or a check that the structured input preserves the original task salience, any match could partly reflect how the model parses the added formatting rather than genuine surrogacy. That said, the design is not circular and rests on external human data, so the empirical comparison still stands as a concrete test case.

This is the kind of paper that belongs in a methods-focused HCI or behavioral-science venue. Readers who run experiments and are weighing LLM data for pilots or power analyses will find the protocol worth trying on their own tasks. It deserves a serious referee round; the central comparison is worth the time even if the authors need to add the quantitative specifics and a prompt-sensitivity check before publication.

Referee Report

1 major / 1 minor

Summary. The paper evaluates off-the-shelf LLMs as human surrogates by converting each observation from a canonical survey experiment on accuracy perception into a structured prompt, generating 0-10 ratings without task-specific training, and applying identical statistical analyses to compare LLM and human responses. It reports that LLMs reproduce several directional effects observed in humans but that effect magnitudes and moderation patterns vary across models, concluding that LLMs capture aggregate belief-updating patterns under controlled conditions but do not consistently match human-scale effects.

Significance. If the central comparisons hold after addressing validation gaps, the work is significant because it supplies empirical head-to-head evidence that clarifies the boundary conditions for using LLM-generated data as behavioral surrogates, which can inform cost-effective scaling of experiments while highlighting risks of mismatched effect sizes.

major comments (1)

[Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.

minor comments (1)

[Abstract] Abstract: Sample sizes, specific model names, exact statistical outputs (effect sizes, confidence intervals), and power calculations are omitted, which reduces immediate verifiability of the reported directional matches and magnitude differences.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We have carefully considered the major comment regarding the methods and provide our response below, along with plans for revision.

Referee: [Methods] Methods section: The claim that LLM responses match human belief-updating rests on the assumption that converting each human observation into a structured prompt preserves experimental conditions without introducing systematic differences in task interpretation (e.g., via added formatting or pre-training priors). No quantitative validation such as prompt ablation, human re-rating of the LLM prompts, or sensitivity checks on prompt structure is reported, leaving this translation step untested despite being load-bearing for the surrogate claim.

Authors: We agree with the referee that the prompt translation step is central to our claims and that additional validation would strengthen the manuscript. The prompts were designed to replicate the exact survey questions and experimental conditions presented to human participants, with no additional task-specific instructions. However, we acknowledge the absence of quantitative validations such as ablations or sensitivity checks in the submitted version. In the revised manuscript, we will add a new subsection to the Methods detailing the prompt construction process, include examples in an appendix, and report sensitivity analyses varying key prompt elements (e.g., formatting and role descriptions) to demonstrate that the main findings are robust. We believe these additions will address the concern without altering the core conclusions. revision: yes
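One way the promised sensitivity analysis could be organised is sketched below in Python: compute the same exposure effect separately for each prompt variant, then report whether the sign is stable and how far the estimates spread. This is a hypothetical outline, not the authors' procedure; the condition names `"exposed"` and `"control"`, the variant names, and the use of a simple mean difference in place of the ANOVA are all assumptions made for brevity.

```python
from statistics import mean
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (condition, 0-10 rating)


def exposure_effect(rows: List[Row]) -> float:
    """Mean rating difference (exposed minus control) for one prompt variant."""
    exposed = [rating for cond, rating in rows if cond == "exposed"]
    control = [rating for cond, rating in rows if cond == "control"]
    return mean(exposed) - mean(control)


def sensitivity_summary(ratings_by_variant: Dict[str, List[Row]]) -> dict:
    """Compare the exposure effect across prompt variants."""
    effects = {name: exposure_effect(rows) for name, rows in ratings_by_variant.items()}
    # Collect the signs of all non-zero effects; one sign means a stable direction.
    signs = {e > 0 for e in effects.values() if e != 0}
    return {
        "effects": effects,
        "same_direction": len(signs) <= 1,
        "spread": max(effects.values()) - min(effects.values()),
    }
```

A stable sign with a large spread would mirror the paper's headline pattern (direction reproduced, magnitude not), while a sign flip across variants would suggest that an apparent match depends on formatting rather than on the experimental manipulation.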

Circularity Check

0 steps flagged

No circularity: empirical head-to-head comparison

The paper conducts a direct empirical evaluation by converting human survey observations into structured prompts for off-the-shelf LLMs, generating responses, and applying identical statistical analyses to compare aggregate patterns against the original human data. No equations, derivations, fitted parameters, or self-referential logic are present in the described method or claims. The central result—that LLMs reproduce directional effects but vary in magnitudes—is obtained by external benchmarking against human data rather than by construction from any internal fit or prior self-citation. This is a standard controlled comparison with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM responses generated from human-derived prompts are comparable to human responses under the same statistical tests.

axioms (1)

Domain assumption: Converting each human observation into a structured prompt accurately represents the original experimental task for the LLM without introducing model-specific interpretation biases.
Invoked in the method of prompt construction from human data.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[3] Suhaib Abdurahman, Mohammad Atari, Farzan Karimi-Malekabadi, Mona J Xue, Jackson Trager, Peter S Park, Preni Golazizian, Ali Omrani, and Morteza Dehghani. 2024. Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7):pgae245

[4] Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. 2023. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pages 337-371. PMLR

[5] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337-351

[6] Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, and 1 others. 2025. LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Comput...

[7] Marcel Binz and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6):e2218523120

[8] James Bisbee, Joshua D Clinton, Cassy Dorff, Brenton Kenkel, and Jennifer M Larson. 2024. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4):401-416

[9] Colin F Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A Nosek, Thomas Pfeiffer, and 1 others. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9):637-644

[10] Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607-15631

[11] Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science, 349(6251):aac4716

[12] Ziyan Cui, Ning Li, and Huaikang Zhou. 2024. Can AI replace human subjects? A large-scale replication of psychological experiments with LLMs. A Large-Scale Replication of Psychological Experiments with LLMs (August 25, 2024)

[13] Ziyan Cui, Ning Li, and Huaikang Zhou. 2025. A large-scale replication of scenario-based experiments in psychology and management using large language models. Nature Computational Science, 5(8):627-634

[14] Ricardo Dominguez-Olmedo, Moritz Hardt, and Celestine Mendler-Dünner. 2024. Questioning the survey responses of large language models. Advances in Neural Information Processing Systems, 37:45850-45878

[15] Aparna Elangovan, Lei Xu, Jongwoo Ko, Mahsa Elyasi, Ling Liu, Sravan Bodapati, and Dan Roth. 2024. Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge. arXiv preprint arXiv:2410.03775

[16] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. 2024. Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1):1-24

[17] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120

[18] John J Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research

[19] Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. 2023. War and peace (WarAgent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227

[20] Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. 2025. This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

[21] Austin C Kozlowski and James Evans. 2025. Simulating subjects: The promise and peril of artificial intelligence stand-ins for social agents and interactions. Sociological Methods & Research, 54(3):1017-1073

[22] Fan Li and Ya Yang. 2024. Impact of artificial intelligence-generated content labels on perceived accuracy, message credibility, and sharing intentions for misinformation: Web-based, randomized, controlled experiment. JMIR Formative Research, 8(1):e60024

[23] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991-52008

[24] Jiaxuan Li, Lang Yu, and Allyson Ettinger. 2023b. Counterfactual reasoning: Testing language models' understanding of hypothetical scenarios. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 804-815

[25] Lisa Messeri and Molly J Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49-58

[26] Marcus R Munafò, Brian A Nosek, Dorothy VM Bishop, Katherine S Button, Christopher D Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J Ware, and John PA Ioannidis. 2017. A manifesto for reproducible science. Nature Human Behaviour, 1(1):0021

[27] Michael Muthukrishna and Joseph Henrich. 2019. A problem in theory. Nature Human Behaviour, 3(3):221-229

[28] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1-22

[29] Jan Pfänder and Sacha Altay. 2025. Spotting false news and doubting true news: a systematic review and meta-analysis of news judgements. Nature Human Behaviour, 9(4):688-699

[30] Paul E Smaldino and Richard McElreath. 2016. The natural selection of bad science. Royal Society Open Science, 3(9)

[31] Petter Törnberg. 2023. ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588

[32] Angelina Wang, Jamie Morgenstern, and John P Dickerson. 2025. Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 7(3):400-411